Fine, yes, Opteron extends the x86 architecture to true 64-bit operation in a reasonably elegant manner, considering all the quirks of the 20-year-old platform and its zillion patch-ups over time. And it still keeps full native 32-bit performance along the way.
But what about something that is equally important in both 32-bit and 64-bit scenarios: the memory and I/O hierarchy? Is there a genuine threat for Xeon there, and does the launch of a XeonDP with 1 MB of L3 cache surprise us, then?
If you look within the CPUs, "Prestonia" Xeon has 512 KB of L2 cache coupled with a 16 KB L1 data cache plus trace cache, just like its mirror image, the Northwood Pentium 4. Opteron has a larger 1 MB L2 cache, coupled with 2 x 64 KB L1 caches (instruction and data).
Outside the CPU, current XeonDP processors rely on a 533 MHz (133 MHz quad-pumped) shared FSB, while quad-capable XeonMP processors use an even slower 400 MHz (100 MHz quad-pumped) FSB. In the first case, you get 4.26 GB/s of theoretical FSB bandwidth shared between two CPUs, while in the other case, you have only 3.2 GB/s of theoretical FSB bandwidth shared between four hungry processors! Contention and congestion are two obvious words that come to mind, especially in memory-hungry server and workstation applications.
On the other hand, following the Alpha EV7/EV8 design, each Opteron CPU has its own dual-channel DDR333 bus (which supposedly could be made to support DDR400 as well) with 5.3 GB/s of bandwidth, so every additional CPU brings an additional 5.3 GB/s of memory bandwidth, too. The HyperTransport links between the CPUs are, at 6.4 GB/s, somewhat faster than the memory links themselves - they had better be, to avoid remote memory and cache coherency penalties.
So, currently, a dual XeonDP design would have 4.3 GB/s of memory bandwidth, while a dual Opteron would top it with 10.6 GB/s of total memory bandwidth, quite an advantage on paper. A quad XeonMP system would have to contend with a 3.2 GB/s FSB pipe towards its memory, while a quad Opteron would have four dual-lane memory highways, reaching a total of 21.2 GB/s.
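For readers who want to check the arithmetic, here is a rough sketch of where these paper figures come from: peak bandwidth is simply transfers per second times bytes per transfer. The 8-byte (64-bit) bus and channel widths, the function name and the variable names are assumptions made for illustration, and the results match the numbers quoted above only to within rounding (the totals above multiply the already-rounded 5.3 GB/s figure).

```python
# Theoretical peak bandwidth = transfers per second x bytes per transfer.
# Assumptions (not stated in the article): 64-bit (8-byte) FSB data path
# and 64-bit (8-byte) DDR channels; 1 GB = 10**9 bytes.

FSB_BYTES = 8
DDR_BYTES = 8


def peak_gb_per_s(mega_transfers_per_s, bytes_per_transfer, channels=1):
    """Return theoretical peak bandwidth in GB/s."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# Shared front-side buses (nominal quad-pumped transfer rates).
xeon_dp_fsb = peak_gb_per_s(533, FSB_BYTES)   # shared by two CPUs
xeon_mp_fsb = peak_gb_per_s(400, FSB_BYTES)   # shared by four CPUs

# Each Opteron brings its own dual-channel DDR333 controller.
opteron_per_cpu = peak_gb_per_s(333, DDR_BYTES, channels=2)

systems = {
    "dual XeonDP": (xeon_dp_fsb, 2),
    "quad XeonMP": (xeon_mp_fsb, 4),
    "dual Opteron": (opteron_per_cpu * 2, 2),
    "quad Opteron": (opteron_per_cpu * 4, 4),
}

for name, (total, cpus) in systems.items():
    print(f"{name:13s} total {total:6.2f} GB/s, per CPU {total / cpus:5.2f} GB/s")
```

The per-CPU column is the telling one: on the shared FSB it shrinks as processors are added, while on Opteron it stays constant.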
So, what to do on the Xeon side? Well, for XeonDP, Intel brought in the Gallatin XeonMP core with a 1 MB L3 cache, to alleviate the impact of the shared FSB in cache-friendly applications. It does help somewhat - right now I am benchmarking it against its Prestonia cousin.
What lies beyond? Well, an obvious short-term fix for Intel is to get an 800 MHz FSB working on XeonDP at least, as that would bring the bandwidth up by half and, with a suitable low-latency chipset like Canterwood-ES, noticeably improve memory performance overall. The faster each CPU completes each memory access, the less it hogs the FSB, and the less chance there is for it to stall the other CPU trying to access the same bus.
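The "up by half" claim is easy to sanity-check with the same back-of-envelope formula. This is only a sketch: the 8-byte FSB width and the helper name are assumptions, and real-world gains would depend on the chipset and memory behind the bus.

```python
# Compare the current 533 MHz FSB with a hypothetical 800 MHz one.
# Assumption: 64-bit (8-byte) FSB data path; 1 GB = 10**9 bytes.

FSB_BYTES = 8


def fsb_peak_gb_per_s(mega_transfers_per_s):
    """Return theoretical peak FSB bandwidth in GB/s."""
    return mega_transfers_per_s * 1e6 * FSB_BYTES / 1e9


current = fsb_peak_gb_per_s(533)
proposed = fsb_peak_gb_per_s(800)
uplift = proposed / current - 1          # fractional increase
per_cpu_dual = proposed / 2              # share per CPU in a dual system

print(f"{current:.2f} -> {proposed:.2f} GB/s, uplift {uplift:.0%}, "
      f"{per_cpu_dual:.2f} GB/s per CPU in a dual box")
```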
In the long run - well, since Mr Tanglewood is now in La Intella fixing the good ship Itanic, maybe some of the old-new, Alpha-inspired, superb memory interface technology can trickle down to the ubiquitous IA32 (no, I didn't say IA32-64) platform? µ