See Pic of eight way POWER5 with 144MB cache revealed.
Since many have wondered why I talked about 144 MB of cache on that cute 95x95 mm MCM, let's explain a bit more about the POWER5, drawing both on IBM sources and on yours truly's own crystal ball sightings.
The processor
While not as big as a 450 mm² Itanium2 Madison, the POWER5 is nevertheless a large chip for its 0.13µ (micron) CopperSOI 8-layer process. There are over 276 million transistors on a 389 mm² die, with a whopping 5,400 or so active pins (2313 signal, 3057 power).
All these transistors house two fully independent cores, each of them 2-way multithreaded as well, for 4 logical CPUs per chip. Each core has 120 rename registers each for integer and for FP, 8 execution units, an issue throughput of up to 5 instructions/cycle and 4 FLOPs/cycle, a 64 KB I-cache (2-way set associative) and a 32 KB D-cache (4-way set associative). The two cores share an ultrafast three-bank L2 cache of 1.92 MB (3 x 640 KB caches with independent buses, 10-way set associative). Its precise bandwidth figure is not known, but it should be well above 200 GB/s in total. The cores also have enhanced data stream prefetching to make better use of that bandwidth.
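To put the per-chip numbers side by side, here is a minimal sketch that simply multiplies out the figures quoted above. The 2 GHz clock is an assumption used for illustration only (it is the round number used later in this piece), and the variable names are made up for this example.

```python
# Figures quoted in the article for a single POWER5 chip.
CORES_PER_CHIP = 2
THREADS_PER_CORE = 2
FLOPS_PER_CYCLE_PER_CORE = 4
L2_BANKS = 3
L2_BANK_KB = 640

# Assumption: 2 GHz is a round illustrative clock, not a confirmed speed.
ASSUMED_CLOCK_GHZ = 2.0

logical_cpus = CORES_PER_CHIP * THREADS_PER_CORE
# Decimal megabytes (1 MB = 1000 KB), which is how 3 x 640 KB becomes 1.92 MB.
l2_total_mb = L2_BANKS * L2_BANK_KB / 1000
peak_gflops_per_core = FLOPS_PER_CYCLE_PER_CORE * ASSUMED_CLOCK_GHZ
peak_gflops_per_chip = peak_gflops_per_core * CORES_PER_CHIP

print(f"logical CPUs per chip : {logical_cpus}")
print(f"shared L2 cache       : {l2_total_mb:.2f} MB")
print(f"peak GFLOPs per core  : {peak_gflops_per_core:.1f}")
print(f"peak GFLOPs per chip  : {peak_gflops_per_chip:.1f}")
```

These are theoretical peaks; sustained Linpack figures would be lower.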
As many of our industry friends and users know, POWER4 is not as simple and straightforward as, say, Alpha, in terms of orthogonal execution - it has more limitations on the instruction issuing order, as it issues them in bundles (Oh, didn't I see this word in Itanium manuals too?). POWER5 tries to address many of the situations where a bundle's execution couldn't proceed because of some resource conflict or dependency. Its larger rename register pool will also help it achieve higher Linpack Rmax GFLOPs ratings, critical for the TOP500 supercomputer list.
And no, there is no Altivec (yet). Obviously, adding, say, a wider 256-bit vectorised Altivec for parallel double-precision FP in, say, a POWER5+ would make that chip not just a TOP500 supercomputer darling, but also a good candidate to up its volumes by an order of magnitude by becoming the heart of, say, a "PowerMac G6 Extreme Edition". Or even a "MultiCulturalMasterpiece (MCM) Edition"? Oh sorry, "Extraordinary Edition" sounds better - IBM & Apple, remember the royalties due to me when the thingie comes out!
The buses
To help feed all these resources, and to scale well up to a 64-way real SMP (or 128-way logical SMP with multithreading), POWER5 further improves on the buses towards the outside. The two ring buses to its neighbours on the same MCM now operate at full processor speed (hey, a 2 GHz 256-bit bus is something to cheer about, even if it is actually two 128-bit unidirectional links!). Besides those, there are separate half-speed links to the CPUs on the opposite "book", links to the other MCMs outside the book, the L3 cache bus, the memory interface (the DDR memory controller is on chip now!), and the GX+ I/O bus, a dedicated 6+ GB/s I/O link. The L3 and memory buses are now separate and, while this is not confirmed, expect the memory bus to be at least 50% faster than on the POWER4+, i.e. something like 20++ GB/s of bandwidth per chip.
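For a feel of what that on-MCM ring bus means in bandwidth terms, here is a rough sketch of the usual width-times-clock arithmetic. It assumes one transfer per clock on each link and the 2 GHz clock mentioned above; neither is confirmed, and the function name is purely illustrative.

```python
def link_bandwidth_gb_s(width_bits: int, clock_ghz: float) -> float:
    """Peak bandwidth in GB/s, assuming one transfer of width_bits per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_ghz


# Two 128-bit unidirectional links at an assumed 2 GHz.
per_direction = link_bandwidth_gb_s(128, 2.0)
both_directions = 2 * per_direction

print(f"per direction  : {per_direction:.0f} GB/s")
print(f"both directions: {both_directions:.0f} GB/s")
```

Under those assumptions the "256-bit" headline figure is only reached when traffic flows in both directions at once.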
The memories
Talking about memory, yes, there is an L3 cache chip with every POWER4 and POWER5 chip. The difference? The L3 cache on POWER5 is on the MCM now, so its bus operates at half the CPU speed, rather than at one third of the CPU speed as on POWER4. So, a 2 GHz POWER5 will have its 36 MB L3 cache operate at 1 GHz. That is still a pretty decent figure: this should be a 256-bit bus, so the bandwidth would be a stunning 32 GB/s, not bad for an off-chip cache! Knowing you can access it all in parallel with the 20++ GB/s memory bus, and with all the data requests from other CPUs over those 250++ GB/s aggregate links... the throughput performance should be, to say the least, good.
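The effect of the clock divider is easy to sketch. The snippet below compares a half-speed and a one-third-speed L3 bus at the same core clock. Note the assumptions: the 256-bit width is the guess made above, and applying that same width and the same 2 GHz core clock to the one-third case is purely hypothetical, done only to isolate the divider's effect.

```python
def l3_bus(cpu_ghz: float, divisor: int, width_bits: int = 256):
    """Return (bus clock in GHz, peak bandwidth in GB/s) for an L3 bus.

    width_bits=256 is the article's guess, not a confirmed figure.
    """
    bus_ghz = cpu_ghz / divisor
    return bus_ghz, width_bits / 8 * bus_ghz


half_speed = l3_bus(2.0, 2)   # POWER5-style divider
third_speed = l3_bus(2.0, 3)  # POWER4-style divider, hypothetical same width

for label, (ghz, gb_s) in (("1/2 speed", half_speed), ("1/3 speed", third_speed)):
    print(f"{label}: {ghz:.2f} GHz bus, {gb_s:.1f} GB/s peak")

# Lower bound when L3 and memory (the 20++ GB/s estimate) run in parallel.
print(f"L3 + memory: {half_speed[1] + 20:.0f}+ GB/s per chip")
```

Going from a one-third to a one-half divider is a 1.5x bus clock gain on its own, before any core clock increase.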
No word on the DRAM memory type supported - it is probably still a 512-bit path (8-channel DDR), but whether it is DDR333, DDR400 or even DDR2-533... well, there are still some 9 months before these systems arrive. Even with DDR333 registered ECC DIMMs, this thing would still give you 21 GB/s of theoretical bandwidth per chip (or 25.6 GB/s if using Opteron-type DDR400 registered ECC DIMMs). In any case, a typical 64-way SMP POWER5 system composed of 32 chips will support 1 TB of RAM at the start.
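The per-chip figures above follow from path width times transfer rate. Here is a small sketch that reproduces them for the three candidate memory types, assuming the guessed 512-bit path and the nominal transfer rates in MT/s; the function name is made up for the example.

```python
def peak_memory_bandwidth_gb_s(mega_transfers_per_s: int, path_bits: int = 512) -> float:
    """Theoretical peak in GB/s; path_bits=512 is the article's guess."""
    return path_bits / 8 * mega_transfers_per_s / 1000


# Nominal transfer rates for the memory types mentioned above.
candidates = {"DDR333": 333, "DDR400": 400, "DDR2-533": 533}

for name, rate in candidates.items():
    print(f"{name:9s}: {peak_memory_bandwidth_gb_s(rate):.1f} GB/s per chip")

# 1 TB spread over the 32 chips of a 64-way system.
ram_per_chip_gb = 1024 / 32
print(f"RAM per chip at 1 TB total: {ram_per_chip_gb:.0f} GB")
```

The DDR333 case works out to a little over 21 GB/s, which matches the rounded figure in the text.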
The package
Well, anyone has got to love this package - myself included. So, again, the MCM (MultiChipModule, not MultiCulturalMasterpiece) package includes four POWER5 chips (eight real CPUs, or 16 logical CPUs using SMT) and four L3 cache chips at 36 MB each, for a total of 144 MB of cache, for all of you stunned by the figure when we published the photo. All those ultrafast buses between these chips run within the MCM, allowing those unheard-of clock speeds. Two of these 95x95 mm MCMs can be tightly coupled into a "book" (remember those hot-pluggable "books" in IBM mainframes?), which can produce a very compact, high performance 16-way system - let's say, an ultrafast, ultradense supercomputer cluster node, the replacement for the current p655 machine?
Well, if you just repeat the system density of the p655, with two such systems in every 4U of rack height, you get a 2 TFLOPs cluster with 2 TB of RAM and a peak memory bandwidth of more than 2 TB/second, all in one rack, with a bit of room to spare! Just remember to add a proper fast interconnect with distributed shared-memory capability...
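That rack estimate can be checked on the back of an envelope. The sketch below assumes 16-way nodes of 8 chips each, a 2 GHz clock, 4 FLOPs/cycle per core, 2U per node (two nodes per 4U) and roughly 21 GB/s of memory bandwidth per chip (the DDR333 case). All of these are this article's estimates or guesses, not confirmed specifications.

```python
import math

# Assumptions taken from the estimates in this article.
CORES_PER_NODE = 16
CHIPS_PER_NODE = 8
FLOPS_PER_CYCLE = 4
ASSUMED_CLOCK_GHZ = 2.0
RACK_UNITS_PER_NODE = 2        # two nodes in every 4U
MEM_BW_PER_CHIP_GB_S = 21      # DDR333 guess
TARGET_GFLOPS = 2000           # the 2 TFLOPs rack
TOTAL_RAM_GB = 2048            # the 2 TB rack

node_gflops = CORES_PER_NODE * FLOPS_PER_CYCLE * ASSUMED_CLOCK_GHZ
nodes = math.ceil(TARGET_GFLOPS / node_gflops)
rack_units = nodes * RACK_UNITS_PER_NODE
mem_bw_tb_s = nodes * CHIPS_PER_NODE * MEM_BW_PER_CHIP_GB_S / 1000
ram_per_node_gb = TOTAL_RAM_GB / nodes

print(f"peak per node      : {node_gflops:.0f} GFLOPs")
print(f"nodes for 2 TFLOPs : {nodes} ({rack_units}U of rack space)")
print(f"aggregate memory BW: {mem_bw_tb_s:.2f} TB/s")
print(f"RAM per node       : {ram_per_node_gb:.0f} GB")
```

Under these assumptions the nodes occupy about three quarters of a standard 42U rack, which is where the "bit of room to spare" comes from.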
On the other side, up to eight of these MCMs, or four "books", can be put together as a single SMP machine, with very good scalability expected thanks to all those buses and the ability of any one CPU to reach the buses on all the other CPUs as well. So, expect a 64-way POWER5 "Squadron" follow-on to the current p690 POWER4+ "Regatta" to be the very first system using this approach, arriving sometime around the middle of next year.
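The chip, MCM, book and SMP counts quoted in this section all follow from a handful of multipliers. Here is a minimal sketch that rolls them up; the class and field names are invented for the example, and the numbers are the ones given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Power5Topology:
    """Hypothetical helper holding the counts quoted in the article."""
    cores_per_chip: int = 2
    threads_per_core: int = 2
    chips_per_mcm: int = 4
    l3_mb_per_chip: int = 36
    mcms_per_book: int = 2

    def totals(self, mcms: int) -> dict:
        """Roll up chip, core, thread and L3 totals for a number of MCMs."""
        chips = mcms * self.chips_per_mcm
        cores = chips * self.cores_per_chip
        return {
            "books": mcms / self.mcms_per_book,
            "chips": chips,
            "cores": cores,
            "threads": cores * self.threads_per_core,
            "l3_mb": chips * self.l3_mb_per_chip,
        }


topology = Power5Topology()
for label, mcms in (("one MCM", 1), ("one book", 2), ("full SMP", 8)):
    print(label, topology.totals(mcms))
```

One MCM gives the 8 cores and 144 MB of L3 mentioned above, one book gives the 16-way node, and eight MCMs give the 64-way (128 logical) SMP built from 32 chips.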
The performance
It is hard to say precisely how it will perform, but my estimate, if things turn out well, is roughly 60% above the current 1.7 GHz POWER4+ in SPEC2000 benchmarks, or about 1,600 SPECint2000base and 2,300 SPECfp2000base for, say, a 2 GHz POWER5 - assuming that IBM really tunes the compilers to use the new features and the removed limitations to the maximum. Now, these figures might be just a bit higher than those of the expected 1.6+ GHz Madison 9M Itanium2 at 533 or 667 MHz FSB, but this is a per-CPU figure, not counting the scalability in a large SMP.
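It is worth splitting that 60% guess into its clock and per-clock parts. The sketch below does only that arithmetic on the estimates given above; the "implied baseline" values are simply what the estimate works back to, not measured POWER4+ results, and the 2 GHz clock is the assumed figure from the text.

```python
POWER4_PLUS_CLOCK_GHZ = 1.7
ASSUMED_POWER5_CLOCK_GHZ = 2.0   # assumption from the text, not confirmed
OVERALL_UPLIFT = 1.60            # "roughly 60% above"

# The article's estimated POWER5 scores.
estimates = {"SPECint2000base": 1600, "SPECfp2000base": 2300}

# What the estimate implies as a starting point (not a measured result).
implied_baseline = {name: score / OVERALL_UPLIFT for name, score in estimates.items()}

clock_uplift = ASSUMED_POWER5_CLOCK_GHZ / POWER4_PLUS_CLOCK_GHZ
per_clock_uplift = OVERALL_UPLIFT / clock_uplift

for name, base in implied_baseline.items():
    print(f"{name}: implied baseline of about {base:.0f}")
print(f"gain from clock alone : {(clock_uplift - 1) * 100:.0f}%")
print(f"gain needed per clock : {(per_clock_uplift - 1) * 100:.0f}%")
```

In other words, the clock bump alone covers less than a third of the estimated gain; the rest would have to come from the architectural changes and compiler tuning.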
Of course, added features like dynamic firmware updates, ECC for on-chip paths as well, and improved power management for much cooler chips complete the list. Fujitsu's SPARC64 VI has matching and, in some cases, superior reliability features, like ECC in CPU registers, for instance, but its performance might be a tad lower, at around 1,500 SPECint2000base and 2,000 SPECfp2000base, even though the CPU is built on a 0.09µ process (still way higher figures than Sun's UltraSPARC IV, though).
Anyway, 2004 is not so far away - the privileged ones might be able to run their benchmarks on POWER5 about now. In the next piece, I look at Fujitsu's new ChipZilla.... µ
The photos
Some of the key IBM guys behind POWER5 - Joel Tendler, Balaram Sinharoy, Ravi...
POWER5 64-way SMP interconnect
POWER5 CPU die