
IBM's POWER5: The multi-chipped Monster (MCM) revealed

In detail
Mon Oct 20 2003, 07:52
BESIDES FUJITSU'S extraordinarily impressive SPARC64 VI processor, a proof that others may continue to SPARCle even when Sun appears to be setting, there was one other server processor that caught everyone's attention at Microprocessor Forum - the IBM POWER5.

See Pic of eight way POWER5 with 144MB cache revealed.

Since many have wondered how come I've talked about 144 MB of cache on that cute 95x95 mm MCM shown, let's explain a bit more about the POWER5, both from IBM sources and from yours truly's own crystal ball sightings.

The processor
While not as big as a 450 mm2 Itanium2 Madison, the POWER5 is nevertheless a large chip for its 0.13µ (micron) CopperSOI 8-layer process. There are over 276 million transistors on a 389 mm2 die, with a whopping 5,400 or so active pins (2,313 signal, 3,057 power).

All these transistors house two fully independent cores, each of them 2-way multithreaded as well, for 4 logical CPUs per chip. Each core has 120 rename registers each for integer and for FP, 8 execution units, and an issue throughput of up to 5 instructions/cycle and 4 FLOPs/cycle, as well as a 64 KB I-cache (2-way set associative) and a 32 KB D-cache (4-way set associative). The cores share an ultrafast three-bank L2 cache of 1.92 MB (3 x 640 KB caches with independent buses, 10-way set associative). Its precise bandwidth figure is not known, but it should be well above 200 GB/s in total. The cores also have enhanced data stream prefetching to make better use of that bandwidth.
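The per-chip figures above are simple products, which a few lines of Python can check. This is only a sketch: the 2 GHz clock is an assumption borrowed from the examples later in this piece, not an announced speed, and the variable names are purely illustrative.

```python
# Per-chip figures quoted in the article.
CORES_PER_CHIP = 2
THREADS_PER_CORE = 2
FLOPS_PER_CYCLE = 4      # peak FP issue per core
L2_BANKS = 3
L2_BANK_KB = 640
CLOCK_GHZ = 2.0          # ASSUMPTION: illustrative clock, not confirmed

logical_cpus = CORES_PER_CHIP * THREADS_PER_CORE
l2_total_kb = L2_BANKS * L2_BANK_KB
peak_gflops_per_core = FLOPS_PER_CYCLE * CLOCK_GHZ
peak_gflops_per_chip = peak_gflops_per_core * CORES_PER_CHIP

print(f"logical CPUs per chip: {logical_cpus}")
print(f"L2 total: {l2_total_kb} KB")
print(f"peak GFLOPs per core / chip: {peak_gflops_per_core} / {peak_gflops_per_chip}")
```

The 1,920 KB total is where the "1.92 MB" comes from, counting 1 MB as 1,000 KB.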

As many of our industry friends and users know, POWER4 is not as simple and straightforward as, say, Alpha, in terms of orthogonal execution - it has more limitations on the instruction issuing order, as it issues them in bundles (Oh, didn't I see this word in Itanium manuals too?). POWER5 tries to address many of the situations where a bundle's execution couldn't proceed because of some resource conflict or dependency. Its larger rename register pool will also help it achieve higher Linpack Rmax GFLOPs ratings, critical for the TOP500 supercomputer list.

And no, there is no Altivec (yet), although obviously adding, say, a wider 256-bit vectorised Altivec for parallel double-precision FP in, say, a POWER5+ would make that chip not just a TOP500 supercomputer darling, but also a good candidate to up its volumes by an order of magnitude by becoming the heart of, say, a "PowerMac G6 Extreme Edition"? Or even "MultiCulturalMasterpiece (MCM) Edition"? Oh sorry, "Extraordinary Edition" sounds better - IBM & Apple, remember the royalties due to me when the thingie comes out?

The buses
To help feed all these resources, and scale well up to a 64-way real SMP (or 128-way logical SMP with multithreading), POWER5 further improves on the buses towards the outside. The two ring buses to its neighbours on the same MCM now operate at full processor speed (hey, a 2 GHz 256-bit bus is something to cheer about, even if it is actually two 128-bit unidirectional links!). There are also separate half-speed links to the CPUs on the opposite "book" and to the other MCMs outside the book, plus the L3 cache bus, the memory interface (the DDR memory controller is on chip now!), and GX+, a dedicated 6+ GB/s I/O link. The L3 and memory buses are now separate, and, while not confirmed, expect the memory bus to be at least 50% faster than on the POWER4+, i.e. something like 20++ GB/s bandwidth per chip.

The memories
Talking about memory, yes, there is an L3 cache chip with every POWER4 and POWER5 chip. The difference? The L3 cache on POWER5 is on the MCM now, and so its bus operates at half CPU speed, rather than one third of CPU speed as on POWER4. So, a 2 GHz POWER5 will have its 36 MB L3 cache operating at 1 GHz - still a pretty decent figure, as this should be a 256-bit bus, and the bandwidth would then be a stunning 32 GB/s, not bad for an off-chip cache! Knowing you can access it all in parallel with the 20++ GB/s memory bus, and all the data requests from other CPUs over those 250++ GB/s aggregate links... the throughput performance should be, to say the least, good.
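The 32 GB/s figure is just bus width times clock. A minimal sketch, assuming one transfer per clock on a 256-bit bus (the width is the article's guess, and `bus_bandwidth_gb_s` is an illustrative helper, not anything from IBM). The one-third-speed line is a hypothetical comparison at the same width and CPU clock, not a measured POWER4 figure.

```python
def bus_bandwidth_gb_s(width_bits: int, clock_ghz: float) -> float:
    """Peak bandwidth in GB/s, assuming one transfer per bus clock."""
    return width_bits / 8 * clock_ghz

CPU_CLOCK_GHZ = 2.0      # the article's example clock
L3_BUS_BITS = 256        # ASSUMPTION: the article's guessed bus width

# POWER5: L3 bus at half the CPU clock.
l3_half_speed = bus_bandwidth_gb_s(L3_BUS_BITS, CPU_CLOCK_GHZ / 2)

# Hypothetical: same width and CPU clock, but a one-third-speed bus.
l3_third_speed = bus_bandwidth_gb_s(L3_BUS_BITS, CPU_CLOCK_GHZ / 3)

print(f"half-speed L3 bus:  {l3_half_speed:.1f} GB/s")
print(f"third-speed L3 bus: {l3_third_speed:.1f} GB/s")
```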

No word on the DRAM memory type supported - it is probably still a 512-bit path (8-channel DDR), but whether it is DDR333, DDR400 or even DDR2-533... well, there are still some 9 months before these systems arrive. Even with DDR333 registered ECC DIMMs, this thing would still give you 21 GB/s theoretical bandwidth per chip (or 25.6 GB/s if using Opteron-type DDR400 registered ECC DIMMs). In any case, a typical 64-way SMP POWER5 system made up of 32 chips will support 1 TB of RAM at the start.
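Those bandwidth numbers come straight from path width times transfer rate. A rough sketch, assuming the 512-bit path guessed at above and the nominal transfer rates of each memory type; `peak_gb_s` is an illustrative helper, and none of these configurations is confirmed for POWER5.

```python
PATH_BITS = 512  # ASSUMPTION: the article's guess, 8 channels x 64 bits

def peak_gb_s(mega_transfers_per_s: float, path_bits: int = PATH_BITS) -> float:
    """Theoretical peak bandwidth in GB/s for a given transfer rate."""
    return path_bits / 8 * mega_transfers_per_s / 1000

# Nominal transfer rates in MT/s for the candidate memory types.
candidates = [("DDR333", 333.33), ("DDR400", 400.0), ("DDR2-533", 533.33)]

for name, rate in candidates:
    print(f"{name}: {peak_gb_s(rate):.1f} GB/s per chip")

# 1 TB across the 32 chips of a 64-way system, in GB per chip.
gb_per_chip = 1024 / 32
print(f"RAM per chip: {gb_per_chip:.0f} GB")
```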

The package
Well, anyone has got to love this package - myself included. So, again, the MCM (MultiChipModule, not MultiCulturalMasterpiece) package includes four POWER5 chips (eight real CPUs, or 16 logical CPUs using SMT) and four L3 cache chips, each at 36 MB, for a total of 144 MB of cache, for all of you stunned by the figure when we published the photo. All those ultrafast buses between these chips run within the MCM, allowing those unheard-of clock speeds. Two of these 95x95 mm MCMs can be tightly coupled into a "book" (remember those hot-pluggable "books" in IBM mainframes?), which can produce a very compact, high performance 16-way system - let's say, an ultrafast, ultradense supercomputer cluster node, the replacement for the current p655 machine?
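For anyone still puzzling over the 144 MB, the package arithmetic can be tallied in a few lines. A small sketch using only the counts quoted above; the names are illustrative.

```python
# Counts quoted in the article.
CHIPS_PER_MCM = 4
CORES_PER_CHIP = 2
THREADS_PER_CORE = 2
L3_MB_PER_CHIP = 36
MCMS_PER_BOOK = 2

cores_per_mcm = CHIPS_PER_MCM * CORES_PER_CHIP
logical_per_mcm = cores_per_mcm * THREADS_PER_CORE
l3_mb_per_mcm = CHIPS_PER_MCM * L3_MB_PER_CHIP
cores_per_book = cores_per_mcm * MCMS_PER_BOOK

print(f"per MCM: {cores_per_mcm} cores, {logical_per_mcm} logical CPUs, "
      f"{l3_mb_per_mcm} MB L3")
print(f"per book: {cores_per_book}-way")
```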

Well, if you just repeat the system density of the p655, and have two such systems in every 4U of rack height, you get a 2 TFLOPs cluster with 2 TB of RAM and a peak memory bandwidth of more than 2 TB/second, all in one rack, with a bit of room to spare! Just remember to add a proper fast interconnect with distributed shared-memory capability...
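One way to arrive at those rack-level figures is sketched below. The node count of 16 is an assumption picked to reproduce the 2 TFLOPs number while leaving rack space spare, the 2 GHz clock is the example speed used elsewhere in this piece, and the per-chip memory bandwidth is the DDR333 estimate from above. None of this is a confirmed configuration.

```python
# Figures from the article, plus clearly marked assumptions.
CORES_PER_NODE = 16        # one two-MCM "book"
CHIPS_PER_NODE = 8
FLOPS_PER_CYCLE = 4        # peak per core
CLOCK_GHZ = 2.0            # ASSUMPTION: example clock
MEM_GB_S_PER_CHIP = 21.3   # article's DDR333 estimate
NODES_PER_4U = 2           # p655-style density
NODES = 16                 # ASSUMPTION: chosen to leave rack room to spare

rack_units = NODES // NODES_PER_4U * 4
peak_tflops = NODES * CORES_PER_NODE * FLOPS_PER_CYCLE * CLOCK_GHZ / 1000
mem_bw_tb_s = NODES * CHIPS_PER_NODE * MEM_GB_S_PER_CHIP / 1000

print(f"rack space used: {rack_units}U")
print(f"peak: {peak_tflops:.2f} TFLOPs")
print(f"peak memory bandwidth: {mem_bw_tb_s:.2f} TB/s")
```

Under these assumptions the sketch gives roughly 2.05 TFLOPs and 2.7 TB/s in 32U, consistent with the "more than 2 TB/second" claim.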

On the other hand, up to eight of these MCMs, or four "books", can be put together as a single SMP machine, with very good scalability expected thanks to all those buses and the ability of one CPU to access the buses on all the other CPUs as well. So, expect a 64-way POWER5 "Squadron" follow-on to the current p690 POWER4+ "Regatta" to be the very first system using this approach, sometime around the middle of next year.

The performance
It is hard to say precisely how it would perform, but my estimate, if things turn out well, is roughly 60% above the current 1.7 GHz POWER4+ in SPEC2000 benchmarks, or about 1,600 SPECint2000base and 2,300 SPECfp2000base for, say, a 2 GHz POWER5 - assuming that IBM really tunes the compilers to use the new features and limitation removals to the maximum. Now, these figures might be just a bit higher than those of the expected 1.6+ GHz Madison 9M Itanium2 at 533 or 667 MHz FSB, but this is a per-CPU figure, not counting the scalability in a large SMP.

Of course, added features like dynamic firmware updates, ECC for on-chip paths as well, and improved power management for much cooler chips complete the list. Fujitsu's SPARC64 VI has matching and, in some cases, better reliability features, like ECC in CPU registers, for instance, but its performance might be a tad lower, at around 1,500 SPECint2000base and 2,000 SPECfp2000base, even though the CPU is built on a 0.09µ process (still way higher rates than Sun's UltraSPARC IV, though).

Anyway, 2004 is not so far away - the privileged ones might be able to run their benchmarks on POWER5 about now. In the next piece, I'll look at Fujitsu's new ChipZilla.... µ

The photos
Some of the key IBM guys behind POWER5 - Joel Tendler, Balaram Sinharoy, Ravi...

POWER5 64-way SMP interconnect


