The Inquirer-Home

IBM POWER6 sub torpedoes Itanium Montvale cruiser

Or will Intel's Tukwila missile sink the sub?
Mon Feb 27 2006, 10:15
SINCE its ISSCC talk early this month, IBM's POWER6 has captured the imagination of quite a few processor buffs out there - not me, yet, as I'm used to seeing great CPUs fail since the Moto 68K and Alpha days - and has also raised debates on several tech forums.

While IBM still keeps most of the official POWER6 data closely guarded - I'd expect more during Hot Chips or Fall Processor Forum some six months from now - there is enough right now to assemble a rough picture of what may be in store for the users of the new chip. Keep in mind, of course, that everything is speculation till the official announcement, as usual.

At the other end, Itanium is the only competitor to POWER right now in the high-end non-X86 general purpose segment, and there is plenty of news there, too. With Montecito only appearing in volume in the middle of this year, the Montvale 65nm shrink will be out only in 2007 - unless another change of plans hits it. Montvale should, hopefully, hit or exceed the 2GHz mark originally envisioned for Montecito. However, that may not be enough if IBM really manages to ship POWER6 in the same 65nm geometry that same year.

Why? Well, besides the expected 4GHz starting frequency, most of the sources claim that there will be no IPC reduction for the POWER6 cores vs POWER5. More likely, there will be slight execution improvements, on top of nearly twice the clock rate of the current 2.2GHz POWER5+. After all, some 750 million transistors in the upcoming chip (nearly triple the 276 million POWER5 transistors) should provide more than enough space for the improved cores plus well over double the caches - in fact, I wouldn't be surprised to see even dual 4MB secondary on-chip caches, one per core, as well as dual memory controllers.

The most often mentioned points are keeping the current pipeline length, plus or minus a few stages, while doubling the clock speed; scaling the cache, memory and I/O bus speeds proportionally to the frequency (so everything will be twice as fast both inside and outside - except probably main memory latency...); a distributed clock for lower power at a given performance; and gate delay reductions with less logic delay per pipeline stage needed to complete the operations. Also, the zSeries mainframe custom ops might be "microcoded" here to assist migration to a common processor core for all IBM non-PC server products, the "z", "p" and "i" series - a project known as "eclipz".

Now, let's assume these are spot on, and that the clock rate is 4GHz to start with. The shared L2 cache is assumed to be 4MB, twice that of the POWER5+, but still only a sixth of the on-chip 24MB shared L3 cache of, say, a 2.4GHz Montvale (which, of course, may or may not compensate for its vastly slower external fabric compared to the POWER6 - keep in mind that the POWER5+ external 36MB L3 cache uses a separate bus from its main memory, and that, if POWER6 continues using external L3, it might be at least 64MB in size).

The Itanium 2007 flagship executes two bundles of three instructions per cycle (far from sustained, of course), while POWER6 should still execute four instructions per cycle, with the sustained rate not far from that due to out-of-order execution. For peak FP operation, both should execute two fused mul-add ops per cycle (9.6 GFLOPs Rpeak per Itanium core, 16 GFLOPs Rpeak per POWER core).

If we follow current Linpack Rmax benchmark results, the Itanium should get a bit above 90% of Rpeak in the measurable (Rmax) number after all the optimisations, i.e. 8.8 GFLOPs Rmax per core, while POWER should obtain around 75% as Rmax, i.e. 12 GFLOPs per core. So, in Linpack Rmax (useful in supercomputing tender bids, but usually useless in real use), 4GHz POWER6 should be around 35% faster per core than 2.4 GHz Montvale.
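The per-core arithmetic above can be checked with a short sketch. It assumes, as is conventional, that each fused mul-add counts as two floating point operations; the clock speeds and Rmax figures are the ones assumed in this article, and the function and variable names are purely illustrative.

```python
# Figures are this article's assumptions, not official data.
FLOPS_PER_CYCLE = 2 * 2  # two fused mul-add ops per cycle, each counted as 2 flops


def rpeak_gflops(clock_ghz: float) -> float:
    """Peak GFLOPs per core: clock in GHz times flops per cycle."""
    return clock_ghz * FLOPS_PER_CYCLE


montvale_rpeak = rpeak_gflops(2.4)  # assumed 2.4GHz Montvale
power6_rpeak = rpeak_gflops(4.0)    # assumed 4GHz POWER6

montvale_rmax = 8.8                 # article's estimate, a bit above 90% of Rpeak
power6_rmax = 0.75 * power6_rpeak   # article's estimate, around 75% of Rpeak

montvale_efficiency = montvale_rmax / montvale_rpeak
power6_advantage = power6_rmax / montvale_rmax - 1.0

print(f"Rpeak per core: Montvale {montvale_rpeak:.1f}, POWER6 {power6_rpeak:.1f} GFLOPs")
print(f"Montvale Linpack efficiency: {montvale_efficiency:.1%}")
print(f"POWER6 per-core Rmax advantage: {power6_advantage:.0%}")
```

With these inputs the POWER6 lead comes out at roughly 36%, in line with the "around 35%" estimate.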

What about the bandwidth race? If we assume 51.2GBytes/s main memory bandwidth with DDR3-1066 per dual-core POWER6 chip (twice that of current POWER5+ in the IBM P5-575 using DDR2-533), compared to 12.8GBytes/s on the 800MHz FSB dual-core Montecito/Montvale, IBM wins hands-down. But that's not all. The inter-processor, inter-MCM and L3 cache buses are separate on the POWER5, and seemingly will be so on the POWER6 too.

Expect around 64GBytes/s of dedicated L3 cache bandwidth (twice that of the 2GHz POWER5+), and, if sticking with the eight-chip, 16-core dual-MCM "book" approach from POWER5, each dual-core chip should have another four buses (full clock speed within the MCM and half speed outside it) for an extra 200+ GBytes/s of inter-CPU bandwidth. Of course, the Montecito L3 cache is internal, and will probably have lower latency even if the bandwidth is similar.

I do expect that Intel will try to push the 128-bit Montvale FSB to 1066MHz by then anyway, giving it roughly 17GBytes/s throughput - provided there is a chipset to support it. A 2.67GHz, 1066MHz FSB Montvale would be somewhat closer to POWER6 in FP, at least. Keep in mind, though, this would only be the case in the one-CPU-per-FSB situation, where there is no FSB sharing among multiple CPU chips - as is the case with the Blackford platform for the Woodcrest X86 server chips.
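For those who want to redo the bus numbers, here is a minimal sketch of the usual peak bandwidth formula: bus width in bytes times the effective transfer rate. It treats the quoted FSB "MHz" figure as transfers per second and uses decimal GBytes; the helper name is hypothetical, and the 51.2GBytes/s POWER6 figure is simply this article's assumption.

```python
def bus_bandwidth_gbytes(width_bits: int, mega_transfers_per_s: float) -> float:
    """Peak bandwidth in decimal GBytes/s: bytes per transfer times transfers per second."""
    return (width_bits / 8) * mega_transfers_per_s / 1000.0


fsb_800 = bus_bandwidth_gbytes(128, 800)       # 128-bit FSB at 800MHz
fsb_1066 = bus_bandwidth_gbytes(128, 1066.67)  # same bus pushed to 1066MHz

POWER6_MEMORY_GBYTES = 51.2  # assumed main memory bandwidth per dual-core POWER6 chip

print(f"800MHz FSB:  {fsb_800:.1f} GBytes/s")
print(f"1066MHz FSB: {fsb_1066:.1f} GBytes/s")
print(f"POWER6 memory vs 800MHz FSB:  {POWER6_MEMORY_GBYTES / fsb_800:.1f}x")
print(f"POWER6 memory vs 1066MHz FSB: {POWER6_MEMORY_GBYTES / fsb_1066:.1f}x")
```

Even with the faster FSB, the assumed POWER6 memory bandwidth stays about three times higher, and that is before counting its separate L3 and inter-CPU buses.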

The SPEC2000 benchmark, if still around next year, should give us (peak rates) around 3000 SPECint and 5200 SPECfp per 4GHz POWER6 core, while a 2.4GHz Montvale should give us around 2400 SPECint and 4300 SPECfp per core - not that far off, either, especially since, for now, Montvale's rates are a bit more predictable than those of POWER6. Simply put, more is known about Montvale (being just an improved Montecito) than about POWER6 at this point.
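To put those projected scores side by side, the sketch below works out the per-core lead and the score per GHz. The inputs are this article's guesses, not measured results, and the dictionary layout is just for illustration.

```python
# Projected SPEC2000 peak scores per core, as guessed in this article.
chips = {
    "POWER6": {"clock_ghz": 4.0, "specint": 3000, "specfp": 5200},
    "Montvale": {"clock_ghz": 2.4, "specint": 2400, "specfp": 4300},
}

int_lead = chips["POWER6"]["specint"] / chips["Montvale"]["specint"] - 1.0
fp_lead = chips["POWER6"]["specfp"] / chips["Montvale"]["specfp"] - 1.0
print(f"POWER6 lead per core: SPECint {int_lead:.0%}, SPECfp {fp_lead:.0%}")

# Score per GHz gives a rough view of per-clock throughput.
per_ghz = {
    name: {
        "specint": c["specint"] / c["clock_ghz"],
        "specfp": c["specfp"] / c["clock_ghz"],
    }
    for name, c in chips.items()
}
for name, scores in per_ghz.items():
    print(f"{name}: {scores['specint']:.0f} SPECint/GHz, {scores['specfp']:.0f} SPECfp/GHz")
```

On these guesses POWER6 leads by about a quarter in integer and a fifth in FP per core, while Montvale still does more work per clock, which is why the two end up "not that far off".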

In summary, at first glance, POWER6 does have a solid chance of taking the undisputed server CPU performance throne - and possibly the overall per-core CPU throne, unless successors to Woodcrest and K8L upset it - when it arrives some 18 months from now, give or take a month.

The only 'spoiler' would be an early release of the quad-core Tukwila Itanium, with a much faster interconnect protocol (CSI or, as a replacement, HyperTransport 3) and on-chip memory controllers thrown in for scalability. Right now, though, this is highly unlikely to happen in 2007 - even though I'd hope to see it, as the battle would at least be far more interesting. More on this as things develop. µ
