The Inquirer-Home

It'll be Sandy Bridge against Bulldozer in 2011

Analysis: The tightest Intel versus AMD performance battle in a long time
Tue Aug 31 2010, 14:36

OVER THE PAST FEW WEEKS more details about Intel's and AMD's next microarchitectures - Sandy Bridge and Bulldozer - have become clear, including, for the first time, details of the high end parts slated for well into 2011, likely mid-year.

While the mainstream LGA1156 Sandy Bridge is fairly clear by now, down to the model numbers and expected performance, its top notch brethren in the brand new Socket LGA2011, aptly named to match the release year, have far more impressive specifications.

Eight full cores - no sharing of FPUs and such, but eight true full cores - 20MB of shared L3 cache, and quad channel DDR3 memory on a single 32nm process die, with clock rates similar to the quad-core Sandy Bridge parts at launch, add up to a potential performance monster. If you estimate an average 15 per cent clock for clock performance boost per core - and that is without using AVX instruction extensions - plus two more cores and at least 5 per cent higher clock speed compared to the current 3.33GHz top end LGA1366 processors like the Core i7 980X and Xeon X5680, you'll easily get over 50 per cent extra peak performance right at launch.
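That back-of-envelope estimate is just three factors multiplied together. A minimal sketch in Python, assuming the baseline is the six-core Core i7 980X named above; the function name is made up for illustration and the inputs are the article's estimates, not measurements:

```python
def relative_peak(per_core_gain, cores_new, cores_old, clock_gain):
    """Combine per-core, core-count and clock-speed factors into one speedup.

    All inputs are estimates: 0.15 means a 15 per cent gain.
    """
    return (1 + per_core_gain) * (cores_new / cores_old) * (1 + clock_gain)


# 15 per cent per core, eight cores versus six, 5 per cent more clock
speedup = relative_peak(per_core_gain=0.15, cores_new=8, cores_old=6, clock_gain=0.05)
print(f"Estimated peak speedup: {speedup:.2f}x")
```

With those inputs the product comes out at roughly 1.6 times the current top end, which is where the "over 50 per cent" figure comes from.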

Even if we include the speed bin step-up to 3.46GHz expected at year-end for the current Westmere generation, in the Core i7 990X and Xeon X5690, the new chips will still have at least the same clock speeds to start out with. And, looking at the 3.4GHz starting speed bin for the initial Core i7 2600 quad-core Sandy Bridge part, I am inclined to expect at least 3.6GHz launch speed for the octo-core high end parts two quarters later.

As an example, a 3.6GHz highest end Sandy Bridge based dual Xeon workstation would, with its 16 total cores and AVX set, be able to churn out an astonishing 460GFLOPs in double precision floating-point, compared to roughly 160GFLOPs on a dual Westmere X5680 3.33GHz Xeon without AVX extensions. If the 3.8GHz Turbo mode kicks in across all cores, we'll be quite close to a peak of half a teraflop on a desktop. Not bad at all, and it should provide something for the computational GPU crowd to think about.
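Those peak figures follow from sockets times cores times clock times floating-point operations per cycle. A rough sketch in Python; the per-cycle rates (8 double precision operations per core per cycle with AVX, 4 with SSE on Westmere) are assumptions not stated in the article, and the function name is illustrative:

```python
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak double precision GFLOPS for a multi-socket box."""
    return sockets * cores_per_socket * ghz * flops_per_cycle


# Assumed rates: AVX = 4-wide add + 4-wide multiply per cycle, SSE = 2 + 2
AVX_DP_FLOPS_PER_CYCLE = 8
SSE_DP_FLOPS_PER_CYCLE = 4

sandy_bridge = peak_gflops(2, 8, 3.6, AVX_DP_FLOPS_PER_CYCLE)
sandy_bridge_turbo = peak_gflops(2, 8, 3.8, AVX_DP_FLOPS_PER_CYCLE)
westmere = peak_gflops(2, 6, 3.33, SSE_DP_FLOPS_PER_CYCLE)

print(f"Dual Sandy Bridge at 3.6GHz: {sandy_bridge:.1f} GFLOPS")
print(f"Dual Sandy Bridge at 3.8GHz: {sandy_bridge_turbo:.1f} GFLOPS")
print(f"Dual Westmere at 3.33GHz:    {westmere:.1f} GFLOPS")
```

These are theoretical ceilings; real code rarely sustains both the add and multiply units every cycle.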

The benefits of extra DRAM bandwidth and capacity via four DDR3 channels, all fed through a humongous 20MB L3 cache, should be felt especially in memory and cache intensive codes, as many more loops will fit within the enlarged cache without much outside traffic. On top of that, the Sandy Bridge designers have optimised the L3 cache latencies, too.

On the other side, AMD also has a new horse to show off. The Bulldozer-based Interlagos replacement for Magny Cours, with a total of eight dual-core blocks, provides for 16 integer cores with eight shared floating-point units.

While AMD's intent was to let the common thread pairs - normally one integer plus floating-point thread, and another integer-only thread - share such a block without wasting die area, the shared FPU could hurt scientific apps where all cores might be loaded with floating-point tasks. Since the single die 4-block 8-core Bulldozer should run at 3.2GHz and above, similar to the current six-core Phenom or Opteron, the dual die 8-block 16-core Interlagos shouldn't be far behind, probably around 2.6GHz at start.

Keeping in mind the well publicised scheduling and execution path improvements, and the roughly 60 per cent performance boost expected when going from the 2.3GHz Magny Cours to the 2.5GHz Interlagos - or, by extrapolation, a similar 60 per cent gain from the 3.2GHz Phenom II X6 to a 3.5GHz Bulldozer part - AMD should be on its way back into the performance race this time.
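To see how much of that 60 per cent would come from the cores themselves rather than from core count and clock, the figure can be divided back out. A sketch in Python under two loud assumptions: that the 60 per cent refers to whole-chip throughput, and that the top Magny Cours has 12 cores (the article does not state its core count). The function name is hypothetical:

```python
def per_core_per_clock_gain(total_gain, cores_old, cores_new, ghz_old, ghz_new):
    """Speedup left over after removing core-count and clock-speed scaling."""
    scaling = (cores_new / cores_old) * (ghz_new / ghz_old)
    return (1 + total_gain) / scaling


# Assumption: 12-core Magny Cours versus 16-core Interlagos
server = per_core_per_clock_gain(0.60, 12, 16, 2.3, 2.5)
# Six-core Phenom II X6 versus eight-core Bulldozer, as named in the text
desktop = per_core_per_clock_gain(0.60, 6, 8, 3.2, 3.5)

print(f"Server, per core per clock:  {server:.2f}x")
print(f"Desktop, per core per clock: {desktop:.2f}x")
```

Under those assumptions both cases work out to a per-core, per-clock gain of about 10 per cent, with the rest coming from extra cores and higher clocks.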

There is another potential boost for AMD here, which might have been forgotten in the mists of time. A few years ago, AMD was toying with a kind of 'reverse multithreading' approach, where instead of two threads sharing a single core as in typical multithreading, the idea was to let one very resource demanding thread spread across two cores. The otherwise fairly complex problem becomes much simpler now, if the two integer cores within each Bulldozer block share the same instruction fetch logic.

Why is it important? Well, looking at things the usual way Intel would still hold the per-core and per-thread performance lead. The expected LGA2011 Sandy Bridge simply won't have competition in that realm yet. However, if AMD was to somehow enable reverse multithreading on the Bulldozer, allowing a single thread to use all of the dual-core block resources at once, we might, for the first time in a while, see AMD take the lead in per-thread performance too, especially the integer-rich ones.

The question is, will that happen at launch? We still have over half a year to figure it out. Either way, the performance competition will be the most interesting in years. µ

Comments
wtf....

Bulldozer will likely do 5.5 ghz on high end air @ddr3 2000 easy.

It should be a no brainer. Basic High end Intel setup.

Or High end Bulldozer setup + SSD + better GPU for the same money.

C'mon, use your brains. Unless you have money to burn or some very specific computing needs, you will end up with a better system going AMD.

Crossfire or SLI, it doesn't care, just plug em in.

Use the money saved and buy an SSD or a better video card and LoL as you rape em with a Radeon 7990.

posted by : grndzro, 16 September 2011
apollothesun1@juno.com

Talk about an Intel fanboy, this article has its ice cream all over it. Might as well just say it loud n clear. AMD doesn't stand a chance against Sandy. I'm not an engineer but give AMD a chance. This might be it, the one to put em head to head for the years to come. Bulldozer should be a very positive surprise, can't wait.

posted by : skopas, 14 February 2011
Terribly biased one sided article favoring Intel

Why do you assume that the clock speed of Bulldozer will be close to the Opterons'? Bulldozer is a brand new architecture with different kinds of specifications and looks nothing like the Opteron architecture. You haven't even seen the performance numbers yet.

Why do you say FPU is shared? for 128-bit FP applications each BD module has two FP units corresponding to two integer cores.

Reverse Multi-threading? Really? Never heard of that technical jargon. There is nothing called reverse multi-threading. The very fact that you could find parallelism in one process spawns the idea of a thread. The OS will always associate one thread with one core. In reality one application process might have 4 threads, for example, simultaneously executing on 4 cores. Whatever reverse multi-threading means.

20MB L3 cache looks like a lot. The current Westmere core high end version probably has a max of 12MB L3 cache. How come, in the same process with no major tweak in the architecture, they could get lower latency with another 8MB? This one beats me. I guess the extra cache might be because of the two extra cores of SB compared to the current max of 6 in one die. But what does this do to die area? How efficient is it to yield that kind of chip? Why is there no comment in this regard?

posted by : jollyjugg, 15 November 2010
Terrible article

"As an example, a 3.6GHz highest end Sandy Bridge based dual Xeon workstation would, with its 16 total cores and AVX set, be able to churn out an astonishing 460GFLOPs in double precision floating-point"

Ah, pointless number guessing. That's useful.

"The Bulldozer-based Interlagos replacement for Magny Cours, with a total of eight dual-core blocks, provides for 16 integer cores with eight shared floating-point units."

No idea why you decided to invent a new substitute for the word "module", nor why any tech writer in their right mind finds it necessary to refer to a Bulldozer chip by both its module count AND core count (way to be redundant).

I'm also not sure why you specify it as having 8 shared FPUs, considering that this will only be the case with 256-bit instructions. For everything else, the CPU has 16 FPUs.

"However, if AMD was to somehow enable reverse multithreading on the Bulldozer, allowing a single thread to use all of the dual-core block resources at once, we might, for the first time in a while, see AMD take the lead in per-thread performance too, especially the integer-rich ones."

I suggest you ask AMD about this, and see how much they laugh at you. I can't believe that you actually printed this.

"The nice thing about bulldozer, is *there is a dedicated vector processor* in the chip via the CPU+GPU merge (fusion)."

Bulldozer has nothing to do with Fusion.

posted by : Adam, 05 September 2010
Sandy Grudge Best....

First Main improvement is Herald'd, Vector Extensions, Vx. Bulldozer also has Vx. However, Sandy has VxVt or Vertulization of Vx, Deeper Mapping of NEW Item.

Certainly there is Massive Differences in Two New Models. InterLargos Vs Sandee'.

SB Seems to Have GPU on Die thats Paned out much more powerful than first skeptical pundits hailed. Just cost of 5450 saves bundle on sandy & since more gamers may not be possible, family can be satisfied Best CPU ,for them, in Business.

Next Intel does Hyper Threading ,While AMD Does Hypertransport. HyperThreading Is Much Better, Intel Processors work much smoother than Hyper Transport. If Os Stack Build is bit shakey & sticky, Hyper Threading will make that less, eventually, anybuild can be corrupted.

Hyper transport will always have glitches & problems.

Sacmo Sandy, Folk legend for Family, perhaps even lower performance point cost, given $150 SB Can beat Intels Extreme Processor family, With Ease. 2800, yet to see specs, might be Monster of Good Thing.

Filled with Benifits & Kindness. Bulldozer has Server in 2 core modules, while desktop has 3 core modules, as Eva Stateed here Yesterday. Both Good for Specific Uses. Mobile has ity bity chips, first seen yestrday at Global Foundry Exhibition.

What Better Choice Could Public Want. drashek for Hon President. Yes,yes?

posted by : Sandee' Scher DMV Executive...., 02 September 2010
Reverse Multi-threading.

"A few years ago, AMD was toying with a kind of 'reverse multithreading' approach..."

That was nothing but an urban legend. Please don't revive that again.

You are spending 1/3 of the article on it, and do you have credible sources Mr. Journalist?

posted by : Jon Moller, 01 September 2010
AMD: All SB instruction sets + more

If I'm not mistaken this will be the first time(?) that AMD will support all the instruction sets in Intel's CPU (SB) AND add some new ones of their own.

So if you like more instruction sets then AMD will be your choice.

posted by : kedas, 01 September 2010
Floating point performance in scientific applications

The nice thing about bulldozer, is *there is a dedicated vector processor* in the chip via the CPU+GPU merge (fusion).

If you are doing serious scientific number crunching (i.e. 8 threads worth), it will have been an optimized implementation to take advantage of 8 threads. Perhaps it would be better to optimize for the kick-ass GPU instead?

I think this 8-FP core performance stuff is a little bit of FUD.

posted by : Guest, 01 September 2010
FFS I was going to get an Intel i7 930 but not now, lol.

The 1366 socket is going to be history very soon. Is 2011 enough pins?

Will the AMD chips work on Socket AMD3+ or whatever it's called?

I like the AMD thinking, and it seems a more obvious performance gain. In fact why are 4 cores appearing as 1 core to software?

posted by : interested_party, 01 September 2010
So from those clock speeds

We can take it that 2.5 tops is the most we can get from any X86 cpu at the moment without exotic cooling ?

this is why we need a new architecture

posted by : LPF, 01 September 2010
Better design choice (AMD)

Although we will only be sure once we see the numbers, AMD's approach to something similar to Intel's Hyper-Threading makes much more sense.
The Integer cores are the threads (not shared)
and the FP core is shared like a co-processor.
and this all for a very small die size increase.
(most work of a CPU is integers)

posted by : kedas, 01 September 2010
Sources...?

Interesting article and interesting glimpses.
But with no sources this is mere speculation or at best logical inference.

Unfortunately, Sandy Bridge B2 (i.e. top level Desktop) info is VERY thin on the ground. In fact even LGA2011 is not confirmed officially yet.

I do hope all you write is true but I can't quite believe it till I see it backed-up with some credible sources.

posted by : Ex_Pat, 01 September 2010