The Inquirer

CPU and GPU now, the convergence goes on

Fusion or fission?
Friday, 30 October 2009, 14:00

OVER THE PAST TWO YEARS, CPU and GPU capabilities have started to converge, but at a snail's pace. While GPUs can now handle branching code and double-precision IEEE floating-point operations, we still don't have a GPU that can run generic C or Fortran code, at least not until the Nvidia GT300 comes out. Nor do we have one that can access system memory and support the paged virtual memory model needed to boot a modern OS, and not even the GT300 will be able to do that.

On the other hand, any attempts by CPUs to handle real-time 3D graphics at speed have, up to now, failed miserably. So CPUs have not replaced GPUs, nor have GPUs come much closer to replacing CPUs. How far along is each camp now? And how will they perform within our crystal ball's prediction horizon of, say, six months from now? Let's look at the favourite common metric between the two camps: the peak floating-point rate in GFLOPS, in double precision of course.

CPU speedups

Intel's 32nm Westmere 6-core chip is the next major step in Chipzilla's roadmap. The flagbearer Westmere, in its Gulftown-EP dual CPU configuration and, a month or two later, single CPU desktop configuration, will provide 50 per cent more cores matched by 50 per cent more L3 cache at 12 MB, an improved memory controller able to support DDR3-1600MHz even as server memory by default, all within roughly the same clock speed range and die size as the current Nehalem chips.

If running at the standard non-Turbo mode 3.33GHz, the single Gulftown CPU will give you 80 GFLOPS of raw double-precision floating-point power, or twice that, 160 GFLOPS, in a dual-CPU workstation configuration. I do expect 3.6GHz parts to appear in the Gulftown stable too, before mid-2010.

While this sounds far below top numbers for the current GPUs, keep in mind this is fully general purpose floating-point for any application out there, today or tomorrow. No fancy programming tricks or new code needed. And, as usual, don't be surprised to see most of these Gulftowns doing well at north of 4GHz with even simple overclocking. How about 200 GFLOPS in a dual processor workstation by your deskside? With a grand total of six channels of DDR3-1600MHz server ECC memory, or DDR3-2000MHz desktop memory, these CPUs shouldn't be waiting for main memory data for too long.

AMD's Magny Cours, as two Istanbul dies in a single chip package, will at the same time pack twelve slower cores, probably not clocked higher than 2.4GHz to start. These two dies together will pack the same amount of total L3 cache as a single Intel Gulftown, but in theory will still be able to churn out over 110 GFLOPS of total peak double-precision floating-point power, or around 55 GFLOPS per die. Hopefully in the same time frame AMD will be able to speed up the single die Istanbul to above 3GHz, especially for the eventual desktop version.
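The CPU peak figures quoted above follow from simple arithmetic: cores times clock times double-precision operations per core per clock. Here is a minimal sketch, assuming 4 double-precision FLOPs per core per cycle (one 128-bit SSE add plus one 128-bit SSE multiply). The article does not state that figure, but it reproduces the numbers quoted. The function name is purely illustrative.

```python
# Peak double-precision rate = cores x clock (GHz) x FLOPs per core per cycle.
# ASSUMPTION: 4 DP FLOPs per core per cycle; the article does not give this figure.
DP_FLOPS_PER_CYCLE = 4


def cpu_peak_gflops(cores, clock_ghz, sockets=1, flops_per_cycle=DP_FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOPS; real code rarely sustains this."""
    return cores * clock_ghz * flops_per_cycle * sockets


configs = {
    "Gulftown, 1 x 6 cores @ 3.33GHz": cpu_peak_gflops(6, 3.33),
    "Gulftown, 2 x 6 cores @ 3.33GHz": cpu_peak_gflops(6, 3.33, sockets=2),
    "Gulftown, 2 x 6 cores @ 4.0GHz": cpu_peak_gflops(6, 4.0, sockets=2),
    "Magny Cours, 12 cores @ 2.4GHz": cpu_peak_gflops(12, 2.4),
}

for name, gflops in configs.items():
    print(f"{name}: {gflops:.0f} GFLOPS")
```

Under that assumption the four configurations come out at roughly 80, 160, 192 and 115 GFLOPS, in line with the rounded figures in the text.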

The real floating-point throughput advances for both Intel and AMD CPUs should be seen in their end-2010 next generation cores, the Sandy Bridge for Intel - yes, the one with an integrated on-die GPU for some flavours - and the Bulldozer for AMD. Both should offer double the peak double-precision floating-point performance throughput per clock, enabling roughly 200 GFLOPS peak number munching power in a 4GHz, 12-core dual chip workstation setup, for example.

GPU advances

On the other hand, GPU priorities are a little different. Multiplying the thread counts and processing unit numbers here was more important than the power of each processing unit within the GPU, as the typical graphics pipeline is far more predictable and more parallel than most tasks run on general purpose CPUs. So, whether it is an AMD/ATI HD5870 with 1,600 simple shaders in parallel or an Nvidia GT300 with 512 more complex, more CPU-like shader cores, a GPU looks very different from the 4 to 8 CPU cores on a processor die.

Then, despite the four times slower average clock speed for the core, or three times for the shaders, versus the standard CPUs, the vast parallelism of GPUs allows far higher theoretical computational power. When it comes to the double-precision floating-point throughput we discussed before, let's look at what AMD/ATI and Nvidia might have in a few months, in the same timeframe as Intel's Gulftown and AMD's Magny Cours.

On the ATI side, a speed update for the HD5870, probably something called HD58X0, should be there with the refinement and stepping updates of the R800 family dies. If running at a default 950MHz GPU and proportionally sped up shaders, the new device should reach 3 TFLOPS in single-precision floating-point and, more importantly, 600 GFLOPS in double-precision floating-point, both IEEE compliant. In fact, some of the overclockable HD5870 entries, like those from Asus, already provide such speeds.
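The ATI numbers can be checked the same way. This is a rough sketch, assuming each of the 1,600 shaders retires one multiply-add (2 FLOPs) per clock in single precision and that double precision runs at one fifth of that rate. Both ratios are assumptions chosen because they match the figures quoted, not numbers taken from the article, and the function name is made up for illustration.

```python
SHADERS = 1600          # shader count quoted in the article
FLOPS_PER_SHADER = 2    # ASSUMPTION: one multiply-add per shader per clock
DP_RATIO = 1 / 5        # ASSUMPTION: double precision at one fifth the single rate


def radeon_peak_gflops(clock_ghz, gpus=1):
    """Return (single, double) theoretical peak in GFLOPS."""
    single = SHADERS * FLOPS_PER_SHADER * clock_ghz * gpus
    return single, single * DP_RATIO


sp, dp = radeon_peak_gflops(0.95)
print(f"950MHz: {sp / 1000:.2f} TFLOPS single, {dp:.0f} GFLOPS double")
```

At 950MHz this gives about 3.04 TFLOPS single and 608 GFLOPS double for one GPU. Passing `gpus=2` simply doubles both figures.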

So, if your code can run efficiently with AMD Stream libraries and such, a hypothetical dual-GPU HD58X0 card will likely give you 1.2 TFLOPS of double-precision floating-point power for precision runs, and 6 TFLOPS of single-precision floating-point for parametrisation and estimation runs. Now, just make sure there is enough memory in there to hold the data sets of multiple threads without running over the PCIe bus to the main memory, as, given the limited GPU caching, the slow link can cut the performance by as much as an order of magnitude. Therefore, 2GB of GDDR5 memory per GPU is strongly recommended if you are doing GPU computation.
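To see why the PCIe link hurts, compare how long a data set takes to stream from on-card memory with how long it takes over the bus. This is a back-of-envelope sketch only. The bandwidth figures (about 150 GB/s for GDDR5 on a card of this class, about 8 GB/s per direction for a PCIe 2.0 x16 slot) are assumed round numbers, not figures from the article, and the helper name is invented.

```python
# ASSUMED round-number bandwidths in GB/s; not figures from the article.
GDDR5_GBPS = 150.0      # on-card memory of an HD5870-class GPU
PCIE2_X16_GBPS = 8.0    # PCIe 2.0 x16, one direction, theoretical


def stream_seconds(data_gb, bandwidth_gbps):
    """Time to read data_gb gigabytes once at the given bandwidth."""
    return data_gb / bandwidth_gbps


data_gb = 2.0  # a working set that just fits in 2GB of card memory
local = stream_seconds(data_gb, GDDR5_GBPS)
over_bus = stream_seconds(data_gb, PCIE2_X16_GBPS)
print(f"local: {local * 1000:.1f} ms, over PCIe: {over_bus * 1000:.1f} ms, "
      f"slowdown: {over_bus / local:.1f}x")
```

With these assumptions a purely bandwidth-bound kernel would run well over ten times slower when fed across the bus, which is consistent with the order-of-magnitude penalty mentioned above.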

By early next year, we all hope that Nvidia's GT300 will already be launched and shipping, because if it isn't, that will be big trouble for the green graphics gang. Let's assume it is. With 512 shader processors that can do either 512 single-precision or 256 double-precision fused multiply-adds per clock, that would, at a shader clock of, say, 1.8GHz, give you 1.8 TFLOPS in single-precision mode or 900 GFLOPS in double-precision mode. Not bad at all.
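The GT300 estimate is the same kind of sum, counting each fused multiply-add as two floating-point operations. This is a small sketch using only the numbers in the paragraph above. The 1.8GHz shader clock is the article's own guess rather than a confirmed specification, and the function name is illustrative.

```python
FLOPS_PER_FMA = 2  # a fused multiply-add counts as two floating-point operations


def fma_peak_gflops(fma_per_clock, shader_clock_ghz):
    """Theoretical peak in GFLOPS for a given FMA issue rate and shader clock."""
    return fma_per_clock * FLOPS_PER_FMA * shader_clock_ghz


SHADER_CLOCK_GHZ = 1.8  # hypothetical clock taken from the text, not a spec

single = fma_peak_gflops(512, SHADER_CLOCK_GHZ)
double = fma_peak_gflops(256, SHADER_CLOCK_GHZ)
print(f"single: {single / 1000:.2f} TFLOPS, double: {double:.0f} GFLOPS")
```

The results, about 1.84 TFLOPS and 922 GFLOPS, round to the 1.8 TFLOPS and 900 GFLOPS quoted.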

But what's far more interesting is that the GT300 promises to enable a far greater range of codes to make use of all that power. With an overall architecture far closer to a CPU this time, many normal C, C++ and Fortran codes should be able to run on it out of local GPU memory. With up to 6GB of onboard memory in the first iteration, and 8GB in the subsequent one, the latter with a 512-bit memory bus, the GT300 should be quite a bundle.

What the GT300 lacks to be a true CPU and run all the usual stuff, including booting an OS, is a full-fledged memory management unit (MMU) for virtual to physical memory translation, and a front-end general purpose CPU instruction set. That's why I have said many times that Nvidia should have had a real CPU of its own, like, say, the Alpha, which would provide ultrahigh performance, better than the X86, to fill that niche, and also offer the built-in capability to run X86 code very fast via a real-time translator like the famed FX!32, without having to pay for an X86 license.

Don't forget that the last planned Alpha incarnation, the EV9 21564, was supposed to have a kilobyte-wide (yes, 8,192 bits) vector unit able to put out over 100 GFLOPS in double-precision floating-point, some 9 years ago. Imagine what it would be able to achieve today.

The Tegra and other ARM-based stuff is simply too weak to be a front end for a gigantic TFLOPS-class GPU. For a proper "fusion" at the system level, you need very fast and wide main system memory, a multi-channel multi-gigabyte setup at least, to feed it from the CPU side, and very fast multiple HyperTransport or QuickPath or Alpha EV links to connect multiple GPUs with the main CPUs for efficient coherent shared memory access between GPU and CPU memory banks. In the absence of a general purpose CPU that's able to do this, Nvidia might have to negotiate a QPI license with Intel to directly link its GPUs to the Westmere and future CPUs, in order to enable more of the coprocessor model here. But wait, wouldn't the long delayed Larrabee be gunning for the same role?

I'll have more on this, and the 'ideal' CPU-GPU system configuration, in Part 2. µ

 


Comments
Fission

Fusion of the two at this point could only possibly result in severely worsened performance.

posted by : Baronofcheese, 30 October 2009
You're missing the point of the problem

Convergence between general purpose CPUs and GPUs is largely irrelevant because they deal with different 'problems' and it is those 'problems' that cannot be unified or converged.

Most of what the OS and typical desktop applications do is intrinsically sequential in nature and cannot be effectively parallelised. It's only when you're running workloads that are suited for parallelisation that many cores make sense and for this type of job you don't need a full instruction set: indeed, incorporating a full instruction set in the MPP hardware in GPUs is not only pointless but would require more silicon real estate, lowering the number of 'useful' cores that can be implemented.

Being able to use the MPP hardware in GPUs for _any_ type of MPP problem, instead of just graphics rendering, is useful, but adding all the extra stuff so that each core becomes a general purpose CPU is counter productive.

posted by : LeeE, 30 October 2009
Fusion

@LeeE
I agree, which is exactly why M-Space makes me excited.

I think there is a middle road. While the many-small-unbranchy-processors route is the most efficient for certain problem sets, there is a trade-off to be made in terms of development effort for those types of processor.

The current trend seems to be towards moving GPU processors to a certain level of programmability, where they become less challenging to code for, but retain their parallel advantage.

In other words, we're currently at a point where it makes sense to spend transistors on increasing "codeability" rather than the purest parallel-graphics performance. This has the advantage of bringing huge performance increases to certain types of problem that would otherwise never have been coded for these parallel architectures at all.

So, yes, there is a convergence, but only to a point. The monolithic core isn't going anywhere either: we'll always have some branchy, unparallel code.

The question is how much.

posted by : Benji, 30 October 2009
Nebojsa Novakovic=Charlie....

Nebojsa Replace Charlie? Hummm. CPU Parts are like TOY Chest. Grab heart out, SomeLungs & little Larnyx & In Business.

Obviously, if CPU where SEX toy, It'd Be Perfected by Now. O.K., Heres How for REAL.

Teletransport todays Chips BACK In Time, Say 50 years Ago. then that Changes todays Chips into Better chips, as timeline is speed up. Repeat Until ULTEE' RULES.

drashek

posted by : Sliding Point rules...., 30 October 2009
How many iterations

You're putting a great many forks in the road, Nebojsa Novakovic.

Shouldn't someone be concentrating on how best to execute a present code-stream on any given system configuration?

I'll try it this way and then that way until I get the right way. But will it still be the right way tomorrow?

The breakthru will be an optimising comparator-translator and compiling operations architecture.

Is there any word yet on the positronic brain headers found in Roswell?

posted by : For what does it profit?, 30 October 2009
i dont care for this...

i want my jaggies as smooth as butter on my 37 inch screen before this stuff. .<

posted by : super dude, 30 October 2009
GPU GFlops

I tried a stream (ATI's GPU math acceleration thing) app the other day and got 550GFlops on my 'old' HD4850

Application I refer to:
http://galaxy.u-aizu.ac.jp/trac/note/wiki/Astronomical_Many_Body_Simulations_On_RV770#DemoProgram

Just to show you what a GPU can do.

posted by : W.-, 30 October 2009
Heat

Tell me about the heat, how do you deal with the heat, more power and GFLOPS the more heat, how would they run at 40c ambient temp?

posted by : Ed, 31 October 2009
No thanks.

With all that extra heat, we'll end up with fewer CPU cores or less clock speed. And how do you arrange them, anyway once you start to scale up? 1 large gpu on the side + 4 CPU cores in a square? 2x2 with one of the cores a GPU? 4x2 with two of the cores GPUs? 3/3? We may end up sacrificing cores for GPU die space.

What if I want high CPU power, but no GPU power at all? Won't all the high end consumer CPUs eventually have GPUs in them? And will we end up with a driver headache when the big idea of hybrid card/built-in/GPU-on-chip power comes around?

posted by : Mark Green, 31 October 2009
Nviadia? Nvidiarm?

I've been saying for years that Nvidia should buy VIA.
NVIADIA ;)
However, a new possibility appears to be coming of age.
All of the ingredients are there, maturing like wine, till they might just be able to make a nice meal of it.

ARM.
Sure, it's not that fast, but it's a CPU architecture with lineage. Developers have knowledge of it, and it has market share.

Now, team Nvidia up with some developers to create an all encompassing open API and the suite to go with it that is directly competing with DirectX. Think OpenGL + sound + input.

Who benefits?
Every smartphone maker.
Apple OSX + iphone. (games on a mac?! *gasp*)
Google Android.
Nintendo.

Want convergence?
Make the API.

Let's face it. Most of us don't need what the new x86s have to offer anymore.

Of course, Nvidia could simply add ARM into their core design. This would be especially interesting for netbooks that are already turning towards ARM.

If the chipset itself was a processor, then it could operate seamlessly in low power ARM mode and switch to X86 mode when needed - or even on budget models, not have the option for x86 at all.
With virtualisation having already been mostly mastered, there's nothing stopping this.

Forget virtualising OSes. Imagine Alt-Tabbing between CPU architectures!

This is already somewhat in motion.

There are ARM-based NICs that can download torrents while the x86 motherboard is off.

It has another interesting possibility.
Security.
The ARM core would be invulnerable to x86 viruses. ARM mode could provide fool-proof virus scanning, firewall, etc.

What if, instead of virtualising the browser to protect the OS, you just get the browser to open in the ARM OS?

Nvidia have a great opportunity here.

With phones and consoles and netbooks already using or heading that way, they could use ARM to flog their graphics/chipsets.

posted by : myne, 01 November 2009
More precision

I don't know if I qualify as a real HPC user, but what I need is hardware support for higher precision computations. All the single and double performance increases are meaningless to me when I have to rely on code and tricks that are an order of magnitude slower than hardware. 128 bit would be nice, 256 bit precision would be better.

posted by : node, 01 November 2009
What about Larrabee??

Hello? Can someone tell me how Intel's Larrabee fits in this picture and when it was supposed to be out?

posted by : Olternaut, 02 November 2009