The Inquirer-Home

AMD, Intel are come-back kids with X86 vectorisation

Plus ca change, cache no questions
Thu Oct 26 2006, 09:11
A LONG time ago - but probably not in a galaxy far away - there were many vector supercomputers with extremely parallel floating point engines, often executing more than 32 FP ops per CPU core per cycle.

Remember the Cray 3 (very dead) and NEC SX series (still alive)? These were and are big systems, both in size and price. However, the basic benefit of vector processing - those parallel FP engines - has found its way into the microcomputer mainstream. How?

Three main approaches: via custom specialised accelerators (all those FPGAs, Clearspeed, RIKEN MD-GRAPE and similar add-ons), via CPU extensions - such as the Alpha EV-9 proposal in 2001 - and, of course, through the GPUs from both ATI and Nvidia.

This coming year, the buzz will be back on this front - and it is going far beyond the old MMX/SSE-like SIMD approach, where a single instruction operates on up to four sets of operands. Now we are talking about at least 16 parallel 64-bit FP units in the proposed vector engine of the next Intel core, or 64 pipelines in, say, the ATI R600 GPU, usable as an FPU too.

What would such a vector unit bring to, say, a 4GHz follow-on core in a Nehalem CPU two years down the line? Sixteen multiply-adds per cycle, each counting as two FP ops, at 4GHz gives you 128 GFLOPs peak FP throughput, or 12 times the per-core throughput of the current 2.66GHz Kentsfield processor.
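The peak figures above can be checked with a few lines of arithmetic. This is only a sketch: the function name `peak_gflops` is made up for illustration, and the Kentsfield figure of four double-precision FP ops per core per cycle (one 128-bit SSE add plus one 128-bit SSE multiply) is an assumption that the article does not state.

```python
def peak_gflops(units: int, flops_per_unit_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPs: units x FP ops per unit per cycle x clock in GHz."""
    return units * flops_per_unit_per_cycle * clock_ghz


# Figures from the article: 16 units, a multiply-add counted as 2 FP ops, 4GHz.
vector_core = peak_gflops(16, 2, 4.0)

# Assumption (not from the article): one Kentsfield core retires
# 4 double-precision FP ops per cycle at 2.66GHz.
kentsfield_core = peak_gflops(1, 4, 2.66)

speedup = vector_core / kentsfield_core

print(f"hypothetical vector core: {vector_core:.0f} GFLOPs peak")
print(f"Kentsfield core:          {kentsfield_core:.2f} GFLOPs peak")
print(f"ratio:                    {speedup:.1f}x")
```

Under that assumption the ratio comes out at roughly 12, which matches the number quoted in the text.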

Sounds fantastic! You could now do gaming physics simulations and even true real-time 3-D ray tracing within the CPU. But wait, how do you feed the unit with sufficient data to come anywhere close to that peak throughput? Assume each operation (A x B + C = D) needs three 64-bit operand inputs and produces a 64-bit result, per cycle (the result comes out later, but assume that the result output from the previous operation is pipelined with the operand inputs for the next operation).

That is, well, 256 bits (32 bytes) per operation, multiplied by 16 for the number of parallel units. In total, four kilobits, or 512 bytes, per cycle. Multiply this by four billion cycles per second, and you get a really cute number: two TERABYTES PER SECOND of cache bandwidth required to feed this!

The Alpha EV-9 "Tarantula" architecture in 2001, some of whose creators now labour at Intel on these vector things, envisioned a nifty solution: aside from small CPU-oriented general purpose L1 and L2 caches, the CPU could have a 16 megabyte L3 cache, whopping for that time, feeding the vector unit directly with bandwidth matching the peak FP throughput of the units. Yes, the latency of this cache would be higher, but that is OK for the streaming, heavily in-order vector FP operations on large arrays of data.
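The feed-rate arithmetic above (32 bytes per operation, 16 units, 4GHz) can be written out as a small sketch. The function and parameter names are illustrative only, and the model simply counts three inputs and one output per multiply-add with no operand reuse, as the article assumes.

```python
BYTES_PER_OPERAND = 8  # one 64-bit floating point value


def feed_bandwidth_bytes_per_s(units: int, clock_hz: int,
                               inputs: int = 3, outputs: int = 1) -> int:
    """Bytes per second moved when every unit issues one A x B + C = D per cycle."""
    bytes_per_op = (inputs + outputs) * BYTES_PER_OPERAND  # 32 bytes
    bytes_per_cycle = bytes_per_op * units                 # 512 bytes for 16 units
    return bytes_per_cycle * clock_hz


required = feed_bandwidth_bytes_per_s(units=16, clock_hz=4_000_000_000)
print(f"{required / 1e12:.3f} TB/s")
```

Any register-level reuse of operands would lower the real requirement, so the two terabytes per second is best read as a worst-case streaming figure.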

Something like that could be applied to a future X86 CPU with such a vector unit. The drawback is that, for applications that thrash the cache, external memory bandwidth becomes the limit on achievable performance. Let's say that a Nehalem CPU has an integrated memory controller with dual-channel DDR3-1600 memory (25.6 GB/s peak bandwidth). That's still nearly two orders of magnitude less than the throughput required to feed that vector unit continuously - and that is the vector unit of just one core. After all, Nehalem is supposed to have at least four cores per die, or half a teraflop peak.

If, as some rumours say, Intel goes back to Rambus ultimately, for the technically excellent XDR memory, then we could talk about roughly 100 GB/s of main memory bandwidth for the same pin count as DDR3 gives us in 2008. That, together with further advancements in prefetching techniques, could make more of the vector unit's potential realisable in practice, across a wider range of apps - both cache-friendly and memory-intensive. The idea is to get close to the maximum for at least some programs, and at least a twofold speedup over the normal FPU for many more general apps.
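To put the memory numbers in perspective, here is a rough sketch of how much of the 128 GFLOPs peak one core could sustain if every operand had to stream from main memory. It assumes no cache reuse at all, which is the cache-thrashing worst case rather than a prediction, and the function names are made up for illustration.

```python
REQUIRED_GBS = 2048.0    # 512 bytes per cycle at 4GHz, from the article
PEAK_GFLOPS = 128.0      # one hypothetical vector core, from the article


def sustainable_fraction(memory_gbs: float, required_gbs: float = REQUIRED_GBS) -> float:
    """Fraction of peak reachable when memory is the only source of operands."""
    return min(1.0, memory_gbs / required_gbs)


def sustained_gflops(memory_gbs: float, peak: float = PEAK_GFLOPS) -> float:
    """Memory-bound throughput under the no-reuse assumption."""
    return peak * sustainable_fraction(memory_gbs)


for name, gbs in [("dual-channel DDR3-1600", 25.6), ("rumoured XDR", 100.0)]:
    frac = sustainable_fraction(gbs)
    print(f"{name}: {frac:.2%} of peak, about {sustained_gflops(gbs):.2f} GFLOPs")
```

The gap between these figures and the peak is what the large cache and the prefetching improvements mentioned above would have to cover.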

How does it compare to the GPU approach, whether Nvidia or ATI, but obviously pushed first by the AMD+ATI combo? Well, first of all, GPUs do have many, many more parallel pipelines than CPUs - but at much lower clock speeds. By late 2008, I believe that most high-end GPUs from Nvidia or ATIMD will have at least 256 unified pipelines, each running at around a gigahertz (and taking a kilowatt of power, too?). If each does a multiply-add, you again come to the same 512 GFLOPs peak number as in the hypothetical four-core vectorised Nehalem case above.

Now, if you have a GPU connected to your CPU via HyperTransport 3 or CSI, you can get a 40+ GB/s connection to the CPU and its memory array, while at the same time having your own graphics memory system of, say, 2 GB at well over 150 GB/s if using GDDR4 by then, or over 250 GB/s if using whatever XDR is available by then. After all, GPU memory buses can be up to 8 times faster bandwidth-wise than the CPU ones - compare, say, a February 2007 situation with a 512-bit 2 GHz ATIMD R600 GDDR4 memory (128 GB/s peak) vs the best overclocked ATIMD Athlon64 FX case with 128-bit DDR2-1200 (19.2 GB/s peak), and it speaks for itself.
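The bus comparison above follows from bus width times effective transfer rate. A minimal sketch, with a hypothetical helper name and the article's own bus figures, reading "2 GHz" as two billion transfers per second:

```python
def peak_bandwidth_gbs(bus_width_bits: int, giga_transfers_per_s: float) -> float:
    """Peak bandwidth in GB/s: bytes per transfer x billions of transfers per second."""
    return (bus_width_bits / 8) * giga_transfers_per_s


# Figures from the article.
r600_gddr4 = peak_bandwidth_gbs(512, 2.0)    # 512-bit bus at 2 GT/s effective
athlon_ddr2 = peak_bandwidth_gbs(128, 1.2)   # dual-channel DDR2-1200

# The same multiply-add counting as on the CPU side: 256 pipelines x 2 ops x 1GHz.
gpu_peak_gflops = 256 * 2 * 1.0

print(f"R600 GDDR4:      {r600_gddr4:.1f} GB/s")
print(f"Athlon64 FX:     {athlon_ddr2:.1f} GB/s")
print(f"bandwidth ratio: {r600_gddr4 / athlon_ddr2:.1f}x")
print(f"GPU peak:        {gpu_peak_gflops:.0f} GFLOPs")
```

For this particular pair the ratio works out to a little under seven, inside the "up to 8 times" range quoted in the text.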

So, GPUs can in theory have way higher total bandwidth available to them to feed all those processing units - and, with their huge transistor budgets combined with slower clock rates, it is easier to deploy a wide, not too fast, very large cache on chip to act as a full-speed buffer between the processing units and external memory, feeding them with streaming data at peak throughput.

Problems? An obvious one - the coprocessor model is long gone from the PC programming world, so the ability to program an external coprocessor as an extension of the CPU instruction set needs to be promoted all over again. You may have to rely on dedicated libraries to offload the work from CPU to GPU instead of a direct extension of the CPU instruction set, especially if the GPU doesn't share a common memory management model with the CPU.

All this means a need for two separate sets of binaries - one for systems without GPU co-processing, another for those with it. Also, what if one system uses Nvidia, the other ATI? Yet another new binary required.
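The library-offload model described above can be pictured as a thin dispatch layer that picks an accelerated kernel when one is registered and falls back to the plain CPU path otherwise. Everything in this sketch is hypothetical: the backend names, the registry and the `multiply_add` wrapper are invented for illustration and do not correspond to any real vendor library.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Kernel = Callable[[Vector, Vector, Vector], Vector]


def fma_cpu(a: Vector, b: Vector, c: Vector) -> Vector:
    """Plain CPU fallback: d[i] = a[i] * b[i] + c[i]."""
    return [x * y + z for x, y, z in zip(a, b, c)]


# Hypothetical registry; a vendor-specific library would add its own kernel here.
_BACKENDS: Dict[str, Kernel] = {"cpu": fma_cpu}


def register_backend(name: str, kernel: Kernel) -> None:
    """Make an offload kernel available under the given (made-up) backend name."""
    _BACKENDS[name] = kernel


def multiply_add(a: Vector, b: Vector, c: Vector,
                 preferred: Sequence[str] = ("gpu_vendor_a", "gpu_vendor_b")) -> Vector:
    """Use the first registered preferred backend, else fall back to the CPU."""
    for name in preferred:
        kernel = _BACKENDS.get(name)
        if kernel is not None:
            return kernel(a, b, c)
    return _BACKENDS["cpu"](a, b, c)


print(multiply_add([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))
```

A layer like this only hides the choice from the application; each vendor-specific kernel still has to be built and shipped separately, which is the duplication the article complains about.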

In summary, whether for gaming physics, black hole research or financial simulations, a combination of multi-core and vector processing will bring PCs close to teraflop performance, and most probably across the teraflop peak speed barrier by 2010 - whether Intel decides to introduce the feature earlier in Nehalem, or waits till Gesher in that year. In a sense, both CPU and GPU approaches can be combined anyway, as they don't exclude each other. In the end, that CPU vector unit could become the core of its on-chip high-end GPU too, couldn't it? µ
