The Inquirer

HyperTransport co-processors take us back to the 8087

Bolt in, bolt on, or bolt hole
Sun Mar 26 2006, 08:54
THE SON of the Alpha EV7 bus, known these days as HyperTransport, aimed to provide versatile functionality from day one: cache-coherent NUMA-like SMP (but without most of NUMA's latency penalties), high-performance I/O, and "building engineering" features like tunnels and bridges (only towers are missing for now).

The current HT 2.0 version gives you up to 22.4 GB/s in a maxed-out 2x32-bit, 1400 MHz DDR implementation (although no one has actually built that top grade), and the upcoming HT 3.0 should double that.
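The 22.4 GB/s figure falls straight out of the link parameters. A quick sanity check of the arithmetic, assuming the usual convention of counting both directions of the bidirectional link:

```python
# HT 2.0 peak bandwidth sanity check (aggregate of both link directions).
clock_mhz = 1400
transfers_per_sec = clock_mhz * 1e6 * 2    # DDR: two transfers per clock
link_width_bits = 32
directions = 2                              # an HT link carries traffic both ways

bytes_per_sec = transfers_per_sec * (link_width_bits / 8) * directions
print(bytes_per_sec / 1e9)  # -> 22.4 (GB/s)
```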

Recently, two new implementation opportunities have emerged for HT. One, the HTX I/O slot, is by now well covered and could, in fact, see further use as a direct external HT cable connection between systems: say, a NUMA-like link (or something like the IBM x460's) joining two or more 8-socket Opteron boxes into a single larger NUMA system image, with a "primary" boot box controlling the "secondary" ones and a single OS memory space throughout. Need more bandwidth? Simply use two or more connections in a "multi-channel" configuration, each channel ideally on a different CPU in each machine, to lower the maximum inter-CPU latency.

Of course, AMD was way too stingy with the number of HT channels in its Opterons - I'd put at least four in the 800 series - so the "tunnel" approach may have to be used instead, where the HTX peripheral (maybe Quadrics QsNet III, or the NUMA cabling) sits between the CPU and, say, the I/O bridge on the same HT channel, acting as a low-latency HT "tunnel" to the I/O bridge as well.

More interesting, though, is the "coprocessor" approach. As our Charlie Demerjian mentioned, both the FPGA gang and FP coprocessor entrants like ClearSpeed are courting AMD here. After all, HT-based FPGA acceleration with direct system memory access is nothing new - Cray XD1 machines, based on the OctigaBay design, have had it for two years already. In a way, it takes us back to the old 8086/8087 pairing of 25 years ago, where the FP coprocessor took over the floating-point instructions and could access memory too (the later 80287 couldn't, for some reason - it had to go through the MMU on the 80286). In this case, too, the coprocessor would have its own instruction set "extensions" of some kind, which would, of course, require recompilation to support.

Now, the difference here is that the Opteron already has SIMD FP built in, and with K8L it will double peak throughput to four double-precision FP ops per clock (same as the Woodcrest Xeon) - i.e. a 3 GHz Opteron or Woodcrest will give you 12 GFLOPs Rpeak per core (24 GFLOPs Rpeak per dual-core chip), and at least half that as Rmax in the Linpack FP benchmark - all while sticking to the now-standard 64-bit SSE2 or SSE3.
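The per-core figure is just ops-per-clock times frequency; a minimal sketch of that arithmetic, using the article's own assumptions (four DP ops per clock for K8L, Rmax at half of Rpeak):

```python
# Peak DP FLOPs for a hypothetical 3 GHz K8L-class Opteron.
clock_ghz = 3.0
dp_ops_per_clock = 4       # doubled SSE throughput claimed for K8L
cores = 2

rpeak_per_core = clock_ghz * dp_ops_per_clock   # GFLOPs per core
rpeak_per_chip = rpeak_per_core * cores         # GFLOPs per dual-core chip
rmax_estimate = rpeak_per_chip / 2              # "at least half that" in Linpack
print(rpeak_per_core, rpeak_per_chip, rmax_estimate)  # -> 12.0 24.0 12.0
```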

So, if it requires recompilation, the new coprocessor had better deliver an order of magnitude better performance to justify the effort in both hardware and software. OK, let's make it half that, i.e. five times the performance. Why? Well, if the speed-up is less - say, twice - the user might simply add twice as many CPUs to reach the required speed, without having to mess with new code compilation and all the associated tuning and validation headaches - and that is if he HAS the source code. Commercial app availability for a fast but incompatible platform is a nightmare - Alpha and Itanium owners know this well at first hand.

The current ClearSpeed CSX600 is claimed to sustain 25 GFLOPs, about three times what a currently shipping 2.6 GHz dual-core Opteron manages. Extrapolate to the dual-core 3 GHz K8L Opteron early next year, with something like 20 GFLOPs of obtainable Linpack Rmax from the chip, and apply the five-times rule: a commercially attractive and viable FP accelerator should give you 100 GFLOPs Rmax, coupled with extras like much larger register sets, maybe even vectorisation, plus a large, wide-bus local cache to feed all those FP units.
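Putting the five-times rule into numbers, with the article's own figures as inputs (the threshold itself is the author's rule of thumb, not a measured quantity):

```python
# The "5x rule": what a coprocessor must deliver to justify recompilation.
csx600_gflops = 25.0                       # ClearSpeed's claimed sustained figure
opteron_26_gflops = csx600_gflops / 3      # ~8.3: "about three times" today's Opteron
k8l_rmax_gflops = 20.0                     # projected dual-core 3 GHz K8L Linpack Rmax
speedup_required = 5

target_gflops = k8l_rmax_gflops * speedup_required
print(target_gflops)  # -> 100.0 (GFLOPs Rmax for a viable accelerator)
```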

While architecturally very interesting as a current product, ClearSpeed is still far from the above performance goal, going by the public info. FPGAs may have a somewhat better chance as dedicated accelerators programmable for specific tasks, with ultrafast I/O and system-wide direct memory access possible over HT if the design supports it - even sped-up database or search engine accelerators become possible then. Having said all that, a sped-up version of the CSX600, dropping the local memory in favour of an HT-based "coprocessor" socket and reaching that 100 GFLOPs Rmax, might be a very good revival of the "good old x87 days" - or the "68882 days" for 68030-based Moto Macintosh users. So, time to reintroduce the word "co-processor" again? µ

