However, even then, it is being pushed to its limits speed-wise, partly to feed the hungry quad-core dual-die packages, and partly to answer AMD's memory benchmark wins to a certain extent. After all, it started at FSB400 with the Pentium 4 and has reached FSB1600 with Penryn - and it is expected to run stably well beyond FSB2000 on top-end boards with matching cooling.
As Charlie reported yesterday, Intel has taken another step forward: with its new QuickAssist technology and AAL (Accelerator Abstraction Layer), it has opened up that FSB to others for the first time - partly to offset the AMD Torrenza initiative (PCI-E on Geneseo simply can't do tight X87-like coprocessor-level integration as well as HyperTransport, CSI or plain FSB can - think cache-coherent shared memory), and partly to sell more of those expensive, upcoming Caneland four-socket, four-FSB chipsets.
The Xilinx Virtex 5 high-end FPGA is the first one to have that FSB cell built in, ready to plug into Socket 604 or LGA771. Both the Xilinx and Intel guys at the IDF Beijing booth were confident that FSB1066 is a done deal, but there was no confirmation on whether, say, I could use one at FSB1600 speed levels in the second CPU socket of the dual-FSB Seaburg chipset.
In my mind, there are three possible uses for this socket - knowing full well it's basically a 'pilot' test until the CSI versions arrive some 18 months from now: computation (mainly FP) accelerators, commercial (XML / crypto / search / data mining) accelerators, and communication / interconnect accelerators. An FPGA can't compete against ASIC-level GPUs or Intel's Terascale chip (the latter could be a good candidate to fit inside such an FSB socket), but, being programmable at the gate level, it can offload some specific routines at hardwired speed, and tackle both computation and commercial algorithms more easily.
Now, Intel probably wouldn't want all those FSB-FPGAs to have to copy the whole MMU from, say, Core 2 just to be able to share the paged virtual memory fully - so, to use the shared physical memory efficiently, they need to cut down on unnecessary memory-to-memory copies and map the physical memory space that the FSB-FPGA needs to use, while still sharing it easily with the CPU.
On the communication and interconnect front, imagine, on one side, the whole 40GigE TCP/IP stack with security, and/or Infiniband and, why not, Quadrics shmem virtual cluster-wide shared memory interconnect stacks, fully hardwired around a powerful network processor with modular interface links (CX4, Cat7 or fibre, with protocol autosense depending on the switches used), and, on the other side, an 8.5 GByte/s FSB socket with direct memory access through a compliant MMU (or an alternative memory sharing implementation). Sounds to me like an ideal clustering interconnect engine.
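For a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes the 8.5 GByte/s figure corresponds to FSB1066 on the usual 64-bit FSB data path, and it compares peak raw rates only, ignoring protocol overhead; the function names are made up for illustration.

```python
def fsb_bandwidth_gbytes(mega_transfers_per_s, bus_width_bits=64):
    """Peak FSB bandwidth in GByte/s: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


def link_bandwidth_gbytes(gbits_per_s):
    """Raw line rate of a network link, converted from Gbit/s to GByte/s."""
    return gbits_per_s / 8


if __name__ == "__main__":
    # FSB speed grades mentioned in the article (nominal MT/s values)
    for speed in (400, 1066, 1600):
        print(f"FSB{speed}: {fsb_bandwidth_gbytes(speed):.1f} GByte/s")
    print(f"40GigE: {link_bandwidth_gbytes(40):.1f} GByte/s per direction")
```

On these assumptions, a single 40GigE port at raw line rate would fit within the peak bandwidth of an FSB1066 socket, with some headroom left over.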
The AAL should make programming the FSB-FPGA (or any other attached accelerator) simple, much like programming a coprocessor. Each AFU (Accelerator Function Unit) library implements the set of sped-up functions for a particular accelerator (you can have two or more accelerators, keep in mind), accessed via messages to the accelerator chip. To avoid the MMU trouble, the AAL defines a shared memory block for the application, mapped and locked into user space, where the accelerator chip can read whatever the CPU leaves for it and write the results for the CPU to use back in the main app.
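To make that shared-workspace idea concrete, here is a toy sketch of the pattern. It is not the AAL API: the accelerator is faked with a thread, the "messages" are plain events, and every name in it (`fake_afu_square`, `run_on_accelerator`, the one-page workspace size) is a made-up assumption. Locking the pages in memory, which the real scheme relies on, is not shown.

```python
import mmap
import struct
import threading

WORKSPACE_SIZE = 4096          # one page; an arbitrary size chosen for this sketch
HEADER = struct.Struct("<I")   # header: number of doubles in the payload


def fake_afu_square(workspace, request_ready, result_ready):
    """Hypothetical stand-in for an accelerator: squares each value in place."""
    request_ready.wait()                        # "message" from the CPU side
    (count,) = HEADER.unpack_from(workspace, 0)
    fmt = f"<{count}d"
    values = struct.unpack_from(fmt, workspace, HEADER.size)
    struct.pack_into(fmt, workspace, HEADER.size, *(v * v for v in values))
    result_ready.set()                          # "message" back to the CPU side


def run_on_accelerator(values):
    """CPU side: leave the input in the workspace, signal, wait, read the results."""
    workspace = mmap.mmap(-1, WORKSPACE_SIZE)   # anonymous mapping; NOT pinned here
    request_ready = threading.Event()
    result_ready = threading.Event()
    afu = threading.Thread(
        target=fake_afu_square, args=(workspace, request_ready, result_ready)
    )
    afu.start()

    fmt = f"<{len(values)}d"
    HEADER.pack_into(workspace, 0, len(values))
    struct.pack_into(fmt, workspace, HEADER.size, *values)
    request_ready.set()                         # hand the block to the "accelerator"

    result_ready.wait()                         # results were written in place
    results = list(struct.unpack_from(fmt, workspace, HEADER.size))
    afu.join()
    workspace.close()
    return results


if __name__ == "__main__":
    print(run_on_accelerator([1.0, 2.0, 3.0]))
```

The point of the pattern is that the data never gets copied into a separate device buffer: both sides work on the same block, and only small notifications travel between them.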
So, what's the benefit, besides bringing back the decade-dead X87 / Weitek / Cyrix days of coprocessors competing for your attention? Well, you might be able to extend the "core functionality" of the system with many more tightly-coupled co-processor functions - whether you want an extra teraflop on the side, AGEIA PhysX on a chip, hardwired Micro$oft Office routines in a giant bloated FPGA to finally run it smoothly, or a superfast interconnect for 100 nodes that deals with each node at FSB (not PCI-E) latency and bandwidth, while offering true NUMA-style shared memory all around.
Also, this exercise will extend the useful life of this FSB by maybe an extra year, bringing a potential rejuvenation of some multi-FSB platforms. At the same time, it is an excellent pilot run for CSI-based accelerators, where you won't need to sacrifice CPU sockets to plug in the accelerators - simply add extra CSI sockets for them, as there will always be spare high-speed links available. Intel needs this if it wants to have ready-to-roll stuff when CSI comes up to fight HyperTransport.
In the meantime, for Intel, the new Caneland 4-way chipset platform suddenly gets much more potential use than just hosting four Tigertons on individual FSBs. For instance, you could have two Clovertowns and two accelerators sitting on a single Caneland, sharing the huge common memory (did anyone say 256 GB in 32 sticks of 8 GB FB-DIMMs? And 200 W just for that memory?) - this could bring the chipset right into the high-end workstation and computation area, beyond niche high-end commercial servers alone. And it is an expensive chipset, so surely it's good for Intel to sell a few more?