Nehalem platform Multi GPU options
Post-Nvidia, pre-Larrabee...
THE JUST FINISHED Nvision marked the end of Nvidia's high-end chipset platform endeavours, too.
The last day, with its X58 Tylersburg chipset SLI support announcement via paid BIOS "keys", was the fitting end to it.
However, are / were Nvidia's chipsets really that bad? Yes, the recent Nforce generations created enough heat to boil quite a few eggs, but they also still hold records in many areas.
Nforce 680i remains the highest-performing DDR2 chipset for the 65nm Intel Core 2 generation: clock for clock, it had the fastest memory benchmarks, and the best USB, Ethernet and RAID test results. Too bad Nvidia didn't see heat spreaders as a good idea for their hot north bridges. In the same fashion, the Nforce 790i Ultra still holds its own against the X48, plus adds some nifty PCIe peer-to-peer optimisations as well as simultaneous broadcast from the CPU to both cards - and keeps acting as a spare winter heater, too.
Now all that is seemingly over. We told our readers many months ago that Nvidia had the QPI license for the chipset works, but elected not to use it. In an age where many key chipset differentiators, at least on the north bridge side, have moved into the CPUs, it might actually make sense. Rather than try to differentiate just on the south bridge I/O stuff, which is insufficient to justify their double price premium over similar Intel chipsets, Nvidia just kept the Nforce200 PCIe bridge to enable SLI functionality.
Remember, the basic feature of the Nforce200 is splitting one PCIe x16 v2 channel from the North Bridge into two distinct PCIe x16 v2 channels for the graphics cards, with a bit of extra stuff added to speed up peer to peer comms between those two cards without bothering the system chipset. You can also see this chipset in, say, GX2 graphics cards as well as the Tesla 1U quad-GPU 4 TFLOPs HPC box Nvidia sells to the system integrators.
Like any PCIe bridge - including the PLX ones ATI uses on their X2 - Nforce200 adds latency, definitely not less than 100 ns. While it's not a problem for the comms between the two cards (it may actually be a little faster than the same two cards on the North Bridge), this extra latency affects any exchanges with the rest of the system, like CPU messages or memory access.
Since Nvidia now lets the Nehalem X58 mobo vendors enable SLI at the cost of a "key" - and, who knows, might even let it go free soon - is there any use for the Nforce200? I spoke to a top high-end mobo vendor today, the guy behind some of the leading designs there. The answer is that the added chip is still being considered for one or two designs, depending on Nvidia proving the performance advantage in reality.
The real benefit here is, basically, gaining three full PCIe x16 v2 slots in a Bloomfield X58 system: two off the Nforce200, and another one off the X58 IOH. Since, with the "key", IOH could also support SLI, you could have a native Tri-SLI here, with each card having full x16 v2 slot bandwidth.
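As a rough sketch of what "full x16 v2 slot bandwidth" means, the arithmetic below uses the published PCIe 2.0 signalling figures (5 GT/s per lane, 8b/10b encoding). The helper names are made up for illustration, and the shared-uplink figure simply assumes the two cards behind the bridge split its single x16 uplink evenly when both talk to the host at once. Protocol overhead is ignored, so these are upper bounds, not measurements.

```python
# PCIe 2.0 ("v2") signalling figures: 5 GT/s per lane, 8b/10b line encoding.
TRANSFERS_PER_S = 5.0e9      # raw bit transfers per second, per lane, per direction
ENCODING_EFFICIENCY = 8 / 10  # 8 payload bits for every 10 bits on the wire


def link_gb_per_s(lanes):
    """Peak payload bandwidth of a PCIe 2.0 link, in GB/s per direction."""
    bytes_per_s_per_lane = TRANSFERS_PER_S * ENCODING_EFFICIENCY / 8
    return lanes * bytes_per_s_per_lane / 1e9


def shared_uplink_gb_per_s(cards_behind_bridge, uplink_lanes=16):
    """Assumption: cards behind a bridge split its uplink evenly when all of
    them exchange data with the host at the same time. Card-to-card traffic
    that stays inside the bridge is not counted here."""
    return link_gb_per_s(uplink_lanes) / cards_behind_bridge


if __name__ == "__main__":
    print(f"x16 v2 slot, per direction: {link_gb_per_s(16):.1f} GB/s")
    print(f"host share per card, two cards on one bridge: "
          f"{shared_uplink_gb_per_s(2):.1f} GB/s")
```

Each slot is electrically x16, but in this simple model the two cards on the bridge only see half of the uplink each whenever both are talking to the CPU or memory simultaneously.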
Move over to dual-socket Nehalems, the Gainestown Tylersburg platform, which uses basically the same IOH (yeah, the new moniker for a memory-less North Bridge). Remember that the IOH has two QPI links, so each CPU has a direct QPI link to the IOH there. What if each Gainestown had its own IOH instead, and the spare QPI links on each IOH were used to link them together? Voila, there we get a total of FOUR PCIe x16 v2 slots - fit a 4870X2 in each of them, and you've got a 10 TFLOPs box, at least for the few lucky apps that can use them.
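The 10 TFLOPs figure can be sanity-checked with a few lines of arithmetic. The numbers below are the reference specifications of the Radeon HD 4870 X2 as commonly quoted (800 stream processors per GPU at 750 MHz, two GPUs per card), not figures taken from this article, and the result is theoretical single-precision peak only.

```python
# Assumed reference specs for a Radeon HD 4870 X2 (not stated in the article).
STREAM_PROCESSORS_PER_GPU = 800
CORE_CLOCK_HZ = 750e6
FLOPS_PER_SP_PER_CLOCK = 2   # one multiply-add counted as two operations
GPUS_PER_CARD = 2


def card_peak_tflops():
    """Theoretical single-precision peak of one dual-GPU card, in TFLOPs."""
    per_gpu = STREAM_PROCESSORS_PER_GPU * CORE_CLOCK_HZ * FLOPS_PER_SP_PER_CLOCK
    return GPUS_PER_CARD * per_gpu / 1e12


def box_peak_tflops(cards):
    """Theoretical peak of a box with the given number of cards, in TFLOPs."""
    return cards * card_peak_tflops()


if __name__ == "__main__":
    print(f"one card:   {card_peak_tflops():.1f} TFLOPs")
    print(f"four cards: {box_peak_tflops(4):.1f} TFLOPs")
```

Under these assumptions four cards come to 9.6 TFLOPs, which rounds to the 10 TFLOPs quoted above.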
Don't stop - say, add an Nforce200 on one of them. Now you've got five slots, and a choice of latency and bandwidth options, including SLI and Crossfire support across separate I/O bridges without compromising either's performance. Sounds like a great Jumbo Skulltrail design? Yeah sure. The cost? Uneven I/O latency from each CPU to the various graphics cards depending on which I/O bridge they sit on, and, in return, non-uniform main memory access from the GPUs, depending on the number of QPI hops they have to take.
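To make the uneven latency concrete, here is a toy model that counts link hops from each CPU to each slot. The topology is only one guess at the five-slot layout described above: the node names, the slot numbering, and the choice of which IOH carries the Nforce200 are all hypothetical, and hop count is just a crude stand-in for real latency.

```python
from collections import deque

# Hypothetical dual-IOH board: every pair below is one QPI or PCIe link.
LINKS = [
    ("CPU0", "CPU1"), ("CPU0", "IOH0"), ("CPU1", "IOH1"), ("IOH0", "IOH1"),
    ("IOH0", "slot1"), ("IOH0", "slot2"),    # two x16 slots straight off IOH0
    ("IOH1", "slot3"), ("IOH1", "NF200"),    # one slot plus the bridge on IOH1
    ("NF200", "slot4"), ("NF200", "slot5"),  # two slots behind the bridge
]


def build_graph(links):
    """Turn the list of links into an undirected adjacency map."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hops_from(graph, start):
    """Breadth-first search: fewest links from start to every other node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


if __name__ == "__main__":
    graph = build_graph(LINKS)
    for cpu in ("CPU0", "CPU1"):
        dist = hops_from(graph, cpu)
        slots = {name: dist[name] for name in sorted(dist) if name.startswith("slot")}
        print(cpu, slots)
```

In this layout a card is between two and four hops away depending on which CPU is talking to it, which is the non-uniformity the paragraph warns about.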
In summary, if there is proof positive that it helps SLI performance and doesn't slow down, say, a CrossFire setup in those same slots - and its excessive heat can be managed - Nforce200 might still find some design wins despite this week's about-face. On the high-end Nehalems, though, having two IOH chips might end up being a simpler, faster and quite possibly cheaper way to add extra I/O bandwidth and x16 PCIe slots for large multi-GPU configurations. Intel will surely be more than happy to sell multiple Larrabees to fit in those slots - until a true QPI-based Larrabee coprocessor monster comes along sometime from 2010 onwards... imagine it sitting in instead of one of those IOHs. µ

Comments
It's One Great BIG Freeway.
Remember: the basic feature of the Nforce200 is splitting one PCIe x16 v2 channel from the North Bridge into two distinct PCIe x16 v2 channels for the graphics cards, with a bit of extra stuff added to speed up peer to peer comms between those two cards without bothering the system chipset.
Without bothering the system chipset, one less crash point. Yeah Sure=G-d in Ancient speak, so it must be HOT. Options? Do you want 8 GPUs or 10 GPUs? Think of a roller coaster with switch tracks & Multi Coasters all rolling simultaneously in Four Parks, interconnected.
With Ultimate being larger in bit string length, total leaves room for entire parade to fly about effortlessly. Remember these are used to CREATE Final Contents from its known library, a difficult task at best, instantly. Well, a few 100+ ns off instant.
Taking this complexity to another level is where numbers will go fanatic, as latency will fall greatly, yet it's all there right now. At least until Dunnington, from the Bangalore Intel plant, gets its teeth in.
At least complexity, therefore potential, are on steep rise.
drashek
Ha
good riddance to bad rubbish
their own arrogance brought this sad state of affairs on them
I will continue to sell them short as they are still highly overrated
and I always thought their heatbeast chipsets were way too hot
Tylasberg
I don't think Tylasberg uses QPI to talk to another Tylasberg, but leaves it up to the CPU to route its messages to the other Tylasberg. I could be wrong, but that's what I remember seeing pictures of.
The Inquirer Getting Soft?
Is the Inquirer Getting Soft on us? Either I'm a lot more drunk than I thought or you just posted a (somewhat) pro-Nvidia article. Even the best of us have our weaknesses. I still love you.
Excuse me but..
I prefer to believe what industry experts and people that test things say about the n200 nonsense, rather than some fool trying to get more manufacturers to use it when they don't even have to, sorry.
10 TFlops
Four 4870 X2s would be great for this board, as the last-place Top500 supercomputer has about 9 TFLOPs. So this system would be great for scientific stuff if you can get it working at full power. A dual-socket system with 4 cards shouldn't cost more than $5K: about $550 for each card, $550 for both CPUs, as you can use the lowest Bloomfield since the CPUs contribute very little to peak FLOPs in this case, $500 more for the mainboard, and $700-800 left for memory, case and other stuff.
So a real bargain, as a CPU cluster of this power costs millions.
680i
"Nforce 680i remains the highest-performing DDR2 chipset for the 65nm Intel Core 2 generation: clock for clock, it had the fastest memory benchmarks, and the best USB, Ethernet and RAID test results"
Don't know exactly what part of the RAID you were testing, but as soon as you add more than 2 drives the HDD controller bottlenecks and you won't get any extra performance.
Over on the SLI Zone forums I am daily giving people advice to either get an Intel chipset, get a dedicated RAID card with an IOP, or toss the idea of RAID completely.
Never thought I would argue this way with the Inq's view of an NV product, but here we are :)