First, let's state the obvious: SLI is out, and it works very well, but it has some warts. Those warts are being ironed out with all due rapidity; each driver release plays whack-a-mole with a few more annoyances. For this article, I will assume all the remaining issues are solved, and that ATI, while having similar teething troubles, will fix any bugs in a similar time span. This is about philosophies, not specific implementations.
So, how does SLI work? Well, quite simply, it has two cards connected across the PCIe bus to the rest of the computer. The cards are connected to each other via an SLI connector, a short, high-bandwidth bridge that runs independently of the PCIe bus. The PCIe links to the cards themselves are, for the most part, 8x each: most SLI boards take a 16x slot and carve it into two 8x channels. Some high-end workstations use the really nice nForce Professional 2200/2050 MCPs to get real 2x16 PCIe channels, but that is a vanishingly small percentage of the market.
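To put numbers on those slot layouts, here is a small sketch using the first-generation PCIe figures: 2.5Gbps per lane in each direction, with 8b/10b encoding spending 2 of every 10 bits on the wire. The function and layout names are mine, for illustration only.

```python
# PCIe 1.x: 2.5 Gbps per lane, in each direction, before encoding overhead.
RAW_MBIT_PER_LANE = 2500


def link_mb_per_s(lanes: int) -> int:
    """Usable one-direction bandwidth of a PCIe 1.x link, in MB/s."""
    raw_mbit = lanes * RAW_MBIT_PER_LANE  # bits on the wire
    data_mbit = raw_mbit * 8 // 10        # 8b/10b: 8 data bits per 10 line bits
    return data_mbit // 8                 # bits -> bytes


layouts = {
    "one card in a 16x slot": [16],
    "two cards, 16x carved into 2x8": [8, 8],
    "two cards, real 2x16 (workstation)": [16, 16],
}
for name, links in layouts.items():
    per_card = link_mb_per_s(links[0])
    total = sum(link_mb_per_s(n) for n in links)
    print(f"{name}: {per_card} MB/s per card, {total} MB/s aggregate")
```

Carving a 16x slot into two 8x links keeps the aggregate at about 4000MB/s but halves what each card gets; only the 2x16 boards double the total.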
So, you have two cards talking to the external world through an aggregate 16x PCIe, talking to each other through a link that is 'fast', and to the monitor through the usual VGA or DVI port. So far, nothing new.
It works in two modes, AFR (Alternate Frame Rendering) and SFR (Split Frame Rendering). They do pretty much what they sound like: in AFR the cards take turns rendering whole frames, and in SFR each card renders part of every frame. When they work, they give you a huge boost to the available rendering power, maybe not double, but close. Since the bridges only support two cards, there is a hard cap on how many cards you can use in unison. There are rumours floating about n-way support, but right now, nothing official.
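As a toy model of the two modes (not how either driver actually schedules work), AFR is a round-robin over whole frames, while SFR cuts every frame at a split line. The function names are hypothetical, and the split line is a fixed parameter here; a real SFR implementation would presumably move it from frame to frame to keep both cards equally busy.

```python
from typing import List, Tuple


def afr_gpu_for_frame(frame_index: int, num_gpus: int = 2) -> int:
    """AFR: GPUs take turns, each rendering a whole frame."""
    return frame_index % num_gpus


def sfr_bands(height: int, split_line: int) -> List[Tuple[int, int]]:
    """SFR: every frame is cut at split_line. GPU 0 renders rows
    [0, split_line) and GPU 1 renders rows [split_line, height)."""
    if not 0 < split_line < height:
        raise ValueError("split_line must fall inside the frame")
    return [(0, split_line), (split_line, height)]


print([afr_gpu_for_frame(f) for f in range(6)])  # [0, 1, 0, 1, 0, 1]
print(sfr_bands(1200, 600))                      # [(0, 600), (600, 1200)]
```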
ATI is a little harder to pin down; everything is rumour, but there are enough clues to get a good picture. First, it does not look like there is a connector on the cards for an SLI-like bridge; everything is sent over the PCIe bus. This seems to be because of the reactive nature of ATI's response to SLI; they basically were caught with their pants down. R520 was probably far enough along when SLI hit that there was no room to implement that kind of chip-to-chip communication in hardware for the first generation of ATI products.
While that may sound bad, it really isn't, for reasons I will get into later. ATI currently has an SLI-like mode in some chips; read this for more info. The short summary is that ATI looks to be introducing a tile-based mechanism for SLI.
The upside to this tile mechanism is that it can be done mostly in software, hence its attractiveness at the 'oh god, we need a solution NOW!' meeting you know ATI had. Also, unlike in the chess board analogy, there is no need to limit yourself to two 'colours' of tiles, so you could theoretically use as many GPUs as you can cram into a machine, a definite plus.
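A sketch of why tiling has no built-in two-card limit: assign each screen tile to a GPU by (column + row) mod N. With N = 2 this is exactly the chess board; with more GPUs it becomes diagonal stripes of N 'colours'. The 32-pixel tile size and the function names are hypothetical, chosen for illustration, not ATI's actual parameters.

```python
import math
from collections import Counter


def tile_owner(col: int, row: int, num_gpus: int) -> int:
    """Which GPU renders the tile at (col, row); num_gpus=2 gives a chess board."""
    return (col + row) % num_gpus


def tiles_per_gpu(width: int, height: int, tile: int, num_gpus: int) -> Counter:
    """Count how many tiles of a width x height frame each GPU gets.
    Partial tiles at the right or bottom edge count as whole tiles."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return Counter(tile_owner(c, r, num_gpus)
                   for r in range(rows) for c in range(cols))


# A 1600x1200 frame in 32-pixel tiles is a 50 x 38 grid, 1,900 tiles in all.
for n in (2, 4, 8):
    print(n, "GPUs:", dict(sorted(tiles_per_gpu(1600, 1200, 32, n).items())))
```

With two GPUs each gets 950 of the 1,900 tiles; adding GPUs just changes the modulus, which is the scalability argument in miniature.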
So, what about the number of GPUs you can cram into a box? It depends a lot on the chipset you are using, and how many of them you have. Again, Nvidia has a solution out for dual 8x slots; ATI is still stuck at one, but there are pictures of a dual-slot setup floating around the net, so I suspect one is imminent. Via also has a robust dual-slot kit, but since it doesn't want to pay the licence fee, you won't get SLI mode or the logo. I have a feeling this will change dramatically when ATI comes out with its solution; such is competition in tech.
High end
On the very high end, it is Nvidia all by itself. The last time I talked to ATI, there were no plans for a multi-NB setup, or even a 'professional' line for Opterons and workstations. While I have no doubt there is a crash program going on right now, I don't expect to see anything from it this year. Among the graphics vendors, Nvidia is alone in offering that functionality. The dark horse in this race is Via, and I do expect it to have a multi-NB solution out in the not-so-distant future, sooner than ATI at the very least.
To go out on a further tangent, we come to the CPUs themselves. By its very nature, AMD has a much better platform for letting multiple intelligent NBs work in unison. The point-to-point nature of HT is ideally suited to stringing chips off of CPUs in a Lego-like fashion, and Nvidia has exploited this more than any other vendor out there. Intel, on the other hand, has a shared bus, and while you can plunk as many chips on it as you want, it becomes much harder to engineer with each addition. Until CSI gets here, the multi-NB world is pretty much an AMD-only affair.
So, on AMD at least, you can theoretically have four Nvidia NBs in a system. Split each into two 8x slots and you get an eight graphics card system. Engineering issues aside, an 8-port SLI connector would look mighty stupid, and I doubt we will ever see such a beast in production, but a four-way one is an outside chance. ATI, with its software answer, is suddenly looking mighty smart.
If you also look at the current PCIe cards out there, the X850 and the 6800, they share one thing: they are not bandwidth limited on the PCIe bus. A bunch of reviews have shown that going from 16x to 8x has a minimal performance impact on the current crop of cards. This is most likely because they are transitional products, designed to straddle the move from AGP to PCIe. I fully expect the generation of cards that hits in 2006 to take much better advantage of the bidirectional bandwidth offered by PCIe, but right now, we seem to have cards that don't exceed the AGP limits even when on PCIe.
This means that on a 16x slot, both cards are swimming in excess bandwidth. On an 8x slot, they probably have a little bandwidth to burn, but not a terrible amount. On a dual 16x setup, it is back to swimming, but the water is twice as deep. Nothing can saturate a 16x slot, and probably nothing will for a while.
Overall, we have different methodologies of implementing SLI, different inter-card communication mechanisms, different capabilities offered by all the chipset vendors, and different limits imposed by the platforms. Got all of that? There will be a quiz at the end of the class. Seriously though, how does this translate into the real world? Well, let's look at how those capabilities are used.
Both cards render about half of what ends up on screen, and the slave card sends its share to the master card, which in turn sends it to the monitor. Both cards appear to need a full set of the textures and geometry on board, because each can be called on to render any part of the scene in the course of a frame or two. Nvidia cards in AFR mode definitely need everything on each card, and ATI almost assuredly does; Nvidia in SFR could possibly get away with a reduced geometry and texture set on each card, but I really doubt it. What this means is that each card needs the same data sent to it that a single card would, so there is no bandwidth savings.
When a card renders its half of the image, it needs to send the result to the other card. Here is where the philosophical differences start to crop up. You would think that they would just stuff it back across a PCIe link to the other card; remember, PCIe is not a bus but a point-to-point link. This is exactly what ATI does. On two 16x links, the excess bandwidth is a plus, and it should not affect performance. On two 8x links, things can get a bit snug, but it is probably not a huge issue unless there is a lot of chatter between the cards for other things.
Nvidia sidesteps this whole issue. The SLI connector provides a direct link between the cards, and it is more than sufficient for the task. In fact, since it was engineered specifically for the job at hand, it should pass the data plus whatever other overhead there is without breaking a sweat. More importantly, it should do so without adding any PCIe traffic.
The take-home message here is that NV is easier on the bus, while ATI is easier to implement and a little more scalable without extra widgets. NV could probably make a 3- or 4-way bridge if need and sanity called for it. Don't hold your breath for anything more than a tech demo at a trade show, though.
How much bandwidth are we talking about? Well, when gaming on my rig, currently a P4/3.46 with a GF 6800Ultra, I can run most games at 1600*1200*32 at 85FPS. A little math shows that is about 650MBps of finished pixels passed to the monitor (1600*1200 pixels * 4 bytes * 85 frames). Assume the slave card passes half of that to the master, add a little overhead, say 50%, for geometry, textures, and the always necessary miscellaneous, and the pixels alone work out to roughly 0.5GBps. Pad that for heavier real-world overhead and you are at about 1GBps flying between the cards, a number that was a guess when I started the article but is more or less correct according to some graphics card people in a position to know.
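The pixel arithmetic is easy to redo. This sketch multiplies width, height, bytes per pixel, and frame rate, then applies the half-a-frame and 50% overhead assumptions from the paragraph above. Done this way, the pixels alone come to roughly 650MBps at the monitor and just under 0.5GBps between the cards, so a ~1GBps figure has to lean on overhead beyond the 50% allowance. The function name is illustrative, and MB here means 10^6 bytes.

```python
def scanout_mb_per_s(width: int, height: int, bits_per_pixel: int, fps: int) -> float:
    """Finished-pixel bandwidth in MB/s (1 MB = 10**6 bytes)."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps / 1_000_000


to_monitor = scanout_mb_per_s(1600, 1200, 32, 85)
slave_to_master = to_monitor / 2       # slave renders roughly half of what is shown
with_overhead = slave_to_master * 1.5  # the "say 50%" allowance for everything else

print(f"to monitor:      {to_monitor:.1f} MB/s")       # 652.8
print(f"slave -> master: {slave_to_master:.1f} MB/s")  # 326.4
print(f"plus 50%:        {with_overhead:.1f} MB/s")    # 489.6
```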
Now, PCIe runs at 2.5Gbps per lane in each direction; note the small b versus the large B, one is bits, the other bytes. That is about 312MBps on the wire, or 250MBps of actual data once the 8b/10b encoding overhead is taken out, so 1GBps is a little more than three PCIe lanes at the raw rate and a full four lanes in practice. If you bump the rez up to 2048*1536, you get about 1.6 times the data, with overhead a slightly smaller share of the total. Let's just say that the next bump in res should take you to about 1.5GBps between the cards.
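The lane arithmetic, with and without the 8b/10b encoding overhead, as a quick sketch. The 1GBps and 1.5GBps inputs are the estimates from the text; the helper name is mine.

```python
import math

RAW_MB_PER_LANE = 2500 / 8                   # 2.5 Gbps per direction = 312.5 MB/s on the wire
DATA_MB_PER_LANE = RAW_MB_PER_LANE * 8 / 10  # 250 MB/s left after 8b/10b encoding


def lanes_for(mb_per_s: float) -> tuple:
    """Fractional lane count at the raw rate and at the usable data rate."""
    return mb_per_s / RAW_MB_PER_LANE, mb_per_s / DATA_MB_PER_LANE


res_ratio = (2048 * 1536) / (1600 * 1200)  # about 1.64x the pixels per frame
for label, mb in (("1600x1200, ~1GBps", 1000), ("2048x1536, ~1.5GBps", 1500)):
    raw, data = lanes_for(mb)
    print(f"{label}: {raw:.1f} lanes raw, {data:.1f} lanes of data "
          f"(round up: {math.ceil(data)})")
print(f"pixel ratio: {res_ratio:.2f}")
```

At the higher resolution that is 4.8 lanes at the raw rate and six once encoding is counted, out of the eight an 8x slot provides.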
Ignoring the added geometry and texture use from next-gen games, and pretending the 512MB cards are not just around the corner, we still end up with the equivalent of five or six PCIe lanes eaten up by cross-card babble. Now, tests have shown that the 16x to 8x PCIe transition does not make much difference, while on the last generation of AGP cards, going from 8x to 4x made a little difference. That means the current crop of cards, the 6800/X800, probably use more of the available bandwidth than the old 5900/9800 lines, putting them somewhere between AGP 4x and 8x levels, but with a lot less headroom than before.
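A back-of-the-envelope budget shows why an 8x link gets snug. The assumptions here are all mine: the master card's inbound direction carries both its own geometry and texture traffic and the slave's finished pixels; the card's own load is pegged at the AGP 4x peak, the low end of the range the paragraph above puts current cards in; and the cross-card figures are the article's 1GBps and 1.5GBps estimates. Real traffic is bursty and won't add up this neatly.

```python
PCIE_X8_MB_PER_S = 8 * 250  # PCIe 1.x 8x after 8b/10b, one direction
AGP_4X_MB_PER_S = 1066      # approx. AGP 4x peak (66.67 MHz x 4 transfers x 4 bytes)


def x8_headroom(card_mb_per_s: float, cross_card_mb_per_s: float,
                link_mb_per_s: float = PCIE_X8_MB_PER_S) -> float:
    """What is left of the link after the card's own traffic and any
    card-to-card image traffic sharing it. Negative means oversubscribed."""
    return link_mb_per_s - card_mb_per_s - cross_card_mb_per_s


card_load = AGP_4X_MB_PER_S  # assumption: current cards need at least AGP 4x worth
for label, cross in (("bridge, nothing on PCIe", 0),
                     ("over PCIe, 1600x1200", 1000),
                     ("over PCIe, 2048x1536", 1500)):
    print(f"{label}: {x8_headroom(card_load, cross):+.0f} MB/s")
```

Under these assumptions the bridge leaves roughly 900MB/s spare, while pushing the image over PCIe already oversubscribes the link at 1600x1200.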
Shrinking the die
So, piecing it all together, we have headroom in a dual 8x setup, but it is shrinking with the added capabilities of the cards. The next gen of cards will probably fill an 8x link with little or no room left over. ATI needs some slack on the bus to pass the data across for SLI to function. Nvidia has the SLI bridge, which negates this need.
So, while the ATI solution looks really good on paper, with theoretically 20 or more cards capable of tiling a single image, the PCIe bus at 8x becomes a bottleneck in short order. I would think the R520, if it keeps up the graphics card folk's habit of doubling the last generation's performance, will hit this bottleneck hard. The bandwidth afforded by a dual 8x solution with two cards will be uncomfortably snug, and there are no 32 lane cards on the near horizon. Nvidia has something close in its multi-NB setup, but how much do you want to bet that the driver won't play nice with a dual ATI rig?
So, for me, the open question is what, if anything, can ATI do to fix this problem? Will it come out with a dual 16 channel chipset? Will it do some kind of image compression to lessen the load? Will it just grin and bear it?
To flip the problem over to Nvidia, how will it respond to the inevitable four GPU demo that you know ATI will put out to show off the chip? Will it come out with a similar demo with an octopus connecting the cards? Will it come up with something more elegant, maybe half an octopus with some added bus traffic?
On the chipset side, we know there are no real 36-lane (2x16 plus 1x4) chipsets looming, so that much is out of the question. We know NV has a lock on the current SLI setups, and Intel is in the process of certifying some SLI-capable chipsets now. ATI has a dual 8x board on the way, but is the current chipset capable of supporting an SLI-like mode? If not, what will it use?
The dual card market, other than the NV SLI setup, gets really murky really fast. The PCIe bus that is supposed to be the savior and carry us to the next level looks like it is under severe pressure less than a year into its life. The next gen of GPUs, when they come out, if they hold true to past patterns, will overwhelm the meagre capabilities offered on current motherboards. Who said the interesting times were behind us in computing? µ
(1) I am pretty down on ATI of late over its release schedule, or lack thereof. Let's just say I agree with Scott Wasson here and here.