Advertising is the rattling of a stick inside a swill bucket - George Orwell
THE HD4870X2 is finally here, and we're checking how far the unit will go coupled with the X48 chipset on a QX9650 Asus system.
Yes, the current driver doesn't seem to support the 'sideport' yet. What the hell are we talking about here - besides chasing ATI to enable that in the driver asap?
Well, some time prior to the R700 launch - in fact even before the official RV770 single GPU cards were out - there was a whisper circling around that ATI's newest chippery might have better ways of talking to each other when on a single card rather than wasting time through a PCI-E bridge. After all, PCI Express isn't exactly "Express" in terms of raw latency (a bit worse than even the old PCI-X according to the interconnect gang) and having to go through an added third party chip like the PLX units you see doesn't help reduce it.
Talk about bandwidth? Well, for exchanging stuff, the theoretical 8GB/s per direction bandwidth of 16-lane-wide PCI-E v2 isn't bad, but it pales in comparison to the 115 GB/s of local memory bandwidth per GPU - at one-tenth the access latency too.
The latency buzzword didn't create much buzz in the GPU space as graphics processing usually meant processing long streams of data (polygon / vertex lists, ever-larger textures, and scan lines for frame buffer display) rather than small chunks of data, where the first word latency means a lot to the overall processing speed.
Now, GPGPU - whether the idea of it is successful or not - changed that. Computing in general is far more affected by that "first word latency" for small data packets (sometimes just a set of 64-bit operands for a FP computation for instance). Otherwise, do you think those low latency memories would matter that much?
Now we come back to the R700 pair of two RV770 chips talking over both PCI-E x16 v2 - via the PLX bridge and its 140 ns or so latency - as well as the direct 'sideport' with another claimed 2 x 5 GB/s bandwidth. Of course, the Crossfire connectors on top add nearly another gigabyte/s but, for now, we won't count that.
Now, if that 'sideport' is just another PCI-E direct link, it will obviously have a far lower latency than the bridged one - useful for frequent smaller data transfers common in GPGPU computational apps. And yes, there will be some benefit in the possibility of parallel multiple transfers between two GPUs using both bridged and direct links, improving Crossfire scaling in some cases.
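To put rough numbers on that trade-off, here is a minimal sketch that models a transfer as fixed latency plus size divided by bandwidth. The bridged figures (140 ns, 8 GB/s per direction) and the sideport bandwidth (5 GB/s per direction) are the ones quoted above; the 50 ns sideport latency is purely an assumption for illustration, as are the `Link` class and the function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    latency_s: float    # fixed cost per transfer, in seconds
    bytes_per_s: float  # one-direction bandwidth, in bytes per second

    def transfer_time(self, n_bytes: int) -> float:
        """Simple model: fixed latency plus streaming time."""
        return self.latency_s + n_bytes / self.bytes_per_s


# Figures from the article: 140 ns bridge latency, 8 GB/s per direction.
BRIDGED = Link("PCI-E x16 v2 via bridge", 140e-9, 8e9)
# 5 GB/s per direction is the claimed figure; 50 ns is an ASSUMED latency.
SIDEPORT = Link("sideport (assumed 50 ns)", 50e-9, 5e9)


def crossover_bytes(low_latency: Link, high_bandwidth: Link) -> float:
    """Transfer size at which both links take the same time."""
    latency_gap = high_bandwidth.latency_s - low_latency.latency_s
    per_byte_gap = 1 / low_latency.bytes_per_s - 1 / high_bandwidth.bytes_per_s
    return latency_gap / per_byte_gap


def main() -> None:
    for size in (64, 4096, 1 << 20):
        for link in (BRIDGED, SIDEPORT):
            t_ns = link.transfer_time(size) * 1e9
            print(f"{size:>8} bytes over {link.name:<26} {t_ns:12.1f} ns")
    print(f"crossover at about {crossover_bytes(SIDEPORT, BRIDGED):.0f} bytes")


if __name__ == "__main__":
    main()
```

Under these assumptions the direct link wins for transfers of around a kilobyte or less, while the wider bridged link wins for bulk copies. A different latency guess moves the crossover point, but not the shape of the argument.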
What could be done to improve the connection between the two GPUs so that there's no need for duplicating the content between each GPU's respective memory? Since ATI is now AMD, for better or worse, Hypertransport 3 could be a plausible replacement for PCI Express. First, the bandwidth is far higher (a 2 x 32-bit HT3 link could provide 45GB/s total bidirectional bandwidth), and the total interconnect latency is halved, which benefits both bandwidth- and latency-sensitive usage models.
Secondly, with two chips placed close together and full timing controls, the HT3 direct link clock could be raised further, to push the total bandwidth close to 60 GB/s - half of the current GDDR5 memory bandwidth for each GPU. And if the future RV8xx chips are placed together on an MCM substrate, this link could grow both wider and faster across a very short millimetre-scale distance to basically match the per-GPU memory bandwidth.
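For a quick sense of scale, the sketch below lines up the bandwidth figures quoted so far against the 115 GB/s of local memory bandwidth per GPU. All numbers are the article's own; note that the HT3 figures are total bidirectional while the PCI-E and sideport figures are per direction, so this is a rough comparison rather than a like-for-like one. The dictionary and function names are made up for illustration.

```python
# Figures quoted in the article, in GB/s.
LOCAL_GDDR5 = 115.0  # local memory bandwidth per GPU

LINKS = {
    "PCI-E x16 v2 (per direction)": 8.0,
    "sideport (per direction, claimed)": 5.0,
    "HT3 2 x 32-bit (total bidirectional)": 45.0,
    "HT3 with raised clock (total bidirectional)": 60.0,
}


def fraction_of_local(link_gb_per_s: float) -> float:
    """Link bandwidth as a fraction of local memory bandwidth."""
    return link_gb_per_s / LOCAL_GDDR5


def main() -> None:
    for name, bandwidth in LINKS.items():
        share = fraction_of_local(bandwidth)
        print(f"{name:<45} {bandwidth:5.1f} GB/s  {share:6.1%} of local")


if __name__ == "__main__":
    main()
```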
Even without that, it'd still be good enough to make the local vs remote GPU memory speed difference far less painful in many apps.
Finally, Hypertransport can, like Intel Quickpath and their - in some ways still superior - daddy, the Alpha EV7 interconnect, provide non-uniform direct remote memory access across a common address space, with a choice of cache coherent or not, depending on the use. That vastly simplifies addressing the memory of multiple GPUs between themselves as a common address space.
If, then, we had, say, two identical HT links per GPU, we could make an on-card (or cabled) four-way Crossfire GPU link with all of them sharing the common memory at minimal latency and bandwidth loss.
What about internal implementation? Well, look at those 512-bit ring memory controllers within each RV770, with four 64-bit ports creating the 256-bit outside GDDR5 memory path. With one HT link inside, just add two more ports, one read and another write, for the outside link. For two HT links, just double it - it will then, combined with the local memory width, finally saturate that ring controller.
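The port arithmetic can be checked in a few lines. This sketch takes the paragraph's description of the controller at face value (a 512-bit ring, 64-bit ports, four of them for local memory) and assumes each added HT port is also 64 bits wide, which the text implies but does not state. The constant and function names are hypothetical.

```python
PORT_WIDTH_BITS = 64     # width of each ring port (assumed for HT ports too)
RING_WIDTH_BITS = 512    # ring width as described in the paragraph above
LOCAL_MEMORY_PORTS = 4   # four ports forming the 256-bit GDDR5 path
PORTS_PER_HT_LINK = 2    # one read port and one write port per link


def total_port_bits(ht_links: int) -> int:
    """Combined width of local memory ports plus HT link ports."""
    ports = LOCAL_MEMORY_PORTS + PORTS_PER_HT_LINK * ht_links
    return ports * PORT_WIDTH_BITS


def main() -> None:
    for ht_links in range(3):
        bits = total_port_bits(ht_links)
        status = "saturated" if bits >= RING_WIDTH_BITS else "headroom left"
        print(f"{ht_links} HT link(s): {bits} bits of {RING_WIDTH_BITS} ({status})")


if __name__ == "__main__":
    main()
```

With no HT links the ports add up to the 256-bit memory path, one link brings that to 384 bits, and two links reach 512 bits, which is where the ring would be fully used.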
I'm not saying it must be Hypertransport - ATI may as well work out a more optimised, specific purpose link, but HT is out there and begs to be used. Oh yes, Nvidia is also a HT licensee - that may be one factor in ATI looking to something more 'enhanced' though.
One exception to this whole thing, no matter what we do, is: when there is specific data that both GPUs use very frequently and have to keep locally. In such cases, no point playing around with remote access: allocating part of the local GPU memory for this purpose would fit well. But in any case, ATI has made a good move by enhancing the multi-GPU link with direct high-speed bus links: shared memory multiprocessor graphics seem to be the next move - after all, CPUs have already trodden that path well... µ
A very interesting idea using HT for interconnect. 

I'm pretty sure that the RV770 no longer uses an internal ring-bus.
torrenza. right in the middle of your article that little voice in my head started to whisper ever so silently "torrenza". 
although both chipzilla and daamit are supposed to integrate pci-e onto the die just like the memory controller, i personally would prefer daamit to use its HT3 tech to link everything together.
this finally could be the "killer app" for HT3 and torrenza. think about it: a multi cpu/gpu combo with HT interconnects and some form of shared memory space. would give daamit at least some leverage against larrabee (if that holds what's promised).
could be some serious numbercrunching machine.
i'm under the impression that the sideport currently isn't used. that would be a first step.
Where do you morons come up with this stuff? It's why most of the "internet" review sites and chat rooms are filled with balderdash....

There is no ring bus in RV770 - AMD got rid of it in favour of a 'hub' memory controller in order to save die space and optimise perf/mm^2
how about they make it genuinely reliable?
There is no ring bus for the RV7x0 series. They've returned to a PTP crossbar ROP/TF/TU/MC arrangement, with an extra two ALU blocks to fill out the die space.
But the thing is, the GPUs don't need to talk to each other much. If one does half the screen and the other does the other half, they only need to sync a few items. The same goes for GPU applications: the calculations can be done locally, the results will be much smaller than the dataset, and only those results and some sync signals need to be exchanged.
In fact the whole reason X2 cards have twice the RAM is that they just work on their own copies of everything and only the output is combined AFAIK, and yeah that's costly but means no need for super-speedy communications surely.
The article states ATI used a ring bus memory controller for the RV770. I thought that was ditched in favor of something different? Please confirm this.
Nice article, but the RV770 doesn't use the Ring Bus Memory Controller anymore. It uses a hub-based one, which is more power efficient and makes it easier to use all of the available bandwidth, whereas the RV670 and R600 were able to use only 85% of it.
No wonder there are a lot of people moaning, whining and bitching about the article, it's all true, yet even I still use Vista for no particular reason.

Long live Charlie.

Some of these commenters should go away and never come back, they are disrespectful.
ati has already commented that it is unlikely they will ever enable the sideport. More ati features for the future that they have no intention of enabling.
What rubbish....
You have Hypertransport or you don't.
I wish AMD would force the HTX bus, it will simply put raw speed on the table for AMD CPU's and GPU's.
Simply make a DIRECT connect!
I’ve been working hardware and software since the days when tubes reigned supreme. The GPU’s will be integrated into the CPU as additional cores. This is the best solution at the lowest price. You will have to upgrade the entire chip to upgrade either component.
" Some of these commenters should go away and never come back, they are disrespectful.
posted by : UnReaL "

Now there's a Corleone talking. :) We like.

No really: Charlie has done his homework more than a lot of us; Torrenza dictates a HT link between components, be it on-die or in a separate slot or connection. All components (cores/gpus) are to be landed on HT links. So the convergence should be on the table, spelled and rolled out Real Soon Now (tm). I see the sideport and the memory Hub both as temporary solutions; convergence also saves money.