There are 10 parts launching today, although only two are actually available now: the highest end HD 2900 and the Mobility HD 2300. The 2300 is not a 2xxx-class part, it is a renamed 1xxx part, so just ignore it in this context. The others are the HD 2600 Pro and XT, the HD 2400 Pro and XT, and four mobility cards. The full names of those parts are the Mobility Radeon HD 2400 and 2400 XT, and 2600 and 2600 XT. For this article, I will stick to the desktop parts.
First some raw numbers. The 2900XT is the highest end part, with a core clock of 740MHz and 512MB of 512b wide GDDR3 running at 825MHz. It has 700M transistors and takes about 215W depending on board configuration.
The HD 2600 and 2400 have clocks set at 600-800MHz and 525-700MHz respectively, with memory speeds of 400-1100MHz and 400-800MHz. The 2600 uses either GDDR2, 3 or 4, all in 256MB quantities and all accessed on a 128b bus. The 2400 has 64b wide GDDR2 in 128 or 256MB configurations, or 256MB of GDDR3. Board makers like flexibility on lower end parts for cost reasons.
The 2600s have about 390M transistors and consume 45W while the 2400s have 180M transistors and sip a mere 25W. This means that even the highest end 2600XT does not need a PCIe 6/8 pin connector while the 2400 can be passively cooled. OEMs love this kind of thing.
If you look at it from a purely numerical perspective, the 2900 has similar frequencies, twice the transistors and twice the memory at the same clock as the 2600, but it consumes almost 5x the power. Something sounds fishy.
The reason for this is the process they are built on: the 2900 is on TSMC's 80HS while the lower grade cards are on TSMC's 65G+. Most people think a process shrink is all about die size and therefore cost, and that is true, but only part of the story. You want the volume parts, i.e. the lower end ones, to be as efficient as possible; the higher end parts are low volume with high enough margins that it is far less pressing to shrink them.
Raw shrinking will get you to about 2/3 the size ((.65^2)/(.80^2) ~= .66), but what about the power? Half the resources should get you to about half the power; add in a bit more savings from the shrink, and that still leaves the 2900 consuming about twice what it should.
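To make that shrink arithmetic concrete, here is a quick sketch. The node figures are the 80nm and 65nm processes named above; the rest is strictly back-of-the-envelope, not a real scaling model:

```python
# Back-of-the-envelope die shrink math for an 80nm -> 65nm move.
# A linear shrink applies in both dimensions, so area scales with its square.
old_node_nm = 80
new_node_nm = 65

area_ratio = (new_node_nm / old_node_nm) ** 2
print(f"Die area after shrink: {area_ratio:.2f}x the original")  # ~0.66x
```

Halving the functional units should roughly halve power, and the shrink buys a bit more on top of that, which is why the 2900's near-5x power gap over the 2600 points at the process rather than the design.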
The problem? The 80HS process is really 'variable'. Some cards come off the line in great shape, but an equal number come off leaky and hot. Well, they are all leaky and hot, but some are just leakier and hotter than others. This is about as bad a situation as you can get in semiconductors while still having functional parts at tolerable yields.
On the other hand, the 2400 and 2600 are built on a smaller process optimized for power savings. You end up with smaller GPUs that run cooler, cost less to make, and are generally better in every respect. Since most of these boards will end up in the hands of OEMs, I can't overstate how big a deal passive cooling and the lack of external PSU connectors are; this is gold for system builders.
When the R600 die is put on the same process and rechristened the R650, we should see massive power drops. This will manifest as either more efficient cards or more likely a higher end part that uses most of that drop as clock speed. Either way, this one will be the card to watch in late summer.
Getting back to the layout of the cards, from the 10,000 foot (3048m) view, the 2-series cards are continuing a trend away from crossbars toward ring busses. The first X-series GPUs from ATI, and all Nvidia parts, use a crossbar switch. This means each unit that needs to be connected to another has a switched connection to that unit.
This sounds good, and it is fast, but also complex. If you have 2 sources and 2 targets, you need to support 4 possible connections. With a 512b memory subsystem accessible in 64b chunks, you have 8 memory ports. Connect this to 8 functional units, and you are at 8 x 8 connections, 64 for the math impaired. This scales quickly, and has some really nasty consequences for design and debugging, die area and most notably power. 4x the number of units has 16x the complexity, area and power use. Not a good trend.
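The scaling problem above is easy to show in a few lines. This is just the article's arithmetic restated, assuming (as the text does) that every source gets a dedicated link to every target:

```python
# Crossbar complexity: every source needs a switched link to every target,
# so the link count grows as sources * targets.
def crossbar_connections(sources, targets):
    return sources * targets

print(crossbar_connections(8, 8))    # 8 memory ports x 8 units = 64 links
print(crossbar_connections(32, 32))  # 4x the units on each side -> 16x the links
```

Quadruple the units on both sides and the link count goes from 64 to 1024, and die area and power for the switch fabric grow along with it.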
For the X1xxx series, ATI went to what they called a hybrid ring bus. This is a ring that connects the units around the outside, like a highway surrounding major cities, with a much simplified crossbar in the middle. Things that are less latency sensitive and need a lot of bandwidth can go on the ring, others can stay on the crossbar. It is a good compromise.
Eventually however, the same scaling problems hit the hybrid model, and you end up with a pure ring like the R600 above. This is the overarching architecture of data movement on the Radeon HD 2xxx series of chips, and it makes a lot of sense.
There are several things to pay attention to beyond the ring itself: the ring stops, the rings themselves, the central core, and the memory connections. That ring is a pretty hefty 1024b wide, organized into four busses of 256b each, two in each direction.
The obvious math is 740MHz * 1024b wide is 3/4 of a Tbps of bandwidth past any one point, or 1.5Tbps of cross-sectional bandwidth for the chip. Even better, no unit is more than two hops away from any resource it needs. It isn't quite the same latency as a crossbar, but let's see you make a crossbar with that bandwidth and not have it melt the polar ice caps.
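The ring numbers check out with a quick calculation, assuming (as the article implies) one transfer per clock across the full 1024b width:

```python
# Ring bus bandwidth estimate from the article's figures.
clock_hz = 740e6        # core clock
ring_width_bits = 1024  # four 256b busses, two per direction

per_point_tbps = clock_hz * ring_width_bits / 1e12
print(f"{per_point_tbps:.2f} Tbps past any one point")  # ~0.76 Tbps
```

That gives roughly 3/4 Tbps past any one point, matching the quoted figure; the 1.5Tbps cross-sectional number is simply that counted in both directions.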
The stops themselves come in two types, four labeled "Ring Stop" and one labeled "PCI Express Ring Stop". In case it isn't painfully obvious, the PCIe one is the point at which data moves on and off the card to the PCIe bus. If you are loading a texture from system memory, it gets pushed across the PCIe bus to that ring stop, stuffed onto the ring, and sent to the appropriate stop for its memory location.
That brings us to the main ring stops. In addition to the ring itself, they have two kinds of connections, the black squares and the red and black interior connections. The external connections are to memory, with each connection being a 64b wide memory bus. Two per ring stop means 128 bits of width per stop, and four stops get us to the 512b memory interface.
One interesting note is the set of problems caused by increasing memory bus width. On chip, it is an annoying problem: you just draw more lines and change how you route data around, and rings solve this nicely. Off chip, well, you need pads for every data line, and you have to solder them to the PCB, not a trivial task.
ATI solved this with pretty high density pads, double the density of the previous generation. You can see these pads arranged closer than their brethren around the outside, and even a cross-eyed glance at a modern GPU will show you how many traces this needs on the PCB itself.
This isn't a trivial thing to do, however; you don't just put pads closer together on a whim. In addition to soldering problems, you get electrical interference and routing problems on the PCB. It is a tough job, but ATI pulled it off. The end result? 2140 pins and 106GB/s of bandwidth off chip.
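That 106GB/s figure follows from the specs given earlier, assuming standard double-data-rate signaling on the GDDR3:

```python
# Sanity-checking the quoted off-chip bandwidth: 512b bus, GDDR3 at 825MHz,
# two transfers per clock (DDR).
bus_width_bits = 512
mem_clock_hz = 825e6
transfers_per_clock = 2

bandwidth_gb_s = bus_width_bits / 8 * mem_clock_hz * transfers_per_clock / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # 105.6 GB/s, i.e. the quoted ~106GB/s
```

64 bytes per transfer, 1.65 billion transfers a second, and you land right on the number ATI quotes.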
The interior connection to the ring stops is a little more interesting: it is where the GPU functional units get their data. It looks like this, deceptively simple, but in practice it is a lot more complex.
You can see where the ring connects to the stop, and the stop is connected to no less than five logical units, some in both read and write capacities. This is where all the data gets thrown around, willy-nilly, and performance is won or lost.
What are each of the units, and what do they connect to? That is the topic for the next part of this series; stay tuned. µ