THE PARTS OF THE R600 that do the number crunching are just as interesting as the parts that shuffle the data around. Both are critical to performance in different ways.
THE RING STOP:
One of the central points in the HD 2900's architecture is the ring stop. It is where things go from the data shuffling parts of the chip to the number crunching parts. It is as complex as you might think, feeding five separate units at near terabit speeds.
One thing to note here: the ring stop is what all data passes through. From the number crunching cores to (more) local RAM, or to remote memory, it all goes through the ring stop. Think of this as the memory controller, or more to the point, the memory controllers. The parts labeled arbiter and sequencer are part of the Ultra-Threaded Dispatch Processor (UTDP).
THE CORE: The core consists of five different areas that each handle a different task. They are the Setup Engine, the Ultra-Threaded Dispatch Processor, Stream Units, Texture units, and the Render Back-End. There are also a lot of things that support each of the major units, but these five are the architectural keys.
The 2900 has 320 of the Stream Processing Units (SPU) grouped into four SIMDs, four Texture Units (TU) and four Render Back-Ends (RBE). The 2600 has 120 SPUs grouped into three SIMDs, 2 TUs and 1 RBE. The lowly 2400 has a mere 40 SPUs in 2 SIMDs, 1 TU and 1 RBE. It appears that each SIMD has an associated ring stop, so that would put the ring stop count at 4, 3 and 2 for the 2900, 2600 and 2400 respectively.
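For easier comparison, the unit counts above can be laid out as a small lookup table. This is only a restatement of the figures quoted here, not anything from ATI; the `Chip` class and its field names are made up for illustration, and the ring stop count simply follows the one-per-SIMD guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Unit counts as quoted in the article; the class itself is hypothetical."""
    name: str
    spus: int
    simds: int
    texture_units: int
    render_back_ends: int

    @property
    def spus_per_simd(self) -> int:
        return self.spus // self.simds

    @property
    def ring_stops(self) -> int:
        # Assumption taken from the text: one ring stop per SIMD.
        return self.simds


R600_FAMILY = [
    Chip("2900", spus=320, simds=4, texture_units=4, render_back_ends=4),
    Chip("2600", spus=120, simds=3, texture_units=2, render_back_ends=1),
    Chip("2400", spus=40, simds=2, texture_units=1, render_back_ends=1),
]

for chip in R600_FAMILY:
    print(f"{chip.name}: {chip.spus_per_simd} SPUs per SIMD, "
          f"{chip.ring_stops} ring stops")
```

Dividing it out this way shows the SIMDs themselves shrink as you go down the range, not just their number.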
THE COMMAND PROCESSOR: Think of this as a kind of pre-processor for the core. It takes information from the driver on the host PC and cleans it up a bit, making it easier for the rest of the GPU to digest. It also does some memory work.
The first thing it does is execute memory commands. If there is something that needs fetching from or writing to memory, this is where it gets kicked off. If you think about it, the earlier in the chain you can fire off a memory request, the shorter the time you will have to wait for the result.
The Command Processor also performs some state validation that used to be done on the host CPU. While this is less critical in this era of massive CPU power, the more you can offload from the CPU, the more flexibility you have. Additionally, the closer you can put this validation to the components that need it, the less latency there will be, since the PCIe bus is horridly slow compared to on-GPU data movement.
The last bit is probably the most important: ATI claims that it will massively improve small batch transfers. The claimed improvement is about 30% in CPU time used on the host. This may not be as big as it seems, though: if a graphics driver takes up 10% of CPU time, this will only save 3% in the end. Still, a saving is a saving.
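The back-of-the-envelope sum there is easy to redo with other numbers. A minimal sketch, assuming the 30% applies to the driver's own slice of CPU time; the function name is made up for illustration.

```python
def total_cpu_saving_pct(driver_share_pct: float, driver_saving_pct: float) -> float:
    """Percentage of total CPU time freed when the driver's own cost drops."""
    return driver_share_pct * driver_saving_pct / 100


# Driver uses 10% of the CPU and gets 30% cheaper: 3% of the CPU comes back.
print(total_cpu_saving_pct(10, 30))
```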
Small batches are so important because they can bring a CPU to its knees in some circumstances, doing needless work. If you are bringing a load of concrete blocks home from the store, do you bring the Ferrari or the pickup?
The Ferrari can travel 3x as fast as the pickup, but the truck can carry 25x as many concrete blocks. When you take into account that both are capped by the road speed limit, the pickup is the clear choice even if it is less glamorous. Besides, Sergé will become apoplectic if you scratch the F430, while Bufus might cry over an Old Milwaukee Ice if you nick the Ford. Bufus will get over it much quicker, so take the truck.
In any case, the same holds true for data. If you have a packet that takes 100B of overhead, and you send 10K in the packet, you end up with 10.1K of data across the bus. If you send the same 10K in 100 chunks of 100B each, you end up with 20K sent. By the time you are sending 1B at a time, you are sending 100 times as much useless data as real data.
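The same sums can be run for any chunk size. The sketch below assumes a flat 100B of overhead per packet, as in the example, and takes K to mean 1000 bytes; the function and variable names are made up for illustration and do not model any real PCIe packet format.

```python
OVERHEAD_BYTES = 100  # per-packet overhead used in the example above


def bytes_on_bus(payload_bytes: int, chunk_bytes: int,
                 overhead: int = OVERHEAD_BYTES) -> int:
    """Total bytes sent when payload is split into chunks of chunk_bytes."""
    packets = -(-payload_bytes // chunk_bytes)  # integer ceiling division
    return payload_bytes + packets * overhead


payload = 10_000  # "10K", taking K as 1000 bytes
for chunk in (10_000, 100, 1):
    total = bytes_on_bus(payload, chunk)
    wasted = total - payload
    print(f"{chunk}B chunks: {total}B sent, "
          f"{wasted / payload:g}x as much overhead as data")
```

The overhead grows in direct proportion to the packet count, which is why the smallest batches hurt the most.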
All of this useless stuff not only has to go across the bus, but it has to be packaged on the host CPU. Packaging 100 packets takes a lot longer than 1, probably about 100x as long. The command processor takes its results and hands them off to the Setup Engine.
THE SETUP ENGINE: One of the big things for DX10, other than being MeII only, is unified shaders. The guts of the unified side start in the setup engine; the rest of the chip just carries out the commands generated here. The engine itself has five subunits: the Scan Converter/Rasterizer, the Interpolators, the Geometry Assembler, the Vertex Assembler and the Programmable Tessellator.
In DX9, there were pixel and vertex shaders, and they did pretty much what you think they would do. Vertex shaders worked on vertexes and modified the geometry in a limited fashion. The pixel shaders worked on pixels, no shock there. Instead of modifying the object, they modified the look of the object; basically, they handle the effects, not the shape.
DX10 adds a geometry shader to the mix. Vertex shaders work on the vertexes and change them, but the object itself is still the object. They can bend and move things, but the object is still the same overall. Geometry shaders work on the object as a whole, and can do much larger scale manipulations.
In the old non-unified DX9 days, you had hardware to do the vertex shading and hardware to do the pixel shading. You drew the geometry, painted it, then tweaked it. For the first half of the frame setup, the vertex shaders were tweaking the polygons for all they were worth, working flat out. The pixel shaders were sitting idle. Once the geometry was set, the pixel shaders went to work, again flat out, and the vertex shaders were almost totally idle.
You ended up with a situation where half the units were idle for half the time, and that is pretty much wasted silicon. With geometry shaders thrown into the mix, there was the potential for a worst case scenario where 2/3rds of the shaders were idle for 2/3rds of the time. Ouch.
To make matters worse, if you guessed wrong on the mix of vertex vs pixel shaders needed, you could get a chip that was simultaneously under and overpowered. Since no two games are the same, barring EA titles, you were almost assured of a mismatch, wasting silicon and lowering performance.
Clearly, you needed to do something about this, and unified shaders were the answer. If one shader could do the math of all three types, you could end up with all the shaders working all the time. This is exactly what ATI did with the R600 family.
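To put rough numbers on the idle silicon argument, here is a toy model, not a description of how the R600 schedules anything. It assumes a frame is a series of phases that each need only one shader type, and that work splits perfectly across units. The unit counts and work amounts are invented for illustration.

```python
def utilization(phases, units, unified):
    """Fraction of shader-unit time spent doing useful work.

    phases:  list of {shader_type: work} dicts, run one after another
    units:   {shader_type: number of units of that type}
    unified: if True, every unit can run any shader type
    """
    total_units = sum(units.values())
    total_work = 0
    total_time = 0.0
    for phase in phases:
        work = sum(phase.values())
        if unified:
            time = work / total_units
        else:
            # Each type can only use its own units; the slowest sets the pace.
            time = max(w / units[kind] for kind, w in phase.items())
        total_work += work
        total_time += time
    return total_work / (total_units * total_time)


# Hypothetical chips with 12 shader units in total.
dx9_units = {"vertex": 6, "pixel": 6}
dx9_frame = [{"vertex": 120}, {"pixel": 120}]

dx10_units = {"vertex": 4, "geometry": 4, "pixel": 4}
dx10_frame = [{"vertex": 120}, {"geometry": 120}, {"pixel": 120}]

for name, frame, units in (("DX9", dx9_frame, dx9_units),
                           ("DX10", dx10_frame, dx10_units)):
    fixed = utilization(frame, units, unified=False)
    pooled = utilization(frame, units, unified=True)
    print(f"{name}: fixed split {fixed:.0%} busy, unified {pooled:.0%} busy")
```

Real workloads overlap their stages far more than this, so the gap is an upper bound on the waste rather than a measurement.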
The setup engine chunks up the work into bite size pieces for later work. The Vertex Assembler and Tessellator handle vertex shading code, the Geometry Assembler handles geometry shading, and the Scan Converter/Rasterizer and Interpolators work on pixel shaders.
Each unit takes the information from the drivers, the commands, data, and whatever other miscellaneous bits there are, and packages it up. It does not actually execute the commands; it just makes threads and submits them to a queue. There are three queues, one for each shader type. These command queues reside in the Ultra-Threaded Dispatch Processor.
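The package-and-queue flow described above can be sketched in a few lines. This is a loose software analogy, assuming nothing about the real hardware: the `Thread` record, the `DispatchQueues` class and the subunit keys are all hypothetical names. Only the subunit to shader type mapping and the three queues come from the text.

```python
from collections import deque
from dataclasses import dataclass

# Which queue each setup-engine subunit feeds, per the description above.
SUBUNIT_TO_SHADER = {
    "vertex_assembler": "vertex",
    "tessellator": "vertex",
    "geometry_assembler": "geometry",
    "scan_converter": "pixel",
    "interpolators": "pixel",
}


@dataclass
class Thread:
    """Hypothetical thread record; the real format is not described here."""
    shader_type: str
    payload: object


class DispatchQueues:
    """Stand-in for the three command queues held in the UTDP."""

    def __init__(self):
        self.queues = {kind: deque() for kind in ("vertex", "geometry", "pixel")}

    def submit(self, subunit: str, payload) -> Thread:
        # Setup only packages and enqueues work; nothing is executed here.
        thread = Thread(SUBUNIT_TO_SHADER[subunit], payload)
        self.queues[thread.shader_type].append(thread)
        return thread


utdp = DispatchQueues()
utdp.submit("vertex_assembler", "vertices 0-63")
utdp.submit("tessellator", "patch 7")
utdp.submit("geometry_assembler", "primitive 12")
utdp.submit("scan_converter", "pixel tile (4, 2)")
print({kind: len(queue) for kind, queue in utdp.queues.items()})
```

Keeping one queue per shader type means whatever sits downstream can pick the next thread by type without inspecting the payload.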
The overarching view is that the setup engine does just what its name says. It takes in disparate data and shuffles it to the subunit that handles the correct data type. The work is then broken down into simple instructions, and shuffled off to one of three queues in the UTDP. If anything is the brains of the chip, the setup engine is the part that should wear the crown.µ
Part 3 looks at more of the units. More tomorrow.