
There's magic in the Intel FB-DIMM buffer

Part Two: Memory technologies
Tue Apr 06 2004, 10:06
The first part of this article was published yesterday, here: Intel FB-DIMMs to offer real memory breakthroughs

THE BUFFER, that chip in the centre, is where the magic lies. Rather than the stub-bus configuration we saw earlier, the FB-DIMM memory system is point-to-point. There are no stubs; the memory controller talks to the buffer directly. Each buffer can then in turn pass the signal on to the next buffer in the chain, and so on and so on.

[Image: fb-dimm]

The signals that stay on the DIMM itself are passed on to commodity DRAM parts; nothing new is required from them. You can use plain old DDR2 and benefit from the low cost of those parts. While the buffer adds cost, it is far less than the cost of making a whole new type of RAM chip.

The signals are sent from the buffer to the memory controller in a differential serial fashion, like PCI Express or RDRAM: basically higher speed and lower width, which means a lower pin count.

The clock signal is distributed separately from the data over a different set of pins, and the memory controller can talk to the buffer over the SMBus, also separate from the data bus. Basically, the controller can talk to the buffer in a host of ways, some in band, some out of band.

What does this buy you? First of all, a vastly lower pin count. FB-DIMMs have a pin count of 69 per DIMM, broken down into 20 for data to the DIMMs, 28 from the DIMMs, six power and 12 ground. Add in three more for the clock and other things like the out of band signaling, and you come to the number 69. Compare that to 240 pins for a DDR2 implementation, per channel mind you, and you have a vast simplification in layout and board design. If you don't think it matters, look at these pictures of a DDR2 channel on the left, and dual FB-DIMM channels on the right.
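As a quick sanity check of that pin budget, here is the arithmetic restated (a minimal sketch in Python; the 69- and 240-pin figures are the article's own numbers):

    # FB-DIMM per-channel pin budget as broken down above
    fb_pins = 20 + 28 + 6 + 12 + 3   # data to DIMMs, data from DIMMs, power, ground, clock/out-of-band
    ddr2_pins = 240                  # pins for a single DDR2 channel

    print(fb_pins)                        # 69
    print(round(ddr2_pins / fb_pins, 1))  # ~3.5 times the pins per channel for DDR2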

[Images: fb2 and fb1, showing a DDR2 channel (left) and dual FB-DIMM channels (right)]

Another thing that simplifies board design is the ability to use unequal length traces. The memory controller and buffer can compensate for them, so there are no more weird routing paths to make sure the timing is correct. Upon initialization, the controller will measure the signal timings on each pin, and delay the fastest to match the slowest. In practice, this delay does not cause any real world performance degradation. The lessons of RDRAM seem to have sunk in at Intel.
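A minimal sketch of that de-skew step, with made-up per-pin timings (the real calibration happens in the controller and buffer hardware at initialization, not in software):

    # Hypothetical flight times measured per pin at initialization, in picoseconds
    measured_ps = {"pin0": 310, "pin1": 295, "pin2": 330, "pin3": 305}

    slowest = max(measured_ps.values())
    # Delay every faster pin so all signals arrive together with the slowest one
    added_delay_ps = {pin: slowest - t for pin, t in measured_ps.items()}
    print(added_delay_ps)  # {'pin0': 20, 'pin1': 35, 'pin2': 0, 'pin3': 25}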

All of this allows you to put in three channels of FB for the same pin cost as a single DDR2 channel. By the time FB hits the market, dual channels will be pretty much standard on everything. For anyone smart enough to know that Celeron=bad, you can count on 480 pins for a dual channel DDR2 implementation. Would you rather have that, or six FB channels?
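Spelling out the pin arithmetic behind that question (a sketch using the figures quoted above):

    fb_pins_per_channel = 69
    ddr2_pins_per_channel = 240

    dual_ddr2_budget = 2 * ddr2_pins_per_channel                     # 480 pins
    fb_channels_in_budget = dual_ddr2_budget // fb_pins_per_channel
    print(fb_channels_in_budget)                                     # 6 FB channels, with pins left over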

Either way, a dual channel FB setup can be done in two board layers, including power routing. A single DDR2 channel needs three layers for the same thing. More layers mean higher board cost. If you look at it from the old tradeoff angle, you can make an FB implementation in a lower number of layers with equal bandwidth, or much higher bandwidth on the same number of layers.

But remember, the point of this exercise is capacity, or at least a major point is capacity. Each FB channel can have eight DIMMs on it. For a server, where you go for capacity over cost, they were tossing around some pretty one-sided numbers at IDF. For a lower pin count, 420 vs 480, you get about 4x the bandwidth, 40GBps vs 10GBps, and up to 48 times the capacity. All this uses the same DDR2-800 DRAM chips that a standard DDR2 DIMM would use. Any time you can quantify the deliverables of a new architecture using the term 'order of magnitude', and it describes anything but cost, you are on to something. I think the Intel team can do this with a straight face.
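One way to read the capacity side of those numbers (a sketch; the eight-DIMMs-per-channel and six-channel figures come from the article, while the choice of a single DIMM as the baseline is my assumption):

    dimms_per_fb_channel = 8
    fb_channels = 6                        # roughly the 420-pin configuration quoted above

    total_fb_dimms = fb_channels * dimms_per_fb_channel
    print(total_fb_dimms)                  # 48 DIMMs, which appears to be where the
                                           # 'up to 48 times the capacity' figure comes from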

There is a downside, right? Yes, latency, but it appears manageable. The FB architecture adds two types of latency: serialization delay, and the transmission delay each added buffer brings. The signal must be read by a buffer and either acted upon or passed on. By the time you get to DIMM number 8, that can add significant time. In absolute terms, the latency is 3-9 ns, and each hop you go out adds another 2-6 ns.
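A rough model of how that stacks up at the far end of the channel. The 3-9 ns and 2-6 ns ranges are from the article; treating the per-hop figure as applying to each buffer beyond the first is my reading:

    base_ns = (3, 9)         # latency added at the first buffer (best case, worst case)
    per_hop_ns = (2, 6)      # extra latency for each additional buffer in the chain
    dimm_number = 8

    extra_hops = dimm_number - 1
    best = base_ns[0] + extra_hops * per_hop_ns[0]
    worst = base_ns[1] + extra_hops * per_hop_ns[1]
    print(best, worst)       # 17 to 51 ns of added latency at DIMM 8, under this reading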

Intel says it has gone to great lengths to address these latencies. First is the serialization delay. That part is unavoidable, it will happen no matter what you do. As the speed of the RAM increases, the absolute time of this delay decreases. At 400MHz, the delay will be twice the delay of RAM at 800MHz. Since speeds are likely to go up in the future, this issue will diminish as time goes on. Also, looking at the Rambus architecture, it didn't hurt performance all that much; there were some pretty high performance chipsets using that technology. This leads me to believe that serialization latency will not be a killer.
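The scaling is just the clock period at work. A minimal worked example, with the number of bits per transfer made up for illustration:

    bits_per_transfer = 72                  # hypothetical payload size in bits

    def serialization_ns(clock_mhz):
        period_ns = 1000.0 / clock_mhz      # assume roughly one bit time per clock
        return bits_per_transfer * period_ns

    print(serialization_ns(400) / serialization_ns(800))   # 2.0: halve the clock, double the delay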

The other issue is potentially more troubling. Intel addressed this by not having the signals be stored and then retransmitted. The data travels along a special fast-pass-through channel in the buffer itself. This lessens much of the latency that would be induced by store and forward architectures.

Additionally, you have reads and writes on separate channels, so you can do tricks that were impossible with standard shared bus architectures. DIMMs can act independently, some reading while others are writing, and there is no read-to-read delay. The last thing is that the DRAM is run in sync with the memory channel. When you run memory in an asynchronous fashion, it adds delays and makes just about everything more complicated. Complexity means pain for designers, and reduced performance for users.

Overall, all this adds up to a higher latency, but much of it is mitigated by clever architectural tricks. An individual read may be slower, but since you can do multiple things at once, the overall effect is not all that bad.

Since a good FB implementation has at least two channels for every DDR2 channel on a comparable system, you end up with the latency hitting you at light loads. While the DDR2 implementation may be faster for a lightly used desktop system, as soon as you stress the memory subsystem, FB starts to pull ahead. The way this happens is through concurrent reads and writes. If you have two DDR2 channels, each channel can only do one operation at a time. For a comparable FB setup with four channels, you have twice the bandwidth, and you can do eight simultaneous operations, four reads and four writes for example. The FB spec allows you to send three different commands to three different DIMMs in the same DRAM clock period.
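Putting numbers on that comparison (a sketch restating the article's figures; the one-operation-per-channel assumption for DDR2 follows from the shared bus point above):

    ddr2_channels = 2
    ddr2_ops_in_flight = ddr2_channels * 1        # one read or write at a time per shared-bus channel

    fb_channels = 4
    fb_ops_in_flight = fb_channels * 2            # one read and one write per channel, on separate lanes

    print(ddr2_ops_in_flight, fb_ops_in_flight)   # 2 vs 8 simultaneous operations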

This flexibility means that more of the theoretical capacity of the memory subsystem can be utilized. Where DDR2 hits a ceiling, FB can keep on going. The more DIMMs you add, the more flexibility you have. The more channels you add, the more hoops you can jump through. In light of this, the low load latency hit doesn't seem all that bad anymore.

The architecture was built with expansion in mind, not just in the number of DIMMs supported, but also in other ways. Things that were add-ons in previous specs have been designed in from the start; what used to be an ugly hack is now part of the spec from day one.

The first of these is length. Previously, you needed to add repeaters if you wanted any sort of distance between the controller and memory. Repeaters were so loved in the industry that I can't remember seeing one in use for a long time. FB solves this in two ways. First, the architecture is designed to allow 12 inches from the controller to the DIMM, an extremely long way by modern standards, almost the length of a modern motherboard. Repeaters, designed in from the start, will add another 12 inches, allowing the signals to travel further than a standard 19-inch rack is wide. That should be enough for most people.

Risers are also built into the spec, so you can have those fragile but expensive boards sticking up vertically in your server should you feel the need. While it may be in the spec, hopefully they can be avoided with the added length allowed by FB in the first place. The spec also has a logic analyzer interface defined so you can see what is happening on the channel without disrupting it. As speeds rise, this will move from a luxury to a necessity for debugging boards. Its inclusion goes to show how well this spec was thought out.

If that isn't enough for you, think about this. If you look at the buffer itself, it has three purposes: sending the memory signal to the controller, passing things on to the next buffer, and talking to the RAM itself. The first two depend only on the buffer and memory controller, something that isn't likely to change much. The part where it talks to the memory chips will change. Speeds go up, changes like the DDR to DDR2 transition happen, and almost nothing here remains constant. µ

Part Three of this article will be published tomorrow

 
