A clumsy way of getting around the 8086/8088's 1 MB address space, but still far better than swapping to hard disk, which MS-DOS, well, couldn't do anyway - there was no such thing as virtual memory or paging on a PC in those days.
Importantly, the PC/XT expansion bus had basically the same 8-bit 4.77 MHz timing as the memory bus on those systems, so the only penalty was the added latency - the theoretical bandwidth was the same as for the PC's own memory. The exception was the then-ultrafast, stylish Olivetti M24 PC, which ran a fully 16-bit 8086 at 8 MHz, often making it twice as fast as an IBM PC/XT.
Once the PC/AT with its 80286 came out, the address space went up to 16 MB, so those add-in memory cards could also be linearly addressed. The PC/AT bus (soon also known as ISA) was now 16-bit as well, upping its clock in tandem with the 80286 CPU to 6, 8, 10, then 12 MHz, where timings and the connector became a problem to go any farther. Anyway, many PC/AT systems could - and some actually did - expand their memory this way, far beyond the limits of the CPU bus and the chipset memory controller.
With the 386, the first CPU where the speeds of the CPU bus and the I/O bus diverged, the lure of expanding memory via I/O slots waned, as the performance gap was simply too wide.
Fast forward to the 21st century now...
Ultra-large memory - a hidden Opteron advantage?
Now that the x86 architecture is fully in its 64-bit phase - who would have thought it in 1980, looking at the clumsy 8088 - it can address lotsa memory if given the chance. Surely that's good for supercomputing, large databases, future super-duper games or 3-D simulations.
The current dual-CPU Xeon64 chipsets provide four to eight slots of RAM, which, using 2 GB registered DDR2-400 DIMMs, gives you up to 16 GB of on-board memory - not bad for a start. For dual Opterons, same story - using 4 GB registered DDR DIMMs, you get 32 GB of RAM right now. On quad-Opteron and quad-CPU Potomac XeonMP systems, there are usually four memory channels, each with four DIMM sockets; if some kind of bridging is used to expand the memory capacity, at the cost of higher latency, that capacity doubles.
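The capacity figures above are simple slot arithmetic; a quick sketch, using only the channel counts and DIMM sizes quoted in the text:

```python
# Back-of-the-envelope memory capacities, all figures as quoted in the article
def capacity_gb(dimm_slots, dimm_size_gb):
    """Total RAM from a number of DIMM slots filled with modules of a given size."""
    return dimm_slots * dimm_size_gb

# Dual Xeon64: eight slots of 2 GB registered DDR2-400
print(capacity_gb(8, 2))       # 16 GB
# Dual Opteron: eight slots of 4 GB registered DDR
print(capacity_gb(8, 4))       # 32 GB
# Quad system: four channels x four DIMM sockets, 4 GB each
print(capacity_gb(4 * 4, 4))   # 64 GB - doubling to 128 GB with bridging
```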
But what if we need even more memory, yet no more CPUs? After all, many large computing jobs may be happy with a certain fixed amount of computing power, but want as much RAM as possible - database searches, proteomics, high-resolution weather models or computational chemistry come to mind.
On Xeons, well, we could either build memory controllers with more channels, use bridges translating one memory channel into two, or wait for the FB-DIMM generation, which brings more channels anyway.
What about Opterons? The integrated dual-channel memory controller limits you to four DIMMs per CPU at DDR400 timing, or up to eight DIMMs at DDR333/266 timing (see the HP Proliant DL585). This way, a four-way Opteron could have 64 GB of DDR400 or 128 GB of slower DDR memory on board. Then what?
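Those per-CPU DIMM limits multiply out as follows (a sketch; the four- and eight-DIMM limits are the ones stated above):

```python
# Per-CPU DIMM limits of the integrated Opteron memory controller, as described above
DIMM_GB = 4   # 4 GB registered DDR modules
CPUS = 4      # four-way Opteron box

ddr400_gb = CPUS * 4 * DIMM_GB  # four DIMMs per CPU at DDR400 timing
ddr333_gb = CPUS * 8 * DIMM_GB  # eight DIMMs per CPU at DDR333/266 timing
print(ddr400_gb, ddr333_gb)     # 64 GB fast, or 128 GB of slower DDR
```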
Well, each 8xx series Opteron CPU has three HT channels (currently supported at 1 GHz for an 8 GB/s data rate per channel). In a quad-CPU configuration, let's say two channels go to the two neighbouring CPUs, leaving one channel free on each CPU. Let's say then that one channel on CPU 0 and one channel on CPU 2 go to the I/O through their respective PCI-X and PCI-E HT bridges and tunnels (sounds as if we're talking about a highway). This gives us 16 GB/s of total I/O bandwidth, more than enough for any current dual-GPU workstation, server or even 'distributed shared memory' tight cluster with, say, multiple Quadrics rails.
So, one channel on CPU 1 and another on CPU 3 stay free - 16 GB/s of unused bandwidth. What if those two channels could connect to a large daughtercard (maybe in a dual-channel HTX slot format) with some nice memory controller circuitry that takes in those two HT channels on one side and provides, say, an extra eight 64-bit buses of DDR2-400 memory on the other? That gives us an extra 32 DIMM sockets - with 4 GB DIMMs, that's an extra 128 GB of RAM, and if using bridges/translators, you could further double the number of channels and DIMMs, to a total of 256 GB of extra RAM, on top of the usual on-board memory.
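The numbers behind this hypothetical daughtercard can be sketched the same way (channel counts, bus counts and DIMM sizes all as proposed above; the card itself is, of course, speculative):

```python
# The hypothetical HTX memory daughtercard sketched above
HT_CHANNELS = 2          # the two unused HT links on CPU 1 and CPU 3
HT_GBPS = 8              # 8 GB/s per 1 GHz HT channel
print(HT_CHANNELS * HT_GBPS)   # 16 GB/s of otherwise wasted bandwidth

ddr2_buses = 8           # 64-bit DDR2-400 buses provided by the card
sockets_per_bus = 4      # 32 DIMM sockets in total
dimm_gb = 4
extra_gb = ddr2_buses * sockets_per_bus * dimm_gb
print(extra_gb)          # 128 GB extra - 256 GB if bridges double the channels
```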
Now, this memory would naturally have higher access latency for the on-board CPUs compared to their own RAM (probably an extra ~200 ns), but the bandwidth would be about the same; in fact, two CPUs could access such a RAM bank in parallel at full HT speed without contention, thanks to the sheer number of channels. If latency reduction is a must, a local SRAM cache of, say, 64 to 128 MB could optionally front the two HT channels.
Any of the four on-board CPUs would need at most two HT hops to reach the memory controller on the daughtercard, so, in an optimised design, the speed penalty would be low enough to treat this extra memory as a linear extension of main RAM, without the NUMA-ish "near" and "far" memory tricks. An optimised quad-socket (up to eight CPU cores) Opteron board with good cooling could fit this daughtercard on top of the motherboard, and still keep the whole thing comfortably within a 3U chassis.
In the near future, with new Opteron sockets and more, faster HT 2.0 channels (after all, AMD could easily put up to six HT channels on next-generation high-end Opterons for greater SMP, I/O and memory scaling), this approach would make even more sense.
And for now, just imagine: 192 GB of RAM with very respectable bandwidth in a standard 3U quad-CPU box! A great deal for memory-intensive HPC or database clusters, and hey, this much RAM will probably be enough even for the near-future 64-bit MS Office, no matter how bloated that one is expected to be... µ