Chip weather forecast: Price wars are imminent. Squally showers.
AT LONG LAST, AMD is releasing Shanghai, the company's next generation core. If you think of this as an unborked Barcelona, you are thinking right.
Any comparison to the B word tends to bring an immediate unpleasant reaction from most observers, but Shanghai doesn't deserve any of that. It is a really solid chip: it consistently scores 20-30 per cent higher than Barcelona and usually beats Intel's Penryn-based CPUs in floating point, though it loses by a hair on integer. In any case, this is renewed competition at its finest.
So, what are the raw specs? From a slideware view, it is simply a Barcelona, shrunken from 65nm to 45nm, with an extra 4MB of L3 cache (6MB total now) thrown in. The clocks went from the older 2.0-2.5GHz to 2.3-2.7GHz while keeping the same TDP (Please note, AMD quotes ACP numbers, so this corresponds to 90-95W TDPs). So far, so similar.
The devil, however, is in the detail, and there are a lot of detail changes and fixes from Barcelona. The big bangs are obviously the new 45nm process, better prefetch, more and better cache, HyperTransport 3.0 (HT3), faster virtualisation, RAS features on the cache and something called Smart Fetch. Let's take each of these in detail.
The 45nm process has been getting a lot of hype because it is likely the first widely-deployed use of immersion lithography. The idea is simple: you do some of the lithography steps under water, or under fluid of one sort or other. This allows you to play games with the indices of refraction, meaning you can draw finer lines on the wafer. Intel does the same thing in air, but they do it with two passes. Six of one, half a dozen of the other.
One thing that is notably lacking is High-K metal gates, something that will likely come with an update to the process, or more likely at 32nm in about two years. If the take-home message you are getting is that the lines are smaller now, that is all you really need to know.
Moving right along, we get to better prefetch. What this does is to allow the CPU to pull things from memory to the cache in a more intelligent manner. Every time some needed data is not in the cache, the CPU essentially sits there and waits, sometimes for tens or hundreds of cycles. A few tenths of a percent more efficiency here can pay back huge performance dividends.
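To put rough numbers on that, here is a back-of-the-envelope sketch using the textbook average memory access time formula (hit time plus miss rate times miss penalty). The 3-cycle hit time, the 200-cycle miss penalty and both miss rates are made-up illustrative figures, not AMD measurements.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Textbook average memory access time, in cycles."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# All figures below are hypothetical, chosen only to illustrate the effect.
HIT_CYCLES = 3
MISS_PENALTY_CYCLES = 200

before = average_access_cycles(HIT_CYCLES, 0.020, MISS_PENALTY_CYCLES)
after = average_access_cycles(HIT_CYCLES, 0.017, MISS_PENALTY_CYCLES)

saving = (before - after) / before
print(f"before: {before:.1f} cycles, after: {after:.1f} cycles")
print(f"average access time drops by {saving:.1%}")
```

With these assumed figures, shaving three tenths of a percentage point off the miss rate cuts the average access time from 7.0 to 6.4 cycles, or roughly 8.6 per cent.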
Related to this is more cache, upgraded from 2MB of L3 to 6MB. This adds 10-15 per cent more performance, a number that seems quite plausible. The biggest gain won't appear on the slides, and it is one of the biggest unborkings of the whole project. To say that the Barcelona L1 and L2 cache was slow and didn't work all that well is being kind. Shanghai cleans this all up in a big way, and the L1 clean-up drops latency by a lot. It is hard to overstate how big a deal this is.
HT3.0 is not all that big a deal, mainly because of backward compatibility. Current S1207 Barcelona boards can take a Shanghai with a BIOS update. Since S1207 does not use HT3.0, Shanghais plugged into them will step down to the older HT1.0 speeds. When the new SR5690 chipset comes out in mid-2009, this will no longer be a problem.
Virtualisation is one area where AMD still leads in most ways, and Shanghai doesn't slow down. There are three main improvements, and one more on the horizon. The biggest is a 25 per cent drop in world switch time. Basically, when a hypervisor changes VMs, you can have anywhere from several to hundreds of world switches. Dropping the cost of each by a quarter can lead to huge speed increases.
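How much a 25 per cent faster world switch helps depends on how much time a workload spends switching in the first place. A minimal Amdahl's law sketch follows; the switch-time fractions are hypothetical, and only the 25 per cent figure comes from AMD.

```python
def overall_speedup(switch_fraction, switch_time_reduction=0.25):
    """Amdahl's law: only the world-switch share of runtime gets faster.

    switch_fraction: share of total runtime spent in world switches (assumed).
    switch_time_reduction: 0.25 means each switch takes 25 per cent less time.
    """
    new_runtime = (1 - switch_fraction) + switch_fraction * (1 - switch_time_reduction)
    return 1 / new_runtime


# Hypothetical workloads, from light to very heavy on world switches.
for fraction in (0.05, 0.20, 0.50):
    print(f"{fraction:.0%} of time in switches -> {overall_speedup(fraction):.3f}x overall")
```

Under these assumptions the payoff scales with how switch-heavy the workload is: a VM that rarely traps to the hypervisor barely notices, while one that spends half its time in world switches gains about 14 per cent.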
Similarly, page fault handling and TLB accesses are sped up. If you don't have a clue what these terms mean, read this and this and this. If your ADHD won't let you get through those, the short story is, once again, virtualisation gets faster.
The last one is also coming with the SR5690 chipset, and it is virtualised I/O through the use of a virtualisable IOMMU. Up until now, whenever you needed to access a device like a video card or a NIC, you had to drop to the hypervisor, twiddle the data, and pop back in to the VM. This was very expensive and slow.
The I/O virtualisation allows the VM to directly see devices that are virtualisation aware, although the current count of such devices is zero. That said, it won't be long before they are out, and things get much faster. The first generation only allows for 1:1 mapping, i.e. a VM gets hard assigned a device, and just owns it. Inflexible though this may be, it is well worth it for the performance gain. NICs are cheap, buy lots. Future revs will let you do many mappings, but that is a little farther out.
The SR5690 also enhances the Device Exclusion Vector. This keeps VMs from stepping on each other's memory and I/O with much more granularity than the old way.
Nearing the end, we come back to the L3 cache. Shanghai can map out parts of the cache that go bad, a feature called L3 Cache Index Disable. It is unlikely to be used, but if it is needed, it may just save a system from some quite nasty errors.
Last up we have another feature of the L3 and core called Smart Fetch (SF), by far the most curiously named feature of the lot. It doesn't really fetch anything; it is more of a cache saving feature. SF basically takes the content of the L1 and L2 cache (each core has its own) and saves it to L3. This allows the core to be turned completely off on the fly without losing any data. Yes, it is an energy efficiency feature, and AMD claims up to 21 per cent power savings from it. How this is related to fetching is still an open question.
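As a toy model of the behaviour described above, and nothing more, the save-then-power-off sequence looks roughly like this. The class and method names are invented for illustration, and real hardware works on cache lines and coherency state, not Python dictionaries.

```python
class Core:
    """Toy core with private L1/L2 contents and a handle to a shared L3."""

    def __init__(self, name, shared_l3):
        self.name = name
        self.l1 = {}
        self.l2 = {}
        self.shared_l3 = shared_l3
        self.powered = True

    def park(self):
        """Copy private cache contents to the shared L3, then power off."""
        self.shared_l3.update(self.l2)
        # In this toy, an L1 entry overrides an L2 entry for the same address.
        self.shared_l3.update(self.l1)
        self.l1.clear()
        self.l2.clear()
        self.powered = False

    def wake(self):
        """Power back on; the private caches start empty."""
        self.powered = True

    def read(self, address):
        """Look in L1, then L2, then fall back to the shared L3."""
        if not self.powered:
            raise RuntimeError(f"{self.name} is powered off")
        for level in (self.l1, self.l2, self.shared_l3):
            if address in level:
                return level[address]
        raise KeyError(address)


l3 = {}
core0 = Core("core0", l3)
core0.l1[0x1000] = "hot data"
core0.l2[0x2000] = "warm data"

core0.park()   # private caches saved to L3, core switched off
core0.wake()
print(core0.read(0x1000), "/", core0.read(0x2000))
```

The point of the sketch is only that nothing is lost across the power-down: after waking, both values are still readable, now served from the shared L3.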
Other things not directly core related also add performance here and there. The biggest of these is support for DDR2-800 memory and more DIMMs per channel. Shanghai supports one more DIMM than Barcelona, or it allows you to go one speed grade higher with the same DIMM count. Take your pick, neither is a bad thing.
So, now that you know the differences between the older and newer parts, what exactly is coming out, and how much is it going to cost?
There are nine parts in total, five 2-socket and four 4/8-socket chips, all for servers. The single socket version will come in Q1, likely sooner rather than later.
The 2-socket parts start with the 2.3GHz 2376, and go up by two model numbers and 100MHz for each bin until you hit the top end 2.7GHz 2384. The 4/8-socket models lose the low end, and go from an 8378 at 2.4GHz to the 8384 at 2.7GHz.
Costs for the 2 socket chips are $377, $523, $698, $873, and $989, priced from slow to fast, 2376 to 2384. 4/8-socket parts are $1165, $1514, $1865 and $2149, again from 8378 to 8384. All of them should be available now, or very shortly.
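The numbering rule and price list above can be laid out as a table. The sketch below simply rebuilds the line-up from the figures quoted in this article; the helper name build_lineup is made up for the purpose.

```python
def build_lineup(first_model, first_ghz, prices):
    """Each bin adds two to the model number and 100MHz to the clock."""
    lineup = []
    for step, price in enumerate(prices):
        model = first_model + 2 * step
        ghz = round(first_ghz + 0.1 * step, 1)
        lineup.append((model, ghz, price))
    return lineup


# Starting points and prices as quoted in the article (US dollars).
two_socket = build_lineup(2376, 2.3, [377, 523, 698, 873, 989])
multi_socket = build_lineup(8378, 2.4, [1165, 1514, 1865, 2149])

for model, ghz, price in two_socket + multi_socket:
    print(f"{model}  {ghz:.1f}GHz  ${price}")
```

Five prices for the 2-socket range and four for the 4/8-socket range land exactly on the 2384 and 8384 at 2.7GHz, so the nine-part count and the clock steps are consistent with each other.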
In the end, it looks like AMD is back in the game, not necessarily a clear winner, but far, far better off than they were with Barcelona.
They have a lead of a few months over the 2-socket Nehalem parts looming on the horizon, and a full year before the 4-socket Nehalems hit.
Being competitive is a much nicer place to be. µ
Wonderful writeup, and it seems like a good product.

Can't wait to see independent benchmarks. BTW, I just checked the OEM websites and no one seems to have it for sale yet...
AMD leads in virtualization???? 
There is no independent benchmark for virtualization, only VMmark, which is VMware exclusive,
and secondly Intel has leadership in this benchmark and will continue to even with Shanghai. By the time HT3-based performance is available (mid 09), Nehalem will be out and sink any hope of "virtualization leadership" for AMD.
More competitive than barcelona, yes, but not really competitive.
Charlie has posted an article about AMD! Where are the Nvidia fanboys? Where are the flames?

BTW, I like your article, Charlie.
Ya, I've seen the numbers. Compared to Penryn using FB-DIMMs it looks good. But that's it. I have seen the numbers for Intel's server IC7. K10.5 is a dog in comparison. Too bad for AMD. Times are bad. Gives Intel time to get IC7 server parts out. There is NO comparison.

Penryn server parts will be way cheaper than AMD's. IC7 will do the dirty work.
Kem, I don't know where you read that Intel leads in virtualization, but looking at VMware's numbers it looks like AMD does. From http://www.vmware.com/products/vmmark/results.html:
8 core, 2 socket:
1. Dell AMD 2384 2.7G: 11.2@8 tiles
2. HP Intel Xeon 3.33G: 9.15@7 tiles

16 cores, 4 socket:
1. Dell AMD 8384 2.7G: 20.35@14 tiles
2-9 various AMD systems
10. HP Intel Xeon 2.93G: 14.14@10 tiles

Granted, it does look like Intel owns the 4 core benchy, but then again there is only one system in that category.
Competition from the lesser green team is very welcome, but I still remember an article by the not so lovable fat boy (Charlie) stating Barcelona was going to be around 15% faster than Core 2 clock for clock, and that although it would be clocking 2.7GHz when it hit, the clock for clock advantage would make up for it. The result? Well, Core 2 was in fact that 15% faster and GHz was down the drain. Personally I think it will be competitive with Penryn clock for clock, but then you can still overclock those to 4+ GHz, and then there is Nehalem. PS fat boi, the anti-Nvidia thing is getting old, just report the facts ok
Yes, AMD leads in virtualization. At work I was surprised to discover, via my own benchmarks with KVM-78 under Linux, that a 2.2GHz Phenom 9550 was literally kicking the ass of a 2.4GHz Core2 Q6600 by as much as 20-30% (the VMs were compiling C++ and Java code). Not only is AMD cheaper, it also consumes less power and does more at a lower frequency!

My guess is that the IMC (Integrated Memory Controller) and NPT feature (Nested Page Table) are mostly responsible for this lead... Intel is supposed to introduce their own version of NPT in Nehalem, it will be called EPT (Extended Page Table). Only thing, 2-socket Nehalems won't ship before 2009 Q2, so AMD will keep its lead for a while...