The first thing you notice is that Intel has abandoned the long pipe, high speed, lower IPC model that was the norm for the last few years. Merom just about halves the pipeline length. It is now 14 stages, but whether that is 14 critical stages, or 14 overall was not stated. Either way, it is going to give up a lot of MHz to the Pentium 4, but will end up faster anyway.
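The "fewer MHz but faster anyway" claim rests on throughput being roughly IPC times clock speed. As a rough sketch of that arithmetic: the 3.0GHz figure is the top of the range quoted later in this piece, while the 3.8GHz Pentium 4 clock and the 1.5x IPC ratio are assumptions picked purely for illustration, not Intel figures.

```python
def break_even_ipc_ratio(old_clock_ghz, new_clock_ghz):
    """IPC advantage the slower-clocked chip needs just to match the faster one."""
    return old_clock_ghz / new_clock_ghz


def relative_performance(ipc_ratio, old_clock_ghz, new_clock_ghz):
    """Throughput of the new chip relative to the old, modelled as IPC x clock."""
    return ipc_ratio * new_clock_ghz / old_clock_ghz


if __name__ == "__main__":
    # Assumed clocks: 3.8GHz for the long-pipe chip, 3.0GHz for the short-pipe one.
    print(f"break-even IPC ratio: {break_even_ipc_ratio(3.8, 3.0):.2f}")
    # Hypothetical 1.5x IPC advantage, for illustration only.
    print(f"relative performance: {relative_performance(1.5, 3.8, 3.0):.2f}")
```

This ignores memory stalls and workload mix entirely, so treat it as a back-of-the-envelope model rather than a prediction.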
The basic structure is a four issue wide core. Without going into specifics, Intel said the Merom cores can keep up a sustained 4 ops from issue to retire. This probably has a bunch of caveats, addenda and asterisks, but it is clear that wider is the course for the day. Each pipe is a full pipe, versus the old P6 derived simple and complex pipe structures. The number of ALU ports is greatly increased also.
The family picks up a lot from the previous Yonah architecture: the dual core, shared L2 and low power architectures. It also picks up some of the baggage, like the long in the tooth FSB and stronger integer performance than floating point. Everything is new, even if it looks similar.
Let's look at these things in a little greater detail. First is the shared L2, something that debuted with Yonah. This carries forward to Merom, but there are some important differences. Since it was there from the first day rather than added on to the architecture, the design can make assumptions based on it. One of those is a direct L1 to L1 link to cut down the time needed to snoop the cache. Since it cuts out two L2s and a bus traversal, it can cut the time down to one third of what it took the 'old way'. It may not do much between sockets, which is what some of the Blackford chipset enhancements are for, but it will make a significant difference.
The cache is fed through two new prefetch algorithms, which are bandwidth aware. It was not stated outright, but it looks like one is for L1 and the other is for L2. They can change their modes depending on how much bandwidth is available. It is the next step of speculative prefetch.
Along the same lines, Intel has a technology called Memory Disambiguation. The fancy words can be translated to English as 'we check dependencies on retire, not on load'. Combined with the speculative loading, it can do a lot to keep stalls from happening and to raise IPC.
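To make the 'check on retire, not on load' idea concrete, here is a toy model. It is a sketch of the general technique only, not Intel's actual mechanism: the function name, the op tuples and the replay counter are all made up for illustration, and real hardware tracks in-flight addresses rather than rescanning the program.

```python
def run_speculative(ops, memory):
    """Toy model of loads issued ahead of older stores and checked at retire.

    ops is a program-order list of ("store", addr, value) or ("load", addr).
    Returns (values seen by the loads, number of loads that had to be replayed).
    """
    memory = dict(memory)  # work on a copy of the initial memory image

    # Phase 1: every load executes early, before any older store has written.
    speculative = {
        i: memory.get(op[1], 0) for i, op in enumerate(ops) if op[0] == "load"
    }

    # Phase 2: retire in program order, checking loads against older stores.
    results, replays = [], 0
    for i, op in enumerate(ops):
        if op[0] == "store":
            _, addr, value = op
            memory[addr] = value
            continue
        addr = op[1]
        value = speculative[i]
        # Dependency check happens here, at retire, not when the load issued.
        if any(o[0] == "store" and o[1] == addr for o in ops[:i]):
            value = memory.get(addr, 0)  # speculation was wrong: redo the load
            replays += 1
        results.append(value)
    return results, replays


if __name__ == "__main__":
    program = [("store", 0x10, 7), ("load", 0x20), ("load", 0x10)]
    print(run_speculative(program, {0x10: 1, 0x20: 5}))
```

In this model the load from 0x20 never waits on the store, which is the win. Only the load that really does alias the store pays for a replay.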
One of the tricks that Banias brought to the forefront was Micro-Op Fusion, basically the ability to gang multiple decoded operations into a single one. Merom takes this much farther, and adds a more sophisticated version to the mix. In addition, Merom has Macro-Op fusion, the ability to gang x86 operations before decode. As an example, if you have a multiply followed by an add, Macro-Op fusion can turn that into a Multiply and Accumulate. Again, this simplifies the complex process of x86 execution and again increases IPC.
Then comes power savings, which is what this family was designed to do. Pentium 4s are pretty much on all the time they are powered up, and if you need to cook eggs, this is your chip. Banias/Dothan took power savings seriously, and allowed the chip to power down units that were not in use. This was a massive power savings.
Merom goes well beyond this: all units are powered down in the default state. When units are needed, they are powered up, and the chip takes power savings to a new level entirely. The unit power up takes a few clock cycles, and again, while exact numbers are classified, it is more than one and less than 10 in most cases. This depends greatly on processor power state, but it should not be all that noticeable.
On the baggage side, the lower integer performance is more due to the shorter pipe length, and it looks like Merom cores will be faster than Opteron+'s in int, but lose a little to them in FP, quite the change.
A lot of this is due to bandwidth to the cores, and that is the weakest link for Merom. They keep the current infrastructure, can keep the chipsets, and keep the FSB. The target for Woodcrest, the server version of Merom, is a 1333MHz FSB. The quad core MCM Clovertown will drop down to 1066, and Conroe will sit on 1066 also. I think that Conroe will end up on a 1333, but officially, it isn't. Merom will be lower due to power constraints.
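To put rough numbers on that bottleneck, peak FSB bandwidth is just the transfer rate times the bus width. The sketch below assumes a 64-bit (8 byte) data bus, reads the quoted "MHz" figures as effective transfer rates, and splits the bus evenly between cores. The even split is a simplification for illustration, not a statement about how any given platform arbitrates the bus.

```python
FSB_WIDTH_BYTES = 8  # assumption: 64-bit wide FSB data path


def fsb_peak_gb_per_s(mega_transfers_per_s, width_bytes=FSB_WIDTH_BYTES):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * 1e6 * width_bytes / 1e9


def per_core_share_gb_per_s(mega_transfers_per_s, cores):
    """Even split of one bus across the cores hanging off it."""
    return fsb_peak_gb_per_s(mega_transfers_per_s) / cores


if __name__ == "__main__":
    for name, rate, cores in [
        ("Woodcrest", 1333, 2),
        ("Conroe", 1066, 2),
        ("Clovertown", 1066, 4),
    ]:
        print(
            f"{name}: {fsb_peak_gb_per_s(rate):.2f} GB/s peak, "
            f"{per_core_share_gb_per_s(rate, cores):.2f} GB/s per core"
        )
```

These are theoretical ceilings, and sustained figures will be lower. Even so, the per-core number for a quad core on a 1066 bus shows why the FSB is the weak link.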
How much power does it take? Merom is listed at 35W TDP, with a 1-2W average consumption. Intel is supposed to be binning on power consumption as well as speed, so the higher speed parts may actually end up using less power. Conroe sits at 65W for the desktop, and Woodcrest is at 80W. Conroe and Woodcrest are substantial improvements over their predecessors, and Merom is slightly higher outright, but vastly more efficient as far as performance per watt is concerned. It should end up more efficient overall because it can do more work in less time, but I will wait for samples before I say that for sure.
How fast are they? Well, as far as raw clock speeds go, Merom will be in the low 2GHz range, Conroe and Woodcrest in the 2.5-3GHz range, and Clovertown a couple of bins down from Woodcrest. Clock for clock, look for a 30% improvement. This chip is going to give AMD quite the run for its money. µ