Intel talks specifically about Penryn
One of the biggest problems with any chip launch is cutting through the buzzwords and marketspeak. With that in mind, please feel free to ignore things like calling a bump up in cache from 4MB to 6MB "Intel Advanced Smart Cache" (their bold). I do. That said, there is a lot of good stuff here.
The 10,000 foot view is that this is a shrink of Merom from 65 nanometres to 45 nanometres, but it is heavily massaged on top of that. You can read more about the 45nm process here. Basically, you get about half the die area for the same transistor count, 20 per cent faster switching speeds and vastly lower leakage. These numbers are what is possible, not necessarily what you will see.
The architectural changes are more important. The one you probably have heard of is SSE4, a new instruction set that makes multimedia faster and happier if your DRM infection deems you worthy of viewing your purchase.
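For a rough sanity check on the "about half the die area" figure, here is a back-of-envelope sketch. It assumes an ideal shrink where every feature scales linearly with the node name, which real processes never quite manage, so treat it as an upper bound on the savings rather than Intel's number.

```python
# Back-of-envelope sketch: ideal area scaling for a 65nm -> 45nm shrink.
# Assumption: every dimension scales with the node name. Real layouts
# (pads, analog bits, wiring) do not shrink this cleanly.

def ideal_area_ratio(old_nm: float, new_nm: float) -> float:
    """Area of the shrunk die relative to the original, same layout."""
    return (new_nm / old_nm) ** 2

ratio = ideal_area_ratio(65, 45)
print(f"45nm die is roughly {ratio:.0%} of the 65nm area")
```

Linear dimensions shrink by 45/65, so area shrinks by the square of that, which lands a little under one half.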
The instruction set is all fine and dandy, you can just slap it in microcode if you want to take the easy way out. Intel didn't and added in a bunch of plumbing to avoid bottlenecking the new instructions. One of those is called the Super Shuffle Engine (S3), and it does what it says, shuffles bits around. S3 is very useful for things like data interleaving, and it can do all of this in (mostly) one cycle. The old way could do it, but it took longer and quite possibly involved multiple operations. S3 at one cycle avoids bottlenecking SSE4 and destroying throughput.
S3 is also useful for packing and unpacking, something that you never notice, but everything you use a computer for probably uses. Penryn has the ability to do most of the S3 ops in one clock, down from the 2-5 of the Merom family. Only one of the ops in the family, extract, takes three clocks, but that is down from the five it used to take.
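To make "shuffling" and "unpacking" a bit less abstract, here is a minimal software model of two such operations on a 16-byte register. It follows the documented behaviour of the x86 PSHUFB and PUNPCKLBW instructions; the function names are made up for illustration, and none of this says anything about how Intel's shuffle hardware is actually built.

```python
# Sketch of what a byte shuffle and a byte interleave compute.
# Function names are hypothetical; the semantics follow the x86
# PSHUFB and PUNPCKLBW instructions on 128-bit (16-byte) registers.

def shuffle_bytes(src: bytes, control: list[int]) -> bytes:
    """Output byte i is src[control[i] & 0x0F], or zero if bit 7 of
    control[i] is set (PSHUFB-style)."""
    assert len(src) == 16 and len(control) == 16
    return bytes(0 if c & 0x80 else src[c & 0x0F] for c in control)

def interleave_low(a: bytes, b: bytes) -> bytes:
    """Interleave the low 8 bytes of a and b as a0 b0 a1 b1 ...
    (PUNPCKLBW-style unpack)."""
    assert len(a) == 16 and len(b) == 16
    out = bytearray()
    for x, y in zip(a[:8], b[:8]):
        out += bytes((x, y))
    return bytes(out)

reg = bytes(range(16))
# Reverse the byte order of the register with a single shuffle.
reversed_reg = shuffle_bytes(reg, list(range(15, -1, -1)))
# Interleave two registers, e.g. two colour planes into pairs.
mixed = interleave_low(bytes([0xAA] * 16), bytes([0xBB] * 16))
```

Each of these is one instruction on the chip, which is why shaving cycles off them matters for anything that rearranges data before crunching it.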
The other thing that they added in is called the Radix-16 divider. With Merom, the adder and multiplier were revamped, but the divider was more or less the same Radix-4 design that traced its lineage back to the Pentium, the chip, not the horse. The new one can do 4 bits per cycle instead of 2, more or less doubling the throughput.
It is also engineered to run at much higher clocks, so it will again not throttle the architecture. They also got much more aggressive with the way it works. Penryn will do better at ignoring leading zeros and early exits than Merom, building on many of the advances of the earlier chip. This aggressiveness also carries back to other FP and even Int ops revamped earlier.
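The bits-per-cycle claim is easy to see in a toy model. The sketch below is a plain digit-by-digit long division that retires a fixed number of quotient bits per step and counts the steps. It is a simplification under stated assumptions: real dividers use SRT-style digit selection with lookup logic rather than the integer division used here to pick each digit, and the `skip_leading_zeros` flag is only a crude stand-in for the leading-zero and early-exit tricks mentioned above.

```python
# Toy digit-recurrence divider. Assumption: one quotient digit of
# `bits_per_step` bits is produced per "cycle". Not Intel's design.

def divide(dividend: int, divisor: int, bits_per_step: int,
           width: int = 64, skip_leading_zeros: bool = False):
    """Return (quotient, remainder, steps) for unsigned operands."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    radix = 1 << bits_per_step
    # Either walk the full register width, or only the significant bits.
    bits = dividend.bit_length() if skip_leading_zeros else width
    steps = -(-bits // bits_per_step)  # ceiling division
    quotient = remainder = 0
    for i in reversed(range(steps)):
        # Bring down the next digit of the dividend.
        digit = (dividend >> (i * bits_per_step)) & (radix - 1)
        remainder = (remainder << bits_per_step) | digit
        # Hardware would select this digit with comparators or a table.
        q_digit = remainder // divisor
        remainder -= q_digit * divisor
        quotient = (quotient << bits_per_step) | q_digit
    return quotient, remainder, steps

radix4 = divide(1_000_003, 97, bits_per_step=2)    # 32 steps
radix16 = divide(1_000_003, 97, bits_per_step=4)   # 16 steps
radix16_skip = divide(1_000_003, 97, bits_per_step=4,
                      skip_leading_zeros=True)      # 5 steps
```

Same answer every time, but the radix-16 version takes half as many steps over a 64-bit operand, and skipping the leading zeros of a small dividend cuts it further.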
That brings us to the first big thing that the shrink gets us, bigger caches. The caches are upped from 4MB to 6MB, and associativity was upped from 16-way to 24-way. The most interesting bit, though, is the advances on loads and stores. The old way of dispatching speculative stores would stall when the data crossed a cache line.
Penryn removes this limit. You can cross cache lines without introducing a long wait, very handy for a lot of multimedia apps that don't use regular accesses to memory. Intel touts motion estimation, basically the core op of video on PCs, as a prime example of this.
Toss in the 1600FSB, and you have a new buzzword, Intel Advanced Smart Cache. That one encompasses the cache size, the associativity and the rest. Combined with the divider, S3 and various other improvements, Intel is expecting a 45% improvement in bandwidth-heavy and FP-heavy apps.
There are other important bits that raise performance, but not necessarily everywhere. One of the big ones is vastly improved VMEntry and VMExit in VT. You can read more here, but basically Penryn improves this by a claimed 25-75%.
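If you are wondering how often an access actually straddles two lines, the check is simple address arithmetic. This sketch assumes 64-byte cache lines (true of the Core 2 family, but an assumption as far as Intel's briefing goes) and a hypothetical helper name.

```python
# Sketch: does a memory access straddle two cache lines?
# Assumption: 64-byte lines. The helper name is made up.

LINE_BYTES = 64

def crosses_cache_line(address: int, size: int,
                       line_bytes: int = LINE_BYTES) -> bool:
    """True if bytes [address, address + size) touch more than one line."""
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return first_line != last_line

# A 16-byte load at every possible offset within one line, the sort
# of unaligned access a motion search makes when it slides a block
# around one pixel at a time.
splits = [off for off in range(LINE_BYTES)
          if crosses_cache_line(off, 16)]
print(f"{len(splits)} of {LINE_BYTES} offsets split a line")
```

For 16-byte loads that works out to 15 of the 64 possible offsets, so a search that steps through arbitrary byte positions hits the split case nearly a quarter of the time.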
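The size and associativity numbers hang together neatly if you do the sums. Assuming 64-byte lines (an assumption, not something Intel stated in this briefing), a set-associative cache has size / (ways x line size) sets:

```python
# Sketch: number of sets in a set-associative cache.
# Assumption: 64-byte cache lines for both chips.

MB = 1024 * 1024

def cache_sets(size_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Sets = total size / (ways * line size)."""
    sets, leftover = divmod(size_bytes, ways * line_bytes)
    assert leftover == 0, "size must divide evenly into sets"
    return sets

merom_sets = cache_sets(4 * MB, ways=16)
penryn_sets = cache_sets(6 * MB, ways=24)
print(merom_sets, penryn_sets)
```

Both come out to the same number of sets, which would be consistent with the extra 2MB arriving as additional ways rather than additional sets, though Intel did not spell that out.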
That brings us to power, or its lack. Intel is officially not changing the TDP specs of Penryn, keeping the current 50/80/120W for quads and 40/65/80W for duals. This is a bit misleading, but in a good way. If you noticed the earlier claims from the 45nm process, you saw both speed and power benefits.
Word on the street is that Intel is fudging the TDP numbers up; it could be pulling the 65W parts down to 45-50W and still not be fibbing. The rest of the TDPs would go down accordingly. Basically, these chips are a lot lighter on real world power use than Intel is claiming.
There are also no special enhancements to quad core power use. The comms protocol for QC power management has not been changed much. The big change is a new C6 sleep state, i.e. deeper than C5. As opposed to C4, it turns the core voltage way down, and turns off the L1 and L2 caches entirely.
This means when you wake up, you have to refill the caches to some degree, greatly increasing wake up time. There is a middle ground, aka magic pixie dust, in the CPU, so when you wake up, you don't have to refill the entire cache, just a few lines. Additionally, Intel built in a lag when going into and out of this mode to prevent thrashing.
So, how fast is it? One thing Intel didn't mention today was the half clock dividers that Penryn is capable of. Instead of 266/333MHz steps, they can now do 133/166/200MHz steps, allowing for a tighter spread of parts.
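The arithmetic behind those steps is just bus clock times multiplier. The sketch below assumes a 333MHz bus clock and a multiplier range of 8 to 10, both picked purely for illustration, and compares whole multipliers against half-step ones.

```python
# Sketch: speed grades from bus clock x multiplier.
# Assumptions: 333.33MHz bus clock and multipliers 8 through 10 are
# illustrative picks, not a leaked product list.

def speed_grades(bus_mhz: float, lo: float, hi: float,
                 half_steps: bool) -> list[int]:
    """Core clocks in MHz (rounded) for every multiplier in [lo, hi]."""
    step = 0.5 if half_steps else 1.0
    grades = []
    mult = lo
    while mult <= hi:
        grades.append(round(bus_mhz * mult))
        mult += step
    return grades

BUS_MHZ = 1000 / 3  # the "333MHz" bus clock
whole = speed_grades(BUS_MHZ, 8, 10, half_steps=False)
half = speed_grades(BUS_MHZ, 8, 10, half_steps=True)
print(whole)
print(half)
```

With whole multipliers the grades sit 333MHz apart; half multipliers drop the gap to about 166MHz, which is where the tighter spread of parts comes from.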
Intel kept repeating the >3GHz mantra, but let a few things slip along with the performance numbers. For desktop, they mentioned 3.2GHz, and the demo today was done on a 3.33GHz part. With the performance lead Intel has, it doesn't make sense to increase clocks a lot until AMD puts out Barcelona. Expect modest clock boosts.
That, in a nutshell, is the Penryn family. It has a lot of things changed from Merom, and a lot of things that stayed the same. It is far from a dumb shrink, but nothing near a full redesign; think of it as a massaged Merom with a bunch of extra goodies. µ
