First you need to understand the problem, and that goes back to the 3W 486DX of the late 80s. Compare that to the 115W PD of last year: 30x or more power use in less than 20 years. Luckily for you and me, performance has gone up much more than that. Conroe has pushed the performance per watt (PPW) curve up by almost 100x since those 486 days.
Let's start out with a little math. Power used is proportional to the dynamic capacitance times the voltage squared times the frequency (C * V^2 * F). If you consider the dynamic capacitance to be a reflection of the architecture, you can see that dropping the voltage a little will overwhelm any minor changes you can make to the chip itself. Voltage also overwhelms frequency when it comes to power, because it is the squared term, but frequency is a lot easier to twiddle.
What it comes down to is that the chip architecture isn't likely to change much without huge cost, voltage is relatively fixed at a given frequency, and frequency can be twiddled a lot. The nice thing is that within certain limits, you can drop the voltage along with the frequency, and that has a huge payback. In essence, if you make a chip 10% slower and drop the voltage to match, you end up with a CPU that consumes 0.9^3, or about 73%, of the power. If you ask an architect to get you the same decrease from losing 10% of the performance, they will look at you like you are crazy.
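To put numbers on that, here is a minimal Python sketch of the C * V^2 * F relation. The function name and its parameters are made up for illustration, and it assumes the simplification used above, that voltage can be dropped in step with frequency. Real chips only allow that within a limited range.

```python
def relative_power(freq_scale, voltage_scale=None, capacitance_scale=1.0):
    """Dynamic power relative to baseline, from P ~ C * V^2 * F.

    Assumption: if voltage_scale is not given, voltage tracks frequency
    one-for-one, which only holds within a limited operating range.
    """
    if voltage_scale is None:
        voltage_scale = freq_scale
    return capacitance_scale * voltage_scale ** 2 * freq_scale


# 10% slower clock with a matching 10% voltage drop
both = relative_power(0.9)
# 10% slower clock with the voltage left alone
freq_only = relative_power(0.9, voltage_scale=1.0)

print(f"clock and voltage down 10%: {both:.3f} of baseline power")   # 0.729
print(f"clock only down 10%:        {freq_only:.3f} of baseline power")  # 0.900
```

Dropping the clock alone only gets you a linear saving. The cubic payoff comes entirely from the voltage following it down.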
There are many ways you can use architecture to go some of this way, and most of those showed up in Banias and Merom, with improvements along the way between them. Merom made things wider, from the three-issue design of Banias to the four-issue design of Core 2 The Elder. It also uses Macro- and Micro-Op Fusion to fuse multiple instructions into one.
If you take two instructions and mash them into one for execution, you only need to use one execution unit to do it. You can take the second unit that was going to be occupied and use it for another op, increasing performance, or you can power that unit down and save wattage, depending on your goals. If you combine them a bit and drop the overall clock while reducing voltage, instead of a little win for a turned-off unit, you get the ^3 power factor savings from downclocking and downvolting.
One other way to scale this up is, instead of having a single big fast core (see the P4, for example), to have four simpler cores running much slower for the same transistor count and die space. If your workload is amenable, you will end up with massive power savings with four cores.
You can also do power tricks with four cores similar to the execution unit tricks on the microarchitectural level. If you are at 10% workload, you can turn off three cores and have the fourth running at half speed. This will most likely end up using far less power than a big core with as much of it as possible shut down.
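As a rough illustration of that light-load case, the sketch below applies the same cubic rule per core. It is a toy model under loud assumptions: all four cores are identical, a gated core draws nothing (real silicon still leaks), and voltage can follow the clock all the way down to half speed, which real parts may not allow. The function name is hypothetical.

```python
def chip_power(core_freq_scales, per_core_peak=1.0):
    """Total dynamic power for a multi-core chip.

    Each entry is one core's clock as a fraction of full speed; 0.0 means
    the core is gated off. Assumes power scales with the cube of the clock
    (voltage tracking frequency) and that gated cores draw zero.
    """
    return sum(per_core_peak * scale ** 3 for scale in core_freq_scales)


all_on = chip_power([1.0, 1.0, 1.0, 1.0])      # four cores flat out
light_load = chip_power([0.5, 0.0, 0.0, 0.0])  # one core at half speed, three off
fraction = light_load / all_on

print(f"light-load power is {fraction:.1%} of the four-core peak")  # 3.1%
```

The exact figure matters less than the shape: gating cores cuts power linearly, and slowing the survivor cuts it by the cube.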
In an ironic twist, Intel showed a four-core die and said that it could clock all cores independently. This was a 'future' idea that no one has commercially implemented yet, and it was a good thing. Oddly, at Spring Processor Forum, AMD showed that Barcelona did just that, and I have it on good authority that AMD will be going into great depth on this topic in an hour. Far future is 60 minutes or so.
That brings us to the main waster of power in modern systems, and it is not the CPU but the PSU. A modern power supply is about 75% efficient, and that means about half of the power wasted in a system comes from there. The closer you can keep the PSU to running in its most efficient range, the better off you will be. This means a CPU feeding power use, heat and load information to the PSU can have huge wins. Intel is doing much of this already, from thermal sensors on the die to DBS (Demand Based Switching) on mobile and now desktop chips.
On top of that, there are several initiatives to bump the PSU efficiency to 80% or more, with 90% achievable now for a price. One of the more interesting ways to do this is a load adaptive power supply. Basically, instead of a 720W server PSU, you have three 240W PSUs that turn on sequentially. The higher the load on a PSU, the more efficient it is, so you end up with multiple small units each running flat out or turned off.
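The staging logic can be sketched in a few lines of Python. This is only an illustration of the idea, not how any shipping supply does it: the 240W and three-module figures come from the example above, while the function names and the choice to keep one module on at zero load are assumptions. It also makes no claim about actual efficiency curves and only shows how hard each active module is being worked.

```python
import math

MODULE_WATTS = 240   # size of each small PSU, from the example above
MODULE_COUNT = 3     # three modules add up to the 720W monolithic unit


def modules_needed(load_watts):
    """Number of modules to switch on for a given load.

    Assumption: at least one module always stays on, even at zero load.
    """
    capacity = MODULE_WATTS * MODULE_COUNT
    if load_watts < 0 or load_watts > capacity:
        raise ValueError(f"load must be between 0 and {capacity}W")
    return max(1, math.ceil(load_watts / MODULE_WATTS))


def staged_utilization(load_watts):
    """Load as a fraction of the capacity that is actually switched on."""
    return load_watts / (modules_needed(load_watts) * MODULE_WATTS)


def monolithic_utilization(load_watts):
    """Load as a fraction of a single 720W supply."""
    return load_watts / (MODULE_WATTS * MODULE_COUNT)


for load in (100, 300, 700):
    print(f"{load}W: {modules_needed(load)} module(s) on, "
          f"staged {staged_utilization(load):.0%} loaded vs "
          f"monolithic {monolithic_utilization(load):.0%}")
```

At 100W the single active 240W module is working far harder than a 720W unit would be, which is exactly where the staged design should pay off.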
What it comes down to is that there is not a single magic bullet, and the gains will come a little at a time from all over. Some things add more than others, but to an extent, they all work together. Nothing is free, however, and work will continue for the foreseeable future to improve PPW. µ