If it was just a Pentium M variant I don't think there'd be such a fuss about it. Intel is portraying this as the biggest change since the original P4, yet there have been several new cores introduced since then including the Pentium M itself. No, this change is bigger.
The change is so big in fact, it's the reason for Apple's processor switch. Indeed the phrase given when Steve Jobs announced the switch, "performance per watt" is the very same phrase being used by Intel spokesmen.
All we know is it's going to be multi-core, it's also going to be 64 bit and support hyper threading. The problem is trying to do all this at the same time isn't going to reduce power consumption, in fact doing all this means power consumption is more likely to increase.
There are ways to decrease power consumption but many of these seem to have been already used in the Pentium M series, they can go further but IBM has already gone beyond this in the Cell and XBox360's PowerPC cores. Perhaps Intel is planning something rather more radical.
The only hint is some comments from Intel apparently saying the processor will be structurally different but will have no problems running the same apps. When has Intel ever had to say this? It can normally be assumed a new core will run the same apps - unless of course, it's radically different.
So, what is Intel up to?
According to the Apple announcement, the reason it is switching is "performance per watt". Steve Jobs showed a graph with PowerPC projected at 15 computation units per watt and Intel projected at 70 units per watt. For the same amount of work, Intel must have figured out a way to reduce power consumption more than four-fold. How? Can this even be done?
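The arithmetic behind that claim can be checked quickly. This is only a sketch using the two figures quoted from the graph; the function name is made up for illustration and "computation units" is whatever unit the graph used.

```python
def compare_perf_per_watt(old_units_per_watt: float, new_units_per_watt: float):
    """Return (improvement ratio, fraction of the old power needed for the same work)."""
    ratio = new_units_per_watt / old_units_per_watt
    power_fraction = old_units_per_watt / new_units_per_watt
    return ratio, power_fraction


# Figures quoted from the keynote graph: PowerPC at 15, Intel at 70 units per watt.
ratio, power_fraction = compare_perf_per_watt(15, 70)
print(f"improvement: {ratio:.2f}x, power for same work: {power_fraction:.1%}")
```

In other words, for the same work the projected Intel part would need a little over a fifth of the power.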
Yes, it can be done but it requires striking changes in the processor design. The forthcoming Cell processor's SPEs at 3.2 GHz use just two to three Watts and yet are said to be just as fast as any desktop processor. I think we can safely assume a future Intel device will not use SPEs instead of x86 processors but they could use some of the same techniques to bring the power consumption down.
Modern microprocessors throw millions of transistors at producing increasingly small performance boosts. The SPEs' designers didn't do this, they only used transistors if they could be shown to produce a large performance boost. The result is in essence the antithesis of modern microprocessor design, the SPEs are very simple with a relatively short pipeline, strictly in-order execution and no branch prediction.
An extremely stripped back x86 design can be done, and has been, but performance doesn't so much suffer as get tortured to death. Out of order execution seems to be pretty critical to x86 performance, most likely due to the small number of architectural registers. Then there is the x86 instruction decoder which on simple processors takes up a significant amount of room and of course consumes power. Even the stripped back designs can't remove this.
However, there was one company which took a more radical approach and while its processor wasn't exactly blazing fast it was faster than those using the stripped back approach, what's more it didn't include the x86 instruction decoder. That company was Transmeta and its line of processors weren't x86 at all, they were VLIW (Very Long Instruction Word) processors which used "code morphing" software to translate the x86 instructions into their own VLIW instruction set.
Transmeta, however, made mistakes. During execution, its code morphing software would have to keep jumping in to translate the x86 instructions into their VLIW instruction set. The translation code had to be loaded into the CPU from memory and this took up considerable processor time lowering the CPU's potential performance. It could have solved this with additional cache or even a second core but keeping costs down was evidently more important. The important thing is Transmeta proved it could be done, the technique just needs perfecting.
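The general idea of translating a block of x86 code once and reusing the result can be sketched as below. This is a toy illustration, not Transmeta's actual code morphing software: the "translation" is just a string rewrite, and every name here (translate_block, TranslationCache) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A "block" is a tuple of x86-like mnemonics; purely illustrative.
Block = Tuple[str, ...]


def translate_block(block: Block) -> List[str]:
    """Stand-in for the expensive step: turn x86 ops into made-up native ops."""
    return [f"vliw:{op}" for op in block]


class TranslationCache:
    """Translate each distinct block once, then serve it from the cache."""

    def __init__(self, translator: Callable[[Block], List[str]] = translate_block):
        self._translator = translator
        self._cache: Dict[Block, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, block: Block) -> List[str]:
        native = self._cache.get(block)
        if native is None:
            # First encounter: pay the translation cost and remember the result.
            self.misses += 1
            native = self._translator(block)
            self._cache[block] = native
        else:
            self.hits += 1
        return native


# A loop body executed 1000 times is translated only on the first pass.
cache = TranslationCache()
loop_body: Block = ("add", "cmp", "jne")
for _ in range(1000):
    cache.fetch(loop_body)
print(cache.misses, cache.hits)
```

The cost Transmeta paid was in the miss path; the bigger and closer the store holding translated blocks and the translator itself, the less often that path stalls the processor.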
Intel on the other hand can and do build multicore processors and have no hesitation in throwing on huge dollops of cache. The Itanium line, also VLIW, includes processors with a whopping 9MB of cache. Intel can solve the performance problems Transmeta had because this new processor is designed to have multiple cores and while it may not have 9MB it certainly will have several megabytes of cache.
Intel likes to call its technique "EPIC" instead of VLIW but it's the same thing really.
Intel can make a VLIW processor with a large number of small, low power cores and devote one or more of these to translating x86 to the VLIW ISA, they will partly hold the translation software in the bigger cache so it'll rarely need to hit RAM. It could even do this with a dedicated thread per core but that'll need a big shared cache.
Intel has a lot of experience of VLIW processors from its Itanium project which has now been going on for more than a decade. Intel also now has HP's expertise on board as HP's entire Itanium design team was recently transferred to Intel.
Another technology Intel has access to is DEC's FX!32. This was written in the mid 1990s and allowed X86 software to run on Alpha RISC microprocessors. A lot of the Alpha people and technology were transferred to Intel and FX!32 most likely went with them, indeed Intel has already been developing similar technology to run X86 binaries on Itanium for quite some time now.
It gets better. Both the Itanium and the Transmeta designs were said to be inspired by VLIW designs built in Russia by a company called Elbrus. Intel did a deal with Elbrus in mid 2004 then went on to buy the company in August 2004. The exact nature of the deal is unclear, however, as another company continued and taped out the E2K processor earlier this year.
Most interesting though is the E2K compiler technology which allows it to run X86 software. This is exactly the sort of technology Intel needs, and since last year it has had access to it and employs many of its designers.
So, Intel has access to VLIW technology from the Itanium and HP as well as the translation software from DEC. Most importantly it has the highly advanced technology from Elbrus which has been in development since the 1980s.
The New Architecture
To reduce power you need to reduce the number of transistors, especially ones which don't provide a large performance boost. Switching to VLIW means they can immediately cut out the hefty X86 decoders.
Out of order hardware will go with them as it is huge, consumes masses of power and in VLIW designs is completely unnecessary. The branch predictors may also go on a diet or even get removed completely as the Elbrus compiler can handle even complex branches.
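Out of order hardware becomes unnecessary in a VLIW design because the ordering is worked out ahead of time in software. A minimal sketch of that idea follows, assuming a simple greedy scheduler that packs independent operations into fixed-width bundles. It is not the Elbrus or Itanium compiler's algorithm, and the function name and the default four-wide bundle are assumptions for illustration only.

```python
from typing import Dict, List, Set


def schedule_bundles(order: List[str], deps: Dict[str, Set[str]], width: int = 4) -> List[List[str]]:
    """Greedily pack operations into bundles of at most `width` operations.

    An operation may only enter a bundle once everything it depends on was
    issued in an earlier bundle, so operations within a bundle are independent.
    """
    done: Set[str] = set()
    remaining = list(order)
    bundles: List[List[str]] = []
    while remaining:
        bundle: List[str] = []
        for op in remaining:
            if len(bundle) == width:
                break
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic or unsatisfiable dependency")
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
        bundles.append(bundle)
    return bundles


# c needs a and b, d needs c; e and f are independent of everything.
deps = {"c": {"a", "b"}, "d": {"c"}}
print(schedule_bundles(["a", "b", "c", "d", "e", "f"], deps))
# prints [['a', 'b', 'e', 'f'], ['c'], ['d']]
```

The hardware then simply issues each bundle as given, which is the work an out of order engine would otherwise do at run time with a great many transistors.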
With the X86 baggage gone the hardware can be radically simplified - the limited architectural registers of the x86 will no longer be a limiting factor. Intel could use a design with a single large register file covering integer, floating point and even SSE, 128 x 64 bit registers sounds reasonable (SSE registers could map to 2 x 64 bit registers).
Rumours suggesting the cores will be four issue wide sound perfectly reasonable for a VLIW processor. At least two (Hyper)threads will almost certainly be supported but more would require more registers not to mention giving them something of a naming problem - Ultra-hyper-threading?
You can of course expect all these cores to support 64 bit processing and SSE3, you can also expect there to be lots of them. Intel's current Dothan cores are already tiny but VLIW cores without out of order execution or the large, complex, x86 decoders leave a very small, very low power core. Intel will be able to make processors stuffed to the gills with cores like this.
One interesting aspect of an architecture like this is it gives Intel the ability to learn from it and change it in a way X86 never could.
Changing the basic X86 design would lead to all sorts of difficulties with compatibility so instead, over the years more and more has been added and little if anything removed.
Intel will now be free to do as it pleases: with X86 decoding done in software, Intel can change the hardware at will. If the processor is weak in a specific area the next generation can be modified without worrying about backwards compatibility. Apart from the speedup nobody will notice the difference. It could even use different types of cores on the same chip for different types of problems.
One thing I do not expect is the new core to be an Itanium derivative, it was not designed for low power. Building a new ISA gives Intel a chance to learn the lessons of the sometimes erratic performance of the Itanium. Not that we'll see the new ISA, this will be hidden from developers underneath the software translation layer. A variant of this device could end up badged as an Itanium though, the software translation should have no trouble converting one VLIW variant to another.
How Fast Will It Be?
Like the Transmeta devices, software will not run at its full potential until it's been fully translated, and you can pretty much bet Intel will make sure third party bench-markers are made well aware of this. I suspect we may also see speculative translation running in the background so everything gets translated and saved as soon as possible. Once translated, the new binaries are saved to disc and will run as native VLIW thereafter.
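Saving translations so they survive between runs could look something like the sketch below. It is purely illustrative: the translator is a placeholder, the ".vliw" file naming and the function names are invented here, and nothing is known about how Intel would actually store translated binaries.

```python
import hashlib
from pathlib import Path
from typing import Callable, Tuple


def fake_translate(x86_code: bytes) -> bytes:
    """Placeholder for the real (slow) x86 to VLIW translation."""
    return b"VLIW:" + x86_code


def load_or_translate(
    x86_code: bytes,
    cache_dir: Path,
    translate: Callable[[bytes], bytes] = fake_translate,
) -> Tuple[bytes, bool]:
    """Return (native code, came_from_disc).

    The original binary's hash is used as the file name, so a changed
    binary is simply translated again under a new name.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha256(x86_code).hexdigest() + ".vliw")
    if path.exists():
        return path.read_bytes(), True
    native = translate(x86_code)
    path.write_bytes(native)
    return native, False
```

The first call for a given binary pays the translation cost and writes the result out; later calls with the same bytes read the saved copy back, which is the behaviour a benchmark would need to account for.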
The forte of this processor will be multithreaded code and multitasking. If you are doing lots of things at once you'll be well happy, servers in particular will benefit from this approach. Multitasking will benefit because different cores will get different tasks, a user switching between them will not cause them to halt so responsiveness of systems with this processor will be very good.
Single threaded performance on the other hand could be relatively weak although that's not a given, I expect AMD will hold on to its crown in single threaded performance for now.
Based on the various comments and actions of Intel, as well as other companies, I think Intel is preparing to announce a completely new VLIW processor which uses software to decode x86 instructions and order their execution. It might be relatively weak on single threaded code but it'll more than make up for it in numbers, heavily multithreaded code should run very nicely indeed.
We'll see shortly if my speculation is correct. However, multiple processor vendors are already going in the same direction with a large number of simple cores. X86 hardware implementations don't lend themselves to the simplicity required for large multicore devices, whereas a VLIW approach has already been shown to be workable whilst reducing both power consumption and size.
Historically, Intel has often adopted new techniques after they've been used by other vendors. Its real strength is taking those ideas, improving them then mass manufacturing them.
I expect Intel will apply its full manufacturing skills to this device - this processor could have as many as 16 cores.
To date, Apple's CPU switch to Intel has prompted a lot of speculation about the real reason as frankly, it didn't make much sense. But if this speculation turns out to be true, the reasons behind Apple's switch are obvious. µ
© Nicholas Blachford 2005