Then, around '91-'92, came processors like the MIPS R4000, the Alpha 21064 and the Pentium, which were among the first general-purpose "superscalar" models, where two instructions were processed - fetched, executed, retired - per CPU cycle.
Later, with the Alpha 21164 in 1995, we had the first CPU doing four instructions per cycle, all at 300MHz, a mind-boggling clock for the time. The Microprocessor Forum presentation at which the 21164 was introduced was like a wet dream for half the audience, and a living nightmare for the other half - mostly the Intel and IBM gang, not to mention the Sun SPARClers. Too bad it was Alpha that lost in the end.
Anyway, all these guys were doing things nicely and regularly, in order - as the program instruction flow went, the opcodes were fetched, executed and retired, in blocks of two or four.
Now, different instructions require different execution resources and, more often than not, the program flow will be such that execution has to wait for a resource to be freed, or a dependency to be resolved, before proceeding with the next instruction. Worse, with every new CPU generation, the code would need to be recompiled to optimise for the new CPU, otherwise you'd risk way too many idle-time bubbles, leading to a smaller performance benefit and so less of an advantage over the competition.
That's where out-of-order execution came in. The CPU hardware itself reorders the instructions after fetching, according to the available resources, with more execution units provided, registers renamed, dependencies taken care of and so on. So, almost all new CPUs from the Pentium Pro and Alpha 21264 onward were of the out-of-order type.
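To make the difference concrete, here is a toy issue model, a minimal sketch rather than a description of any real chip. Everything in it is an assumption made for illustration: the `Instr` tuple, the `simulate` function, the 3-cycle "load" latency, an unlimited reorder window, no limits on execution units, and each register being written only once (so renaming never comes into play).

```python
import math
from collections import namedtuple

# Hypothetical instruction record: destination register, source registers,
# and the number of cycles until the result becomes available.
Instr = namedtuple("Instr", "dest srcs latency")


def simulate(program, width, in_order):
    """Return the cycle in which the last result of `program` is ready."""
    # Registers produced by the program are unavailable until issued;
    # anything else is treated as an input that is ready at cycle 0.
    ready_at = {ins.dest: math.inf for ins in program}
    pending = list(program)
    cycle = 0
    last_done = 0
    while pending:
        issued = []
        for ins in pending:
            if len(issued) == width:
                break
            if all(ready_at.get(src, 0) <= cycle for src in ins.srcs):
                issued.append(ins)
            elif in_order:
                break  # a stalled instruction blocks everything behind it
        # Results only become visible after this cycle's issue decisions.
        for ins in issued:
            ready_at[ins.dest] = cycle + ins.latency
            last_done = max(last_done, cycle + ins.latency)
            pending.remove(ins)
        cycle += 1
    return last_done


# Two independent load-then-use pairs, written in "naive" program order.
naive = [
    Instr("r1", ("a",), 3),   # slow load
    Instr("r2", ("r1",), 1),  # uses the first load
    Instr("r3", ("b",), 3),   # second, independent slow load
    Instr("r4", ("r3",), 1),  # uses the second load
]
# The same work, reordered by hand the way a tuned compiler might do it.
scheduled = [naive[0], naive[2], naive[1], naive[3]]

in_order_naive = simulate(naive, width=2, in_order=True)          # 7
out_of_order_naive = simulate(naive, width=2, in_order=False)     # 4
in_order_scheduled = simulate(scheduled, width=2, in_order=True)  # 4
```

In this toy case the out-of-order machine hides the second load behind the first without help, while the in-order machine only catches up once the code has been rearranged for it, which is the recompilation burden described above.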
The speedups were good in many cases, and tremendous in some. The 21264 was nearly twice as fast as the 21164, while the Pentium Pro was also quite a bit ahead of the Pentium. The last major MIPS revision, the R10000, was also out of order.
Things continued fine until a brand new thingie came out of Satan Clara. The good ship Itanic had a truly unique engine of EPIC significance. Putting aside all the instruction grouping, the 100-plus instruction format combinations, the huge slow register sets and so on, it was, basically, back In Order. So, again, it was the compiler that had to do all the crafty work to ensure the execution units were kept busy. That, excepting FP-intensive apps, wasn't exactly easy, judging by the Itanium system benchmarks.
Throughout its refreshes, this aspect of the Itanic architecture never changed. Meanwhile, Sun at one point gave in to Fujitsu's out-of-order SPARC64 over its own in-order UltraSparc IV. All other major architectures (read: X86) stayed out of order, with new engines like Core 2 and K10 just further enhancing this approach to squeeze more out of each MHz.
Isn't POWER the other major architecture? Yes it is, at least if you need AIX for some reason. POWER4 and POWER5 were fast yet complex out-of-order RISC machines, combining 4-way superscalar execution and very high system bandwidth. However, Power6 (the new spelling) is going back to the in-order times. Why?
One answer is that, if its simultaneous multithreading is effective, there's less concern over a single thread wasting the execution resources: in such a case, simply run two threads simultaneously. Also, to reach a further massive performance jump, things like twice the frequency, a doubled secondary cache and improved ALU latency could take priority. Even then, there is some out-of-order capability left in the FP portion - the part of the CPU that, for the first time in general-purpose processors, has a full decimal FPU too!
"Simultaneous dual-threaded execution, load lookahead, and enhanced data and instruction prefetch capabilities drive the performance of the in-order superscalar cores." That's what IBM says about the new chips.
The 5-way out-of-order POWER5+ execution was replaced by 7-way in-order execution in Power6, but even then, there's a catch: one thread can have a maximum of five instructions per cycle, while the other thread adds the remaining two. This is fine for, say, a combination of computational and memory-search threads: one is more focused on internal resources, while the other waits for the memory most of the time anyway, so two ops per cycle is more than sufficient.
What do you get out of it performance-wise? Power5+ at 2.2GHz reached 14.9 in specfp2006, while Power6 reaches 22.3 at 4.7GHz - though, of course, in an adapted Power5 machine. In summary, just under half additional speed at well over twice the clock.
So, the 790 million transistors of Power6, spread over a comparatively huge 341mm² - more than the 283mm² of Barcelona/Agena, and just a bit less than the huge bulk of Itanic the cruiser - do give quite a bit of extra oomph despite the loss of out-of-orderliness. But, knowing that the cache and memory bandwidth both went up in sync with the clock speed increase, I'd say that going back to in-order has resulted in about a 30 per cent performance loss clock for clock on single-thread tasks.
We of course have to wait for newer, Power6-native systems, as well as compiler improvements in the next AIX revs, to reduce this loss. Nevertheless, in the case of Power6, standing In Order did pay off, as the overall gain is there, obviously, looking at the world's highest clock - not Big Ben, but GHz frequency in a CPU. That same In Order policy hasn't paid off - at least, not yet - in the Itanic's case, and it will probably never see the light of day again in the X86 world, where the architecture underpinnings are still - let's be frank - as crappy as any CPU platform could ever be.
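For anyone who wants to check the clock-for-clock figure, the arithmetic can be sketched in a few lines. The scores and clocks are the ones quoted above; the variable and function names are just illustrative, and the comparison leans on the simplifying assumption that a score would otherwise scale linearly with clock speed.

```python
# specfp2006 scores and clock speeds as quoted in the text.
power5_plus = {"clock_ghz": 2.2, "specfp2006": 14.9}
power6 = {"clock_ghz": 4.7, "specfp2006": 22.3}


def score_per_ghz(chip):
    """Benchmark score delivered per GHz of clock (a crude per-clock metric)."""
    return chip["specfp2006"] / chip["clock_ghz"]


# Overall speedup and clock ratio between the two generations.
speedup = power6["specfp2006"] / power5_plus["specfp2006"]    # about 1.50
clock_ratio = power6["clock_ghz"] / power5_plus["clock_ghz"]  # about 2.14

# Per-clock performance of Power6 relative to Power5+, and the implied loss.
per_clock_ratio = score_per_ghz(power6) / score_per_ghz(power5_plus)
per_clock_loss = 1 - per_clock_ratio                          # about 0.30

print(f"speedup: {speedup:.2f}x at {clock_ratio:.2f}x the clock")
print(f"per-clock loss: {per_clock_loss:.0%}")
```

Roughly 1.5 times the score at about 2.14 times the clock works out to about 70 per cent of the per-GHz throughput, which is where the 30 per cent estimate comes from.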
Having said all that, the Power6 gang shouldn't sit idle. The 3.6GHz Harpertown "Penryn" and 3GHz Barcelona will, in a quarter or two, pose serious performance challenges to the newest IBMer on the block - at least in synthetic benchmarks, and out-of-orderliness does matter there. IBM also has to bear in mind that, with In Order machines like this, more compiler work will be needed as each successive CPU generation comes out, with changed instruction issue and execution resources. Would everyone have time to recompile the apps? µ