Anyone who has compared the Hammer and K7 pipelines will note that the Hammer pipeline has two additional stages.
The two additional stages consist of a second instruction decoder stage and a buffer stage for decoded micro-ops. Others and I have previously hypothesized that the extra stages exist because each pipeline stage has only one clock cycle to complete its task, and the K7's longer stages were holding back its clock speed. In short, we assumed the extra stages were there to help Hammer run at higher clock speeds.
One thing bothered me, however. The TLB buffer in the Hammer pipeline doesn't really make sense: the penalty of that extra pipeline stage would seem to far outweigh its few benefits. So I thought about it, pondering what purpose Hammer would really have for these two additional stages, and I stumbled onto a theory of my own.
It is no secret that poor instruction parallelism is one of the problems holding IA32 CPUs back. You see, the Athlon can theoretically decode and execute three instructions per clock cycle, but poor instruction parallelism keeps it closer to two instructions per clock cycle in most instances. It is very rare that all three execution slots are full.
Intel's approach to reducing the penalty of poor instruction parallelism is HyperThreading. Rather than executing a single application thread at any given moment and letting the operating system shuffle through threads, allocating CPU time to one thread at a time, HyperThreading allows two threads to occupy the processor pipeline simultaneously. This ensures that more of the execution slots are full more often. The downside, however, is that only multi-threaded applications benefit much from HyperThreading, although a system running many applications simultaneously also benefits to a degree.
It seems to me that a better approach to reducing the penalty of poor instruction parallelism would be simply to decode more instructions per clock cycle. One way to attempt that would be to add additional execution slots and build a decoder capable of decoding, say, four instructions per clock cycle. The problem with this approach is that the fourth execution slot would be filled even less often than the third, so you would gain very little performance for the additional die area and transistor count. A better approach would be to somehow "double-pump" the decoder, that is, run it at twice the speed of the rest of the CPU.
How would you theoretically go about doubling decoder speed? You would need to double the number of decoder stages, effectively giving the CPU twice as many clock cycles to decode an instruction into micro-ops. In the Athlon's case you would increase the number of decoder stages from one to two. You would then need to reengineer the cache system to interface with a double-speed decoder. You would also need to engineer a buffer to let the decoder interface with the ALUs and FPUs, since they run at half the decoder's clock frequency.
Hammer has two decoder stages and a reengineered cache system. It also has a buffer stage between the decoder and the ALUs and FPUs.
So how much would this improve performance? A double-pumped decoder would likely allow Hammer to decode four instructions per clock cycle much of the time, and rarely fewer than three, while Hammer retains the ability to execute three instructions per clock cycle. Since the Athlon averages closer to two instructions per clock cycle, you would be looking at approximately 50% more instructions per clock cycle.
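To make that arithmetic concrete, here is a toy model of my own devising, not anything AMD has published: a decoder clocked at twice the core rate feeds a buffer, and the execution units drain up to three micro-ops per core clock. The figure of two decodable instructions per decoder pass is an assumption standing in for the Athlon's parallelism-limited average.

```python
# Toy sketch (my assumptions, not AMD's design): a double-pumped decoder
# feeds a micro-op buffer that the execution units drain each core clock.
from collections import deque

def simulate(core_cycles, exec_width=3, pump=2):
    buffer = deque()
    executed = 0
    for _ in range(core_cycles):
        # The decoder gets 'pump' passes per core cycle; poor instruction
        # parallelism means each pass finds only ~2 decodable instructions
        # (modelled as a constant here for simplicity).
        for _ in range(pump):
            buffer.extend(["uop"] * 2)
        # The execution units retire at most exec_width micro-ops per cycle.
        n = min(exec_width, len(buffer))
        for _ in range(n):
            buffer.popleft()
        executed += n
    # NB: a real design would stall the decoder once the buffer fills;
    # it is left unbounded here to keep the sketch short.
    return executed / core_cycles  # average instructions per core clock

print(simulate(1000, pump=1))  # single-pumped: 2.0 IPC (decode-limited)
print(simulate(1000, pump=2))  # double-pumped: 3.0 IPC (execution-limited)
```

The jump from 2.0 to 3.0 instructions per clock is exactly the roughly 50% gain estimated above: with decode no longer the bottleneck, throughput is capped by the three execution slots instead.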
What performance drawbacks would two additional pipeline stages create, and how might they be overcome? The first drawback is one that was highly publicized at the Pentium 4's introduction: branch misprediction penalties. The more stages a pipeline has, the greater the misprediction penalty. Adding two stages to the Athlon's ten-stage pipeline would increase the branch misprediction penalty by 20%. The solution is quite obvious: engineer a more accurate branch predictor so that mispredictions are less frequent. Hammer has a new branch predictor, this we know. The second drawback is execution latency, the number of clock cycles it takes to decode and execute an instruction, and the negative effect that high latency has on dependent instructions, that is, instructions which cannot execute until previous instructions have completed. Even though an instruction enters and exits the pipeline every clock cycle, it takes a 20-stage pipeline 20 clock cycles to decode and execute a particular instruction, and a 10-stage pipeline 10 clock cycles. Likewise, 12 clock cycles elapse from the time an instruction enters a 12-stage pipeline until its results exit.
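The 20% figure follows directly from the stage counts: going from the Athlon's 10 stages to a hypothetical 12 means a mispredicted branch flushes and refills two extra stages. A quick sketch of the arithmetic:

```python
# Misprediction penalty grows in proportion to pipeline depth: a flush
# costs roughly one clock per stage that must be refilled.
def penalty_increase(old_stages, new_stages):
    """Fractional increase in branch misprediction penalty."""
    return (new_stages - old_stages) / old_stages

print(f"{penalty_increase(10, 12):.0%}")  # prints 20%
```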
The solution to this is quite obvious as well: use a more advanced manufacturing process to run the CPU at a 20% higher clock speed. We know Hammer will be manufactured on a silicon-on-insulator (SOI) process, and SOI is supposed to allow transistors to switch as much as 30% faster than traditional bulk CMOS transistors at the same scale.
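The clock-speed argument can be checked with simple arithmetic: if each stage takes one clock, wall-clock latency is stages divided by frequency, so a 12-stage pipeline clocked 20% faster matches a 10-stage pipeline's latency. The frequencies below are hypothetical, chosen only to illustrate the ratio:

```python
# Wall-clock decode-and-execute latency, assuming one stage per clock.
def latency_ns(stages, freq_ghz):
    return stages / freq_ghz  # result in ns, since GHz = cycles per ns

print(latency_ns(10, 2.0))  # 10 stages at a hypothetical 2.0 GHz: 5.0 ns
print(latency_ns(12, 2.4))  # 12 stages at a 20% higher clock: also 5.0 ns
```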
So is this what Hammer will be? I can't claim to know that, but I suspect it, which is why I wrote this. Could I be dead wrong? Absolutely! I'm admitting up front that I may be wrong, so I don't want any flame mail finding its way to my inbox. I wouldn't mind some intelligent feedback, though. µ