EV8 was the widest general-purpose CPU design ever, issuing eight out-of-order instructions per cycle per core. The EV9 follow-on would have added a 16-wide, ~100 GFLOPs vector unit, plus provision for multi-core operation. After the Alphacide, no new designs emphasised this superwide instruction issue approach, as multicore and the GHz clock fight remained the focus. Even the most advanced designs for 2007, like Intel Penryn, AMD Barcelona and IBM POWER7, are all still four-issue per core.
If you look at Intel's teraflop chip shown at Beijing IDF, it also adheres to the many-core approach - in that case, 80 cores - without widening the core girth. Pat Gelsinger needed to pump it up to above 4GHz to reach the celebrated 2TFLOP peak on stage. These days, only GPU vendors cherish ultrawide issue and execution - per GPU core, if you will - in the PC arena.
But now the University of Texas at Austin has not just come back to the wide approach, but has gone way beyond it.
The TRIPS (Tera-op, Reliable, Intelligently-adaptive Processing System), aimed at speeding up industrial, consumer and scientific computing - and military, one may assume, since DARPA is the investor - is the brainchild of professors Stephen Keckler, Doug Burger and Kathryn McKinley and their 30-member team which has worked on the design over the past seven years. The working chip will be shown at the campus on 30 April.
TRIPS uses a new parallel processing approach - Explicit Data Graph Execution (EDGE). Instead of one instruction at a time, EDGE aims to handle large blocks of information all at once. Using many copies of a small number of replicated tiles, you can design the core at almost any width, yet reduce the complexity for easier design, according to the boffins, that is.
The first pilot TRIPS chip has two CPU cores, each issuing 16 instructions per cycle with up to 1,024 instructions in flight simultaneously - respectively four and six times the figures for the best current general-purpose CPUs. Composable processors can be constructed by aggregating homogeneous processor tiles. Each CPU core can be configured in either single-threaded mode or a four-thread, multithreaded mode - again, remember the EV8 Alpha? The distributed microarchitecture's tiles communicate via control and operand networks, yet disparate tiles can act cohesively as a single high-performance processor.
The whole monster is fed through a scalable on-chip multi-bank 1MB L2 static non-uniform cache access (NUCA) memory system, composed of 16 64KB memory tiles interconnected via an on-chip network fabric. Each memory tile can act as cache or as a part of physical memory. Two DMA controllers, two DDR channels, and a glueless interconnect complete the picture.
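To make the memory-tile arithmetic concrete, here is a minimal Python sketch of how an address might be spread across the 16 tiles. Only the tile count and the 64KB tile size come from the figures above; the 64-byte line size and the simple line-interleaved mapping are assumptions for illustration, not the documented TRIPS scheme.

```python
NUM_TILES = 16            # from the article: 16 memory tiles
TILE_BYTES = 64 * 1024    # from the article: 64KB per tile
LINE_BYTES = 64           # assumption: hypothetical cache-line size


def total_capacity():
    """Total capacity of the tiled L2 in bytes."""
    return NUM_TILES * TILE_BYTES


def locate(addr):
    """Map a byte address to (tile, offset within tile).

    Assumed policy: consecutive lines go to consecutive tiles,
    wrapping around once every tile has received a line.
    """
    addr %= total_capacity()
    line = addr // LINE_BYTES
    tile = line % NUM_TILES
    offset = (line // NUM_TILES) * LINE_BYTES + addr % LINE_BYTES
    return tile, offset
```

Sixteen tiles of 64KB do add up to the quoted 1MB, and under the assumed interleaving, neighbouring lines land on different tiles, which is one way a banked design can spread traffic across its on-chip network.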
TRIPS' new EDGE ISA represents a radical departure from the usual instruction sets. It supports large graphs of computation mapped to a flexible hardware substrate, with instructions in each graph communicating directly with other instructions rather than going through a shared register file, which amortises any execution overheads over a large graph of instructions. Of course, there is a custom compiler to create atomic blocks of code from sequential C or Fortran programs, relying on a distributed execution substrate to reduce the communication latency and contention among the tiles.
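The idea of instructions feeding each other directly can be sketched in a few lines of Python. This is a toy model only: the `Instr` class, the `run_block` function and the target encoding are hypothetical names invented here, not the real TRIPS instruction format. Each instruction names the consumers of its result and fires as soon as both of its operands have arrived, with no shared register file in between.

```python
from collections import deque
import operator

# Toy opcode table (assumption: two-operand integer ops only).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class Instr:
    def __init__(self, op, targets):
        self.op = op
        # Each target is (instr_index, operand_slot) or ("out", name).
        self.targets = targets
        self.operands = {}


def run_block(instrs, inputs):
    """Execute one block in dataflow order.

    inputs: list of (value, target) pairs injected at block entry.
    Returns a dict of named block outputs.
    """
    outputs = {}
    ready = deque()

    def deliver(value, target):
        dest, slot = target
        if dest == "out":
            outputs[slot] = value
            return
        ins = instrs[dest]
        ins.operands[slot] = value
        if len(ins.operands) == 2:   # both operands present: fire
            ready.append(dest)

    for value, target in inputs:
        deliver(value, target)

    while ready:
        ins = instrs[ready.popleft()]
        result = OPS[ins.op](ins.operands[0], ins.operands[1])
        for target in ins.targets:
            deliver(result, target)
    return outputs


def difference_of_squares(a, b):
    """Compute (a + b) * (a - b) as a three-instruction block."""
    block = [
        Instr("add", [(2, 0)]),              # 0: a + b -> left input of 2
        Instr("sub", [(2, 1)]),              # 1: a - b -> right input of 2
        Instr("mul", [("out", "result")]),   # 2: product leaves the block
    ]
    inputs = [(a, (0, 0)), (a, (1, 0)), (b, (0, 1)), (b, (1, 1))]
    return run_block(block, inputs)["result"]
```

Calling `difference_of_squares(5, 3)` routes 8 and 2 straight into the multiply and yields 16. In this model the add and the subtract have no dependency on each other, so a wide machine could fire them in the same cycle; the queue here simply serialises what the hardware would do in parallel.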
The first TRIPS motherboard supports four TRIPS CPUs and 8GB DRAM. The first 500MHz prototype chip (see photo above) is done in an archaic IBM 0.13 micron process and achieves a modest 16 GFLOPs - scalable to half a teraflop with 32 chips in parallel. The 0.032 micron final chips in 2009 will aim for a far more wholesome 5 TFLOPs, a good target to have to seriously compete against Intel's Terascale and Larrabee chips.
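The prototype's numbers hang together if you do the sums. The short Python check below assumes, purely for illustration, that every one of the 16 issue slots per core completes one floating-point operation per cycle; the article's figures do not say how the 16 GFLOPs peak is counted, so treat `peak_gflops` as a hypothetical helper, not the designers' own formula.

```python
def peak_gflops(cores, issue_width, clock_ghz):
    """Peak rate, assuming one floating-point op per issue slot per cycle."""
    return cores * issue_width * clock_ghz


# Figures quoted in the article: 2 cores, 16-wide issue, 500MHz.
chip = peak_gflops(cores=2, issue_width=16, clock_ghz=0.5)

# 32 chips in parallel, assuming perfect scaling.
system = chip * 32

print(f"per chip: {chip} GFLOPs, 32 chips: {system} GFLOPs")
```

Under that assumption, two cores times 16 slots times 0.5GHz gives 16 GFLOPs per chip, and 32 chips give 512 GFLOPs, in line with the "half a teraflop" quoted.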
In summary, there is still a chance for non-x86 ISAs to surface and succeed, even in the PC. If the Austin gang can deliver, TRIPS is a good candidate for a very flexible accelerator platform at least and, who knows, maybe more than that. Since many kinds of accelerators are becoming a reality, attached through all kinds of links from PCI to HTX, CPU FSB and soon CSI, it is an opportune time to give new, potentially more efficient, architectures a chance to do the job. µ