I had a chat with Sun Fellow and Vice President Marc Tremblay about the upcoming chips, what they are, what they do, and where they will fit in. There are a lot of changes embodied in this family, and Sun refers to what it represents as disruptive threads, in the sense of what it will do to the marketplace, not to your software. Disruptive is a good thing, especially when you are talking about the price/performance curve.
Let's start out with the first chip in the Niagara family, called, strangely, Niagara. On a macro level, it will have eight cores, each core capable of running 4 threads in parallel, for 32 concurrently running threads. Each thread can be a process, or you can have one process running 32 threads; it is up to you. Most likely, the loads will fall in between those, for example running all the threads from a process on a single core. The currency of this chip is the thread, not the MIP.
You can get tricky though: Niagara has the ability to partition the different cores on a chip in the same way you can partition a Sun 15K's processors. If you want two cores dedicated to your web server, four to the JVM, and two others to the database, no problem, you can do it in the same way you could with multiple sockets. It also has some inherent fault isolation, but you can't mirror cores, yet.
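As a rough illustration of that two/four/two split, the sketch below carves the chip's 32 hardware threads into three groups of whole cores. The partition names, the thread numbering (four consecutive IDs per core) and the partition_threads helper are all assumptions made for the example; none of them is Sun's actual interface.

```python
CORES = 8
THREADS_PER_CORE = 4

# Hypothetical partition plan: partition name -> number of whole cores.
PLAN = {"web": 2, "jvm": 4, "db": 2}


def partition_threads(plan, cores=CORES, threads_per_core=THREADS_PER_CORE):
    """Give each partition whole cores; return name -> hardware thread IDs."""
    if sum(plan.values()) > cores:
        raise ValueError("plan asks for more cores than the chip has")
    layout = {}
    next_core = 0
    for name, n_cores in plan.items():
        ids = []
        for core in range(next_core, next_core + n_cores):
            first = core * threads_per_core  # assumed numbering scheme
            ids.extend(range(first, first + threads_per_core))
        layout[name] = ids
        next_core += n_cores
    return layout


layout = partition_threads(PLAN)
for name, ids in layout.items():
    print(f"{name}: {len(ids)} threads, IDs {ids[0]}-{ids[-1]}")
```

Keeping each partition on whole cores means the threads inside it share an L1, which is the grouping the chip appears to favour.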
Each of these cores has a 24KB three cycle L1 cache, split into 16KB Instruction and 8KB Data caches, each 4 way set associative. The I-Cache has a 32 byte line size, the D-Cache 16 byte lines. The L2 is a little odder than the average cache. It is four way banked with 12 way set associativity, and data is interleaved across the banks in 64 byte lines. The size of the L2 is 3MB, or about 400K per core, but since this is shared, simple maths does not quite tell the whole story. There was a tradeoff between the unspecified core size and the cache, and Sun put the emphasis on cores. One really interesting point came when Tremblay said that if you blow your die size budget by two millimetres on a chip like a Pentium 4, it is no big deal; one per cent over you can live with. If you blow it like that on a Niagara core, you multiply that by eight, and suddenly it is a problem.
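For the curious, the cache figures quoted above can be turned into set counts with the standard formula, sets = size / (ways x line size). This is a back-of-the-envelope sketch only: it assumes the 12 way figure applies to the L2 as a whole and that the sets divide evenly across the four banks, neither of which Sun spelled out. The num_sets helper is made up for the example.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, ways, line_bytes):
    """Sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("geometry does not divide evenly")
    return sets


icache_sets = num_sets(16 * KB, ways=4, line_bytes=32)
dcache_sets = num_sets(8 * KB, ways=4, line_bytes=16)
l2_sets = num_sets(3 * MB, ways=12, line_bytes=64)
l2_sets_per_bank = l2_sets // 4      # assumption: sets split evenly over 4 banks
l2_kb_per_core = 3 * MB // 8 // KB   # naive even split of a shared cache

print("I-cache sets:", icache_sets)
print("D-cache sets:", dcache_sets)
print("L2 sets:", l2_sets, "in total,", l2_sets_per_bank, "per bank")
print("L2 per core:", l2_kb_per_core, "KB")
```

The even split works out to 384KB per core, which is where the "about 400K" comes from.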
There are two reasons why the cache may be of less importance on Niagara than in traditional architectures: the threads, and the fact that the core features in-order execution. If you have a single core capable of executing a single thread, and it has a cache miss, you wait and wait and wait, sometimes hundreds of cycles. With out-of-order execution, a miss can also have knock-on effects on the whole instruction stream.
Because it is an in-order core, the messiness of out-of-order execution goes away, but so do a lot of the benefits, mostly the ability to hide little memory operations like cache fetches and the ability to do other things while that data is being grabbed. You also lose the ability to optimise how the program is executed; the instructions are executed in the order they are sent, not in the order that would be best.
Here is where threading helps a lot. If you have a cache miss and are facing a long wait for something to come back from memory, you just switch to another thread. That thread can execute its instruction stream until it hits a pothole, then it hands execution off to another thread. Intel has the ability to do this between two threads on the Pentium 4 with hyperthreading, and Niagara has four threads running in parallel per core. To make up numbers, if a cache miss takes 100 cycles, and on average each thread can execute for 25 cycles before it needs to hit main memory, in theory you should completely hide memory latency.
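Those made-up numbers can be run through a toy model. The sketch below assumes an idealised round robin: each thread runs for a fixed burst, stalls for a fixed miss time, and switching between threads costs nothing. The function names are invented for the example, and the model is far simpler than real hardware.

```python
def core_utilisation(threads, run_cycles, miss_cycles):
    """Fraction of cycles the core is busy in the idealised model.

    Each thread runs for run_cycles and then stalls for miss_cycles,
    so one thread's full period is run_cycles + miss_cycles. The
    threads together can fill at most threads * run_cycles of it.
    """
    busy = threads * run_cycles
    period = run_cycles + miss_cycles
    return min(1.0, busy / period)


def threads_to_hide(run_cycles, miss_cycles):
    """Smallest thread count that keeps the core busy all the time."""
    period = run_cycles + miss_cycles
    return -(-period // run_cycles)  # ceiling division


one_thread = core_utilisation(1, 25, 100)
four_threads = core_utilisation(4, 25, 100)
needed = threads_to_hide(25, 100)

print(f"1 thread:  {one_thread:.0%} busy")
print(f"4 threads: {four_threads:.0%} busy")
print(f"threads needed for 100%: {needed}")
```

Under these assumptions a lone thread keeps the core busy a fifth of the time, and four threads bring that to four fifths, since the first thread's data does not arrive until cycle 125 while the other three only fill cycles 25 to 100. In this model it would take a fifth thread to cover the miss completely.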
In the real world this won't happen. You will take a hit from memory access time, but with four threads, that is greatly minimised. If Sun ran the numbers right, a small cache should be enough to get them by, even if it is not anywhere near the size of a modern single threaded CPU's cache. It is a different idea, and comparing it directly to the structures of current chips is not as valid as you might think.
The cache structure has some interesting effects on the way processes interact, and lets you do things that were impractical in modern multi-CPU systems. Passing data between threads on a core is very fast - it is just an L1 read. Passing data to another process on another core is about 20 times faster than passing it between CPUs on an SMP system. Since the L2 is shared, it just dumps the data to L2, and the next core can read it.
This may not seem like a big deal: it is faster than current systems, but other than speed, it more or less does the same thing. Inter-thread communication is one of the areas where I think Niagara will have a steep learning curve for programmers to get optimal performance out of the silicon. When a person programs for a current SMP system, there are huge efforts made to localise memory access, and penalties to go to remote CPUs can be huge. Even in the best of systems, this penalty can be substantial, ranging from noticeable to effectively bringing things to a grinding halt.
Programmers have for decades worked around these issues, and the thought processes for programming massive NUMA systems are all geared around working with these limits. The tools are also set to work with these ideas. In comes Niagara and potentially throws this out the window, with virtually no penalty for passing data between four threads, and minor delays between cores. It will take some catching up from the software folk.
The massively multithreaded architecture also does things to the cores themselves. With a lessened effective penalty for memory misses you can make your branch prediction less aggressive, which means easier development and a smaller die. The whole architecture of the Niagara family is based around threads, not the other way around. The hardware is built to facilitate massive multi-threading, it is not added on as a feature.
Rather than looking at this as a chip that runs at a given clock, imagine it is a thread engine with optimal groupings. Four threads work well together, almost as if they were one. You can have groups of these groups interacting with a slight delay, but nothing huge, and they can run in parallel with no loss. Instead of the old bumper sticker that said "visualise whirled peas", you could sum Niagara up with the bumper sticker "visualise thread groups".
This whole sea change is no more than words if the OS running on the chip does not effectively use the shortcuts provided by the chip. If you can pass things between threads with no penalty, it does you no good if the OS still schedules things to happen while the threads should be waiting. A disconnect between the hardware and software would be a very bad thing here.
Luckily for Sun, the two main OSes that will run on the chip, Solaris and Linux, are either owned by them, or completely open to them. In the case of Solaris, Marc Tremblay talked about the OS getting feedback directly from the hardware. If a core is overloaded, it could potentially signal the OS to move threads to another core, and if it is underloaded, it could request more work. There are some extremely interesting optimisations that can come from this, and I suspect it will be a subject of academic papers for years to come.
The biggest question in my mind about Niagara had to do with feeding the beast. This is a new chip, and a new paradigm, but it looks to use the same old I/O mechanisms. Niagara will have several DDR2 controllers on board, and Sun would not get into more detail about what 'several' means. There should be a lot of bandwidth available to the cores, but is it enough?
Sun has always been known for having a lot of bandwidth in its systems. You absolutely need to if you are going to make machines like the 10K and 15K. Anything more than a few CPUs in a box simply needs the fattest pipe you can throw at them to function. Sun is also known to have CPUs that don't fight for the top SPEC numbers individually. Let's just say the CPU power to bandwidth ratio is heavily leaning toward the bandwidth side.
When you move to Niagara type chips, Sun is in a unique position. Each Niagara core is said to be at least as powerful as a current UltraSPARC 3 chip. It stands to reason that if you have eight of these in a socket, you need at least eight times the bandwidth of a single core. This means lots of memory controllers and lots of pins. Niagara certainly has that, but since current SPARCs don't need all the bandwidth they have, Sun was able to get away with an unspecified multiple of the current per chip bandwidth for Niagara without compromising performance. Think less than ten times, but that is still a lot.
In the future, other members of the Niagara family will use different memory technologies. One that was mentioned was FB-DIMMs. FB-DIMMs allow for huge memory capacity with low pin counts, but you take a latency hit. Luckily, Niagara type architectures can mask that latency very well, so the two technologies are an ideal match for each other.
This brings up another interesting tradeoff. In Niagara type designs, if you have 6 FB-DIMM channels, nothing out of the question for a system like this, that gives you 48 DIMMs to plug into the system, 96GB if you use 2GB DIMMs. To make up more numbers, if Niagara costs $1000 and the memory only costs $500 a DIMM, you have $24,000 in memory. This puts the cost of the chip in the same price category as rounding errors, basically if Sun doubles the cost, will anyone notice? In some ways, this is a very enviable position to be in for a chip maker.
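The sums behind that are simple enough to check. In the sketch below, the eight DIMMs per channel figure is an assumption implied by 48 DIMMs across 6 channels, and the prices are the same made-up ones used above.

```python
CHANNELS = 6
DIMMS_PER_CHANNEL = 8   # assumption: implied by 48 DIMMs over 6 channels
DIMM_GB = 2
DIMM_PRICE = 500        # made-up dollar figure from the text
CHIP_PRICE = 1000       # made-up dollar figure from the text

dimms = CHANNELS * DIMMS_PER_CHANNEL
capacity_gb = dimms * DIMM_GB
memory_cost = dimms * DIMM_PRICE

# Share of the chip-plus-memory bill taken by the chip, now and at double price.
chip_share = CHIP_PRICE / (CHIP_PRICE + memory_cost)
chip_share_doubled = 2 * CHIP_PRICE / (2 * CHIP_PRICE + memory_cost)

print(f"{dimms} DIMMs, {capacity_gb}GB, ${memory_cost:,} of memory")
print(f"chip share of the bill: {chip_share:.1%}")
print(f"chip share if the chip price doubles: {chip_share_doubled:.1%}")
```

On these numbers the chip is about 4 per cent of the combined bill, and doubling its price only pushes that to a little under 8 per cent.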
In addition to the memory controllers, Sun will be adding a few other features to the chip. On the first iteration, there will be an Ethernet controller. Future versions are said to have 10Gb Ethernet controllers and on board encryption capabilities. Who would have thought we would have multi-core system-on-a-chip machines before the single ones were available, much less starting the trend in the server space?
If you wonder about a system with multiple Niagaras in it, don't: Niagara is a single chip only family. There is no SMP built in, nor will there ever be. For that you have to wait another two years or so till the Rock family debuts. At that point, the Rock chips will take over from the current UltraSPARC line, and allow multiple CPUs in a box.
So, with a single CPU in a box that allows for massive numbers of concurrent threads, and large memory capacities, what markets is Niagara aimed at? Marc Tremblay repeatedly mentioned that Niagara was 'network facing' not 'data facing', which will be the domain of Rock. This means things that you can hit directly with a web browser which individually do not require huge number crunching ability, but are present in great quantity. Searches, web page serving and streaming media were all mentioned as good candidates. If you need to service thousands of the same task a second, Niagara should shine. If you want to crunch huge databases, wait for Rock.
Throughout this, you may have gotten the impression that the Niagara chips are not massive FP crunching cores, they are meant to pass large amounts of moderate tasks through the system at high speed. When Sun was modelling the chip, it came to an interesting conclusion, that performance is not all that clock sensitive.
The chip was modelled at 1, 1.5 and 2GHz, and the end result showed that the clock rate had a minimal impact on performance. With that in mind, Niagara was optimised for throughput, not clock speed. The end result will be under 2GHz, but Sun would not give an exact figure.
The last thing is the chip makes far more efficient use of the resources available. If you open up Task Manager in Windows, you will see the CPU use typically hovers around zero, and then spikes to near 100%, and drops back down. This behaviour can be described as peaky, or if you want to be less charitable, very inefficient.
Niagara takes a different approach, one that aims to keep all the resources of the chip as busy as they can be for as long as you have data to do it with. This means in Niagara chips, the peak load will be very close to the average load, and the peak power use will be close to the TDP. If there is anything unused, the chip will throw a thread at it. Look for the first generation of Niagaras to consume around 60 watts, give or take a bit. This is considerably less than most high performance PC CPUs on the market now.
So, what is Sun giving us with Niagara? It is the polar opposite of the Pentium 4. Instead of speed at all costs, it is threads at all costs, and clock speed may happen. You don't get the same problems where the only way to pump more data through is to crank up the clock, something which is looking increasingly hard to do. Niagara instead goes wider and slower per thread, but makes up for it in quantity.
If you are thinking of this in terms of a current Xeon or Opteron, the comparison is simply invalid. Any benchmark run on the single threaded chips won't have enough threads to make Niagara flex its muscles, and any benchmark made for Niagara would probably crush a Xeon with the overhead.
Niagara signals the first volley in a new way of thinking about servers, or at least a class of servers. The whole idea of Disruptive Threads, as Marc Tremblay calls it, is quite real, and it will take a while for many people to understand, much less use. Threads, threads and more threads, that is what this chip lives for. µ