Here's a technical look at Niagara II

Take a deep breath
Mon Sep 10 2007, 23:10
SUN ANNOUNCED Niagara 2 the other day, an evolution of the older Niagara 1, now called the UltraSPARC T1. From the 10,000-foot view, it all looks quite familiar, but once you delve into the details, it quickly becomes apparent that almost everything has changed.

The gross overview is eight in-order cores on a die, each capable of running multiple threads, same as Niagara (N1). Niagara 2 (N2) does quite a lot more: it executes two groups of four threads per core instead of a single group of four, and it pulls large chunks of the north bridge on board.

In terms of raw die size, it hasn't changed all that much: Niagara was 378 mm^2 and Niagara 2 is 342 mm^2. Once you factor in that Niagara the elder was on a 90nm process and Niagara 2 is on 65nm, you can see that the new chip would be about twice as big on a similar process. It also runs at a similar speed, with 1.2 and 1.4GHz as the launch frequencies.
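To put a rough number on that "about twice as big" claim, here is a back-of-envelope sketch. It assumes die area scales with the square of the feature size, which real designs never quite manage, so treat the result as a ballpark only. The function name `scale_area` is made up for this example.

```python
# Back-of-envelope die area scaling. Assumption: area shrinks with the
# square of the process feature size (an idealisation, not a Sun figure).

N1_AREA_MM2 = 378.0   # Niagara 1 die area on 90nm (from the article)
N2_AREA_MM2 = 342.0   # Niagara 2 die area on 65nm (from the article)
N1_NODE_NM = 90.0
N2_NODE_NM = 65.0


def scale_area(area_mm2: float, from_nm: float, to_nm: float) -> float:
    """Return the area a die would occupy if moved between process nodes."""
    return area_mm2 * (to_nm / from_nm) ** 2


# What N2 would measure if it were built on N1's 90nm process.
n2_at_90nm = scale_area(N2_AREA_MM2, N2_NODE_NM, N1_NODE_NM)
ratio = n2_at_90nm / N1_AREA_MM2

print(f"N2 at 90nm: about {n2_at_90nm:.0f} mm^2")
print(f"That is about {ratio:.1f}x the N1 die")
```

Under that idealised assumption N2 comes out somewhat under twice the size of N1, which is in the same neighbourhood as the "about twice" estimate.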


The package is the main physical difference. The old chip had almost 2000 pins, 1933 to be exact, 1111 of which were signal I/O. Niagara 2 has 1831, of which 711 are signal I/O, quite a drop for increased functionality. The key to this pin drop is the change in memory type: N2 went from N1's four DDR2 channels to four dual channel FB-DIMM (FBD) controllers. The pin count goes from 850 for DDR2 to 432 for FBD. This is where most of the pin count drop comes from, and it validates a lot of what Intel initially claimed for FBD. You get twice the bandwidth for half the pins.
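The pin arithmetic is worth a quick check. The sketch below uses only the counts quoted above, and assumes the memory pin figures are counted as part of the signal I/O totals, which the article implies but does not state outright. The "twice the bandwidth" factor is taken at face value.

```python
# Pin budget comparison using the counts quoted in the article.
# Assumption: the memory pins are a subset of the signal I/O pins.

N1_SIGNAL_PINS = 1111
N2_SIGNAL_PINS = 711
N1_MEMORY_PINS = 850   # DDR2 on Niagara 1
N2_MEMORY_PINS = 432   # FB-DIMM on Niagara 2

signal_drop = N1_SIGNAL_PINS - N2_SIGNAL_PINS
memory_drop = N1_MEMORY_PINS - N2_MEMORY_PINS

# Signal pins that are not memory, before and after.
n1_other = N1_SIGNAL_PINS - N1_MEMORY_PINS
n2_other = N2_SIGNAL_PINS - N2_MEMORY_PINS

pin_ratio = N2_MEMORY_PINS / N1_MEMORY_PINS

# "Twice the bandwidth for half the pins", taken at face value.
bandwidth_per_pin_gain = 2.0 / pin_ratio

print(f"Signal pins dropped by {signal_drop}, memory pins by {memory_drop}")
print(f"Non-memory signal pins: {n1_other} on N1, {n2_other} on N2")
print(f"FBD uses {pin_ratio:.0%} of the DDR2 pins")
print(f"Bandwidth per memory pin: about {bandwidth_per_pin_gain:.1f}x")
```

If that assumption holds, the memory interface saves more pins than the package as a whole, so the non-memory I/O actually grew a little to cover the new on-board PCIe and Ethernet.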

The chip itself still has eight physical cores, but each of those is highly modified. Instead of having a single integer execution pipe per core and a separate FP unit shared among all eight as N1 did, N2 has two full pipes. These pipes are not a duplication of the core itself, they share a lot of resources both before and after the part of the pipeline that does the computation.

Instead of putting in 16 cores each capable of four threads, Sun put in eight cores capable of eight threads each. The cores can execute two instructions, one per pipeline, and each core has a full floating point unit this time around. This saves a ton of die space both for cores and supporting units. How those threads are loaded, grouped and executed is quite different from N1.

Each core has a 16K L1 I (instruction) cache and an 8K L1 D (data) cache, with associativity of the I-cache upped to eight from the four of N1. The D-cache has four-way associativity and is write-through, and all of the cores share a 4MB L2. As you can see above, this is divided into 8 banks and each bank is 16-way associative; N1 had 4 banks that were 12-way associative. Sun claims that if they had stuck with the four banks of N1, you would have seen up to a 15% performance loss.
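To get a feel for what eight banks buys you, here is a small sketch of the L2 geometry. The article does not give the line size or how addresses map to banks, so the 64-byte line and the simple low-order interleave below are assumptions made for illustration, and `bank_of` is a hypothetical helper.

```python
# L2 geometry from the article: 4MB, 8 banks, 16-way associative.
L2_BYTES = 4 * 1024 * 1024
L2_BANKS = 8
L2_WAYS = 16

# Assumption, not from the article: 64-byte cache lines.
LINE_BYTES = 64


def sets_per_bank() -> int:
    """Number of sets in each bank under the assumed line size."""
    return L2_BYTES // L2_BANKS // L2_WAYS // LINE_BYTES


def bank_of(address: int) -> int:
    """Hypothetical mapping: consecutive lines rotate through the banks."""
    return (address // LINE_BYTES) % L2_BANKS


# Eight consecutive cache lines land on eight different banks, so
# streaming accesses from many threads spread out instead of queueing.
banks_touched = [bank_of(line * LINE_BYTES) for line in range(8)]
print("sets per bank:", sets_per_bank())
print("banks for 8 consecutive lines:", banks_touched)
```

With only four banks under the same mapping, every second line would collide with an earlier one, which is the sort of contention Sun's 15% figure is presumably pointing at.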

The last new part of the core added was a cryptographic engine that operates more or less in parallel with the main pipelines. The goal was to be able to keep both of the 10GigE ports sending and receiving fully encrypted data for 'free'. It is quite an interesting architecture.

The cores are connected to the memory banks with an asymmetric 8x9 crossbar. The 8 side of things are the cores, the 9 are the L2 banks and another port for I/O. Each pair of cache banks is connected to an FBD controller, which in turn is connected to two FBD channels. There is a ton of bandwidth here.


It takes three steps for the crossbar to start passing data, they are Request, Arbitrate and Grant. Each takes a clock, and consecutive writes don't need to re-arbitrate. Each core can have 16 cache misses outstanding, 8 from the I and 8 from the D cache. The crossbar is completely non-blocking. In case of conflict, priority is given to the oldest request.
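The oldest-first rule is easy to sketch. The code below is a toy model and not Sun's arbiter: the `Request` fields, the `arbitrate` function and the tie-break on core number are all invented for illustration. It only shows how one grant per bank per round, oldest request first, lets requests to different banks proceed side by side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    core: int     # requesting core, 0-7
    bank: int     # target crossbar port (L2 bank or I/O)
    issued: int   # cycle on which the request was made


def arbitrate(pending):
    """Grant at most one request per bank, oldest first.

    Returns (granted, still_waiting). Ties on age are broken by core
    number, which is an assumption made only to keep this deterministic.
    """
    granted = {}
    for req in sorted(pending, key=lambda r: (r.issued, r.core)):
        granted.setdefault(req.bank, req)
    waiting = [r for r in pending if granted[r.bank] is not r]
    return list(granted.values()), waiting


pending = [
    Request(core=0, bank=3, issued=10),
    Request(core=5, bank=3, issued=7),   # older, same bank as core 0
    Request(core=2, bank=1, issued=9),   # different bank, no conflict
]
granted, waiting = arbitrate(pending)
print("granted:", granted)
print("waiting:", waiting)
```

Core 5 wins bank 3 because its request is older, core 2 gets bank 1 in the same round, and core 0 waits for the next one.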

The asymmetry comes in because when the cores read, there is 180 GB/s of bandwidth available for reading, but when they write, there is half of that, 90GB/s on tap. This is because in modern computing, there are more reads than writes. FBD has a similar favoring of read bandwidth.
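Those two figures fall out of simple arithmetic if you guess at the port widths. The sketch below assumes the 1.4GHz launch clock, eight core-side ports, 16 bytes per clock on the read path and 8 bytes per clock on the write path. The widths are an assumption chosen because they reproduce the quoted numbers; the article does not state them.

```python
# Crossbar bandwidth estimate. The per-port widths are assumptions.
CLOCK_GHZ = 1.4              # top launch frequency from the article
PORTS = 8                    # one crossbar port per core
READ_BYTES_PER_CLOCK = 16    # assumed read path width per port
WRITE_BYTES_PER_CLOCK = 8    # assumed write path width per port

# bytes per clock * clocks per nanosecond = GB/s
read_gbs = PORTS * READ_BYTES_PER_CLOCK * CLOCK_GHZ
write_gbs = PORTS * WRITE_BYTES_PER_CLOCK * CLOCK_GHZ

print(f"read:  about {read_gbs:.0f} GB/s")
print(f"write: about {write_gbs:.0f} GB/s")
```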

Also peripheral to the core is integrated PCIe, in this case there are eight lanes. As you can see, they are connected through the System Interface Unit (SIU) which is tied directly in to the L2s and the I/O crossbar port. The two 10 GigE ports are similarly connected, and in aggregate have the same bandwidth.


Moving back to the cores themselves, the pipelines have a new stage called pick. This stage picks up to two threads, one from each of two thread groups, for execution. Overall, the integer pipe is 8 stages and the FP pipe is 12. A few instructions like fdiv and sqrt take longer, 22 clocks for single precision and 36 for double, but most take only 8 or 12 clocks.

How the instructions are executed is a big change from N1. If you recall, N1 simply did a round robin switch: it executed thread 0 for a bit, then went on to 1, then 2 and 3. It was a pretty simple scheme. If a thread was waiting, N1 skipped it, and it had a Least Recently Used (LRU) algorithm for picking the next thread. N2 does things in a much more intelligent fashion. In the added pick stage of the pipeline, each of the two pipes picks a thread to execute. An LRU algorithm is employed to ensure fairness as well.

The two pipelines, called EXUs in Sun jargon, can each execute four threads at a time for a total of eight per core. These threads are grouped into two groups of four threads each, and threads cannot move between groups. The picker picks one out of a pool of four to run each clock.

To smooth things out, the picker will skip over stalled threads. If a thread is waiting on memory or something else, it will flag this status and the picker will move over it. In any case, the picker works on a single cycle granularity, ideally it will give an execution slot to a thread every fourth cycle.
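As a rough illustration of the pick stage, here is a minimal sketch of one picker working on one thread group of four. It is a toy model under stated assumptions: the class name `Picker` and its interface are made up, and the real hardware tracks far more state than a single ready flag per thread.

```python
class Picker:
    """Least-recently-picked selection over one thread group."""

    def __init__(self, n_threads: int = 4):
        # Front of the list is the thread picked longest ago.
        self.order = list(range(n_threads))

    def pick(self, ready):
        """Return the least recently picked ready thread, or None."""
        for tid in self.order:
            if tid in ready:
                # Move the winner to the back so the others get a turn.
                self.order.remove(tid)
                self.order.append(tid)
                return tid
        return None  # every thread is stalled: the slot goes empty


picker = Picker()

# All four threads ready: each gets a slot every fourth cycle.
all_ready = [picker.pick({0, 1, 2, 3}) for _ in range(4)]

# Thread 1 stalls on memory: the picker skips over it.
one_stalled = [picker.pick({0, 2, 3}) for _ in range(4)]

# Thread 1 comes back and is first in line, as it has waited longest.
after_stall = picker.pick({0, 1, 2, 3})

print(all_ready, one_stalled, after_stall)
```

The point of the least-recently-picked order is that a thread coming back from a stall does not have to wait behind threads that kept running while it was away.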

One other thing the picker does is resolve conflicts before they happen. There are actually two pickers, one for each pipeline, and each looks at one thread group. Potential conflicts tend to be avoided before they happen, smoothing things out. The picker will only pick threads in a ready state; other conflicts are avoided at the decode stage.

Here for example, if the decoder sees two loads at the same time, it will stall one for a clock because there is only one LSU per core. Avoiding conflicts before they happen is a much better strategy than fixing them while they are happening.
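The load conflict can be sketched in a few lines. This is a simplification with invented names: `decode` and the `favored_pipe` argument are hypothetical, and the article does not say how the hardware chooses which load goes first.

```python
# Operations that need the core's single load/store unit. The article
# only mentions loads, so that is all this sketch models.
LSU_OPS = {"load"}


def decode(op0: str, op1: str, favored_pipe: int):
    """Return (pipes_issued, pipes_stalled) for one decode cycle.

    op0 and op1 are the operations picked for pipe 0 and pipe 1.
    favored_pipe says which pipe wins a conflict (an assumption).
    """
    if op0 in LSU_OPS and op1 in LSU_OPS:
        # Only one LSU per core: one load goes, the other waits a clock.
        return [favored_pipe], [1 - favored_pipe]
    return [0, 1], []


print(decode("load", "load", favored_pipe=0))
print(decode("load", "add", favored_pipe=0))
```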

The second to last pipeline stage is called Bypass. What this does is that if a load hits the D cache, the result of the operation will be forwarded to it directly from the bypass stage rather than being stored to the cache hierarchy and reloaded. This basically saves a round trip to the L1 and the associated overhead.

To ensure that threads will always have something to execute, N2 has an Instruction Fetch Unit (IFU) that fetches up to four instructions per cycle with a two cycle latency. Prefetched instructions go into an instruction buffer specific to a thread, and each buffer holds eight instructions. Because the buffers are so small and specific to the thread, they can be made fast and they simplify the picker, a win/win for N2 designers.

One interesting note, the picker and the IFU are separate units. This means that the IFU can fetch an instruction independently of the pick unit. If you think about the instructions being a pool, the IFU can put things in the pool while the picker takes them out, and they will not step on each other's toes.
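The pool idea can be sketched as a small per-thread queue that the IFU fills from one end while the picker drains the other. The class and method names below are hypothetical, and the two cycle fetch latency is left out to keep the example short.

```python
from collections import deque

BUFFER_ENTRIES = 8   # instructions held per thread (from the article)
FETCH_WIDTH = 4      # instructions fetched per cycle (from the article)


class InstructionBuffer:
    """Per-thread buffer sitting between the IFU and the picker."""

    def __init__(self):
        self.slots = deque()

    def free(self) -> int:
        return BUFFER_ENTRIES - len(self.slots)

    def fill(self, instructions) -> int:
        """IFU side: accept as many instructions as fit, return the count."""
        taken = min(len(instructions), FETCH_WIDTH, self.free())
        self.slots.extend(instructions[:taken])
        return taken

    def take(self):
        """Picker side: remove the oldest instruction, or None if empty."""
        return self.slots.popleft() if self.slots else None


buf = InstructionBuffer()
first = buf.fill(["i0", "i1", "i2", "i3", "i4", "i5"])   # capped by fetch width
second = buf.fill(["i4", "i5", "i6", "i7", "i8"])        # capped by free space
full = buf.fill(["i8"])                                  # buffer is full
oldest = buf.take()                                      # picker frees a slot
refill = buf.fill(["i8"])                                # IFU can fetch again
print(first, second, full, oldest, refill)
```

Neither side needs to know what the other is doing beyond how many slots are free, which is what keeps both units simple.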

The theoretical result is that if there are no other conflicts, the picker will pick what it can when they are ready, and things only get sent down the pipe when all the data they need is available. Ideally, there will be no pipeline flushes, but in the real world, that never happens. The number of flushes should be greatly reduced, replaced with other threads or simply nothing if all threads are flagged as waiting.

That brings us to the parts of the core that talk to the outside world, networking. N2 has two 10/1 GigE ports with a few unique features. First, all data is sourced from and destined to main memory, DMA in the parlance. This means a core sets up the transfer and gets out of the way.

If you look at the crossbar picture above, you will see that the path to memory goes from the Ethernet unit (NIU), to the SIU, directly into the L2 or the crossbar. The CPU does not need to get in the way of things, it can just set up the DMA and move back to number crunching. This fits in well with the parallel design of Niagara, everything is in motion at once.

One of the more important realizations of this philosophy is the Stream Processing Unit (SPU), the place where all the cryptography happens. This is rather unique in modern CPUs, only Via has something similar in their x86 processors. Each core has an SPU, which sits off to the side of the core itself, sharing only the crossbar port and some FPU functionality.

The point of it is nothing less than full line speed (20 Gbps) encryption for 'free'. In essence, each core can do eight threads plus crypto at the same time. Sun thinks that if encryption is a no-cost checkbox, people will use it, and it can do a lot of good. I find this point hard to argue with.

There are two units in the SPU, a modular arithmetic unit and a hash/cipher engine. The modular arithmetic unit does things like RSA and Elliptic Curve (ECC), while the hash/cipher engine handles RC4, DES/3DES, AES-128/192/256, MD-5, SHA-1 and SHA-256. Modular arithmetic and ciphers/hashes can be done in parallel, so if you need both, N2 will be quite fast.

Cryptography works almost totally independently from the main functions: the cores set up a control word in memory. It has pointers to the source and result locations, keys, and all the other necessary info to complete the operation. Each core has one queue, and the SPU is hyperprivileged, meaning it can only be programmed by the hypervisor or things with the same privilege level. This allows it to be shared by all eight threads on a core while maintaining security and separation among the threads.

The SPU goes along its merry way encrypting and decrypting autonomously, and the core is only interrupted to set up a packet now and again. The packets are queued in main memory so they can be written or read without the other side having to pause.
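To make the control word idea concrete, here is a loose software analogy. Nothing in it reflects the real control word layout: the `ControlWord` fields, the `core_submit` and `spu_step` functions and the flat `memory` array are all invented for illustration, and SHA-256 from Python's standard `hashlib` stands in for the hardware hash engine.

```python
import hashlib
from collections import deque
from dataclasses import dataclass


@dataclass
class ControlWord:
    """Hypothetical descriptor: what to do and where the data lives."""
    op: str       # operation name, only "sha256" in this sketch
    src: int      # offset of the source data in memory
    length: int   # number of source bytes
    dst: int      # offset where the result should be written


memory = bytearray(256)   # stand-in for main memory
queue = deque()           # the per-core queue of control words


def core_submit(cw: ControlWord) -> None:
    """Core side: queue the work and get back to number crunching."""
    queue.append(cw)


def spu_step() -> bool:
    """SPU side: process one control word. Returns False when idle."""
    if not queue:
        return False
    cw = queue.popleft()
    data = bytes(memory[cw.src:cw.src + cw.length])
    if cw.op != "sha256":
        raise ValueError(f"unsupported operation: {cw.op}")
    digest = hashlib.sha256(data).digest()
    memory[cw.dst:cw.dst + len(digest)] = digest
    return True


memory[0:5] = b"hello"
core_submit(ControlWord(op="sha256", src=0, length=5, dst=64))
did_work = spu_step()
result = bytes(memory[64:96])
print(did_work, result.hex())
```

The core never touches the data after queueing the control word, and the result simply appears in memory at the address it named.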

How do they do? Well, they achieve the goals of line speed encryption, and in most cases, well exceed the magic 20Gbps that you need to flood the NICs. For AES, you can do about 44Gbps with a 128b key, 256b drops that to around 30Gbps. RC4 flies through at 80Gbps and Hashes are at around 40Gbps. These are aggregate numbers for all eight SPUs, one for each core.
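Putting those figures next to the 20Gbps target shows how much headroom there is. The sketch below uses only the numbers quoted above, and the per-SPU split assumes the work divides evenly across all eight units.

```python
LINE_RATE_GBPS = 20.0   # two 10GigE ports
SPU_COUNT = 8           # one SPU per core

# Aggregate throughput figures quoted in the article, in Gbps.
aggregate_gbps = {
    "AES-128": 44.0,
    "AES-256": 30.0,
    "RC4": 80.0,
    "hashes": 40.0,
}

# Multiple of line rate each algorithm can sustain.
headroom = {name: gbps / LINE_RATE_GBPS for name, gbps in aggregate_gbps.items()}

# Assumption: load is spread evenly across the eight SPUs.
per_spu_gbps = {name: gbps / SPU_COUNT for name, gbps in aggregate_gbps.items()}

for name in aggregate_gbps:
    print(f"{name}: {headroom[name]:.1f}x line rate, "
          f"{per_spu_gbps[name]:.2f} Gbps per SPU")
```

Even the slowest case, AES with a 256 bit key, clears line rate by half again.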

For the modular arithmetic side of the house, the SPUs can crank out about 92K ECC keys per second with a 163 bit key length and up to 37K RSA-1024 operations per second.

On paper, Sun looks like it hit the mark with N2, and should achieve line speed encryption for a very low cost, if not for the 'free' they were aiming for. My guess is you will run out of CPU power before you run out of results to encrypt.

To get this in and out of the chip, the NICs have some interesting features that tie into all of this as well. First off, they have 32 DMA channels, 16 transmit and 16 receive so there can be a lot of things in flight. Packets are classified at line rate on layers 1-4 of the stack and assigned to their owning thread at a rate of about 30M packets per second. The NIC is fully thread aware, and works with rather than against the core threading philosophy.
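Sun has not detailed the classifier here, so the following is only a sketch of the general idea of steering a flow to a DMA channel and from there to a thread. The CRC32 hash, the `classify` function and the channel-to-thread table are all assumptions for illustration and say nothing about how the real NIU does it.

```python
import zlib

RX_DMA_CHANNELS = 16    # receive channels (from the article)
HARDWARE_THREADS = 64   # 8 cores x 8 threads (from the article)

# Hypothetical binding: spread the receive channels across the threads.
channel_to_thread = {ch: ch * 4 for ch in range(RX_DMA_CHANNELS)}


def classify(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
             proto: str) -> int:
    """Map a flow to a receive DMA channel with a simple hash.

    Using CRC32 here is an assumption. The point is only that every
    packet of one flow lands on the same channel, and so the same thread.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % RX_DMA_CHANNELS


flow = ("10.0.0.1", "10.0.0.2", 40000, 443, "tcp")
channel = classify(*flow)
thread = channel_to_thread[channel]
print(f"flow goes to DMA channel {channel}, owned by thread {thread}")
```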

In addition, the NICs are virtualization friendly, interrupts can be bound to hardware threads directly. What you end up with is a smooth flow of data from the NIC to memory to the thread directly with little or no work needed to ensure it gets where it should go. Not only does the chip have all the pieces, but they appear to work together correctly instead of fighting.

With all of this going on, one might expect that there is a fair bit of power consumed. The process shrink helps a lot here, so with about twice the transistors, you end up in the same power envelope as N1. The chip takes about 60W typical with an absolute worst pathological case of 123W.

N2 also has some power saving features, the most familiar is probably going to be thermal throttling. If the core's two thermal diodes get too hot, there will be stalls injected into the instruction stream. There are three pins that control this, and you can inject from 0-7 stalls in a window of 8 clocks. If the decoder idles from a stall, there is a bubble of no work that passes down the pipeline reducing power used.
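The throttle setting translates directly into a duty cycle. In the sketch below the three pins are read as a plain binary count and the stalls are bunched at the start of the window. Both choices are assumptions, since the article gives only the 0-7 range and the 8 clock window.

```python
WINDOW_CLOCKS = 8


def stalls_from_pins(p2: int, p1: int, p0: int) -> int:
    """Read three control pins as a binary count from 0 to 7 (assumed)."""
    return (p2 << 2) | (p1 << 1) | p0


def window(stalls: int):
    """One 8-clock window. Placing stalls first is an assumption."""
    return ["stall"] * stalls + ["issue"] * (WINDOW_CLOCKS - stalls)


def issue_fraction(stalls: int) -> float:
    """Share of clocks in which the decoder still issues work."""
    return (WINDOW_CLOCKS - stalls) / WINDOW_CLOCKS


stalls = stalls_from_pins(1, 0, 1)
print(stalls, window(stalls), issue_fraction(stalls))
```

Note that even the highest setting leaves one issue slot in every window, so the core slows to a crawl but never stops outright.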

While this is the most obvious power saving feature, there are others. N2 has a lot of dynamic clock gating and can do some instruction speculation to save power. In general, where N1 was more or less passive in power management, N2 does things in a much more active fashion.

What you end up with is a new chip that outdoes the old by a large margin, not the usual incremental bump. Single threaded performance is said to go up about 1.4x and performance per watt has gone up by 2x, likely to be overwhelmed on a system level by FBD power though. FP throughput went up by over 10x, but considering there are 8x the units, this is not as big a deal as it might appear.
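The FP claim is easy to put in perspective. Taking "over 10x" as a floor of 10x and dividing by the eightfold increase in FP units gives the gain per unit. This is a deliberately crude split that ignores clock speed and everything else that changed.

```python
FP_THROUGHPUT_GAIN = 10.0   # "over 10x", so treat this as a floor
N1_FP_UNITS = 1             # one FP unit shared by all eight cores
N2_FP_UNITS = 8             # one FP unit per core

unit_increase = N2_FP_UNITS / N1_FP_UNITS
per_unit_gain = FP_THROUGHPUT_GAIN / unit_increase

print(f"{unit_increase:.0f}x the units, at least {per_unit_gain:.2f}x per unit")
```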

It looks like Sun delivered on what it set out to do. The shortcomings of Niagara were addressed in Niagara 2, and a lot of interesting features were added. Sun is positioning N2 as a network facing chip, not a back end number cruncher, and it looks to be able to fill that role with far fewer caveats than its predecessor. µ