JUST OVER A YEAR ago, Tilera launched its 64-core chip, the Tile64. Now the firm is putting out its successor, the TilePro64, and that too has 64 cores, but everything else seems revamped.
Before you go rushing out to buy a few, note that this is not an x86 chip. It is aimed at high performance embedded markets like high bandwidth network twiddling and multi-stream video applications. You probably can't buy one directly, and the things that use them tend not to advertise the fact.
The chip itself has a quite unique architecture, but the overall specs sound quite reasonable. There are two TilePros, one with 64 cores and another with 36. The 64 comes at 700 or 866MHz with 5.6MB of cache while consuming 19-23W. The little brother runs at 500MHz, has 3.2MB cache and sucks 10-16W.
This may seem pretty normal, and so far it is. Even the four DDR2 controllers, three for the 36, don't raise any red flags. The bigger chip has two XAUI ports, two GbE ports and two PCIe ports, while the smaller loses an XAUI and a PCIe. XAUI is the one thing that might raise an eyebrow, since it is a telecom oriented interface, not something you would see in a Xeon.
The core itself doesn't have much to set off alarm bells from an overview standpoint. It is RISCy, similar to MIPS, but uses its own ISA. It looks eerily similar to the Tile64 core. There are a few major changes to the core this time around: multimedia instructions, offset load/store, better memcopy, and unaligned loads.
The core looks like this
The multimedia instructions are aimed at audio and video codecs, no surprise given the target markets. Offset loads and stores make video applications go a whole lot smoother, and the memory access and copy instructions help here too. Unaligned loads do what they say: they make sure things that come in on non-cache line boundaries don't destroy performance. Networking apps do a lot of this, since packet sizes are typically all over the map.
Moving off core a bit, there are several other improvements to the uncore. The biggest one is that the TilePro64 can now stripe DDR for more bandwidth if needed, but this is user controllable. DDR2 utilization is up a claimed 30 per cent from the older non-Pro version, and memory loading is down as well. You can also now move I/O calls to userspace for lower overhead in VM situations, and copy from I/O directly to the cache of a specific tile coherently.
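To picture what striping means, here is a minimal sketch of how an address might map to one of the four DDR2 controllers with striping on and off. It is a toy model, not Tilera's actual mapping: the function names, the 8KB stripe size and the 1GB per-controller capacity are all assumptions made up for illustration.

```python
# Toy model of DDR striping. The stripe size, capacity and function names
# are illustrative assumptions, not Tilera's documented mapping.

NUM_CONTROLLERS = 4          # the article gives four DDR2 controllers for the 64
STRIPE_BYTES = 8 * 1024      # assumed stripe granularity
BYTES_PER_CONTROLLER = 1 << 30  # assumed 1GB behind each controller


def controller_striped(address, controllers=NUM_CONTROLLERS, stripe=STRIPE_BYTES):
    """Striping on: consecutive stripes rotate across the controllers."""
    return (address // stripe) % controllers


def controller_linear(address, per_controller=BYTES_PER_CONTROLLER):
    """Striping off: each controller owns one contiguous range of addresses."""
    return address // per_controller


if __name__ == "__main__":
    # A 32KB buffer starting at address 0 touches every controller when
    # striped, but only controller 0 when the mapping is linear.
    addresses = range(0, 32 * 1024, STRIPE_BYTES)
    print([controller_striped(a) for a in addresses])
    print([controller_linear(a) for a in addresses])
```

In this model a large sequential stream spreads over all four controllers when striped, which is where the extra bandwidth would come from. The linear mapping keeps a buffer behind one controller, which is presumably why the choice is left to the user.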
So far, so normal, but the TilePro is anything but. The magic is in the meshes that connect the tiles. Note the plural, there are now six meshes, up from five in the non-Pro. Each has a function, so one type of traffic will not step on another.
Note the meshes
The new mesh is dedicated to cache coherency, and lets Tilera do some interesting things. There are instructions that allow you to add two numbers and post the result directly to an adjacent core in a single clock, things you couldn't dream about in a standard multi-core configuration.
Moving between cores takes one hop per core, and all routing is deterministic. If you need to go two hops away, it will take you two clocks. Six cores, six hops, six clocks, you get the idea. The only problem with this is that you have the potential for variable latencies if you are not careful with what code you run where.
To help this out, Tilera has something it calls DDC or Dynamic Distributed Cache. Each core has an L1 and L2 cache, but it sees all the other cores on the die as a single L3 cache. Instead of going to memory, it checks to see if there is anything resident on chip first. Now you see why the coherency mesh is a real good thing.
Two features help with this, coherence domains and hash for home. Coherence domains are pretty simple. You can carve out a block of cores and assign them to be coherent with each other, so instead of one big L3, cores 32-47 share their cache, and 0-15 do as well.
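The hop arithmetic is easy to model. The sketch below assumes the 64 tiles sit on an 8x8 grid, numbered row by row, and that a message travels along one axis and then the other. The grid shape, the numbering and the function names are assumptions for illustration. Only the one-clock-per-hop rule comes from Tilera's description.

```python
# Toy latency model for a 2D mesh. Grid shape, tile numbering and names
# are assumptions; the one clock per hop figure is from the article.

GRID_WIDTH = 8  # assumed 8x8 layout for 64 tiles


def tile_coords(tile_id, width=GRID_WIDTH):
    """Return (row, col) for a tile under row-major numbering."""
    return divmod(tile_id, width)


def hop_count(src, dst, width=GRID_WIDTH):
    """Hops between two tiles: the distance along rows plus along columns."""
    src_row, src_col = tile_coords(src, width)
    dst_row, dst_col = tile_coords(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


def latency_clocks(src, dst, clocks_per_hop=1, width=GRID_WIDTH):
    """Mesh transit time in clocks, at one clock per hop."""
    return hop_count(src, dst, width) * clocks_per_hop


if __name__ == "__main__":
    print(latency_clocks(0, 1))   # neighbouring tiles
    print(latency_clocks(0, 63))  # opposite corners of the grid
```

Under these assumptions a neighbour is one clock away and the far corner is 14, which is the variable latency in question. Keeping threads that talk to each other on nearby tiles keeps the hop count down.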
Cache coherency domains
Hashing for home is a little more sophisticated, and solves the problem of one core having a critical piece of data that all the other cores need. What it does is hash the data and assign it to a core based on that hash value. This distributes the load out to any cores in the coherency domain theoretically equally.
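A minimal sketch of the idea, assuming 64-byte cache lines and a simple XOR-fold hash. Tilera has not said what hash function the hardware uses, so the hash, the line size and the names here are stand-ins. The domain is the cores 32-47 example from above.

```python
# Toy model of hash for home. The hash function, the line size and the
# names are assumptions; the real hardware hash is not described.

from collections import Counter

LINE_SIZE = 64  # assumed bytes per cache line


def home_tile(address, domain, line_size=LINE_SIZE):
    """Pick the tile in `domain` that acts as home for this address's line."""
    line = address // line_size
    # Stand-in hash: XOR-fold the line number so higher address bits also
    # influence which tile is chosen.
    folded = line ^ (line >> 4) ^ (line >> 8)
    return domain[folded % len(domain)]


if __name__ == "__main__":
    domain = list(range(32, 48))  # a 16-tile coherence domain
    # Home every line of a 256KB buffer and count the lines per tile.
    load = Counter(home_tile(line * LINE_SIZE, domain) for line in range(4096))
    for tile in domain:
        print(tile, load[tile])
```

In this model every byte of a given line maps to the same home tile, and consecutive lines land on different tiles, so no single core ends up serving all the requests for a hot block of data.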
So, with this rather odd architecture, you might think that you need to have a PhD to program it. Tilera saw this little problem coming and built a lot of tools to make up for that. Everything is written in C/C++ with an Eclipse-based IDE. The tool set is called MDE 2.0, for Multicore Development Environment.
You can either write your code to the bare metal or use one of two Linux distros. The first is nothing special, the same old penguin oriented code you know and love. The second is half way between Linux and bare metal. It is called Zero Overhead Linux, and it tries very hard to stay out of the way. What it does is go more or less to sleep and guarantee no interrupts until you call it. Quite an interesting idea to have an OS there only when you really need it, and nothing when you don't.
MDE 2.0 is supposed to be compatible with anything written in older MDEs, and the new TilePro64s are socket compatible with the older Tile64s. They will even run the same compiled code without modification, making them a drop in upgrade if you need a bit more speed from the crusty 9 month old cable head end box in the closet.
In terms of performance, Tilera is claiming 221 BOPS (16-bit) and 166 (32-bit) for the 64 core version, and 144/54 for the 36. Not bad for a chip made on a 90nm TSMC process. The first gen CPU was just certified for 15 simultaneous OFDM channels, the Pros should do much better.
Tilera is going after three main markets with the new parts: networking, multimedia and wireless. The single unit price is $900 for the 64 and $500 for the 36, and prices go down with volume, so you can tell this is not a part for your home widget. It will be used in service provider applications and other big expensive boxes. You really don't need to do 15 OFDM channels in realtime at home.
There are 45 claimed customers for the first generation parts, only two of which Tilera could name. They are Napatech and Top Layer. Napatech has products on the market now, Top Layer will very soon. For a brand new architecture barely a year old, that is not bad at all.
In the end, Tilera has a really cool idea, and unlike most companies making really cool new architectures, they also have real clients. It isn't hard to see how this new paradigm can be quite powerful as long as you can wrap your head around the hardware. Seeing that there are real products out there using the chips, it looks like the tools they offer do the job promised, an encouraging sign. Hopefully, the TilePro will be the second of a long line of architectures that make you scratch your head and wonder. µ