JUST OVER A YEAR ago, Tilera launched its 64-core chip called the Tile64. Now the firm is putting out its successor, called the TilePro64, and that too has 64 cores, but everything else seems revamped.
Before you go rushing out to buy a few, note that this is not an x86 chip. It is aimed at high-performance embedded markets like high-bandwidth network twiddling and multi-stream video applications. You probably can't buy one directly, and the things that use them tend not to advertise the fact.
The chip itself has a rather unusual architecture, but the overall specs sound quite reasonable. There are two TilePros, one with 64 cores and another with 36. The 64 comes at 700 or 866MHz with 5.6MB of cache while consuming 19-23W. The little brother runs at 500MHz, has 3.2MB of cache and sucks 10-16W.
This may seem pretty normal, and so far it is. Even the four DDR2 controllers, three for the 36, don't raise any red flags. There are two XAUI ports and two GbE ports, along with two PCIe ports on the bigger chip; the smaller loses an XAUI and a PCIe. XAUI is the one thing that does raise a flag or two, since it is a telecom-oriented interface, not something you would see in a Xeon.
The core itself doesn't have much to set off alarm bells from an overview standpoint. It is RISCy, similar to MIPS, but uses its own ISA, and it looks eerily similar to the Tile64 core. There are a few major changes to the core this time around: multimedia instructions, offset load/store, better memcopy, and unaligned loads.
The core looks like this
The multimedia instructions are aimed at audio and video codecs, no surprise given the target markets. Offset loads and stores make video applications go a whole lot smoother, and the memory access and copy instructions help here too. Unaligned loads do what they say: they make sure things that do not come in on cache line boundaries don't destroy performance. Networking apps do a lot of this, since packet sizes are typically all over the map.
Moving off core a bit, there are several other improvements to the uncore. The biggest one is that the TilePro64 can now stripe DDR for more bandwidth if needed, but this is user controllable. DDR2 utilization is up a claimed 30 per cent from the older non-Pro version, and memory loading is down as well. You can also now move I/O calls to userspace for lower overhead in VM situations, and copy from I/O directly to the cache of a specific tile coherently.
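To picture what striping buys you, here is a rough sketch of how addresses might map to the four DDR2 controllers with and without it. The stripe size, the total memory size and the function name are assumptions made up for illustration; Tilera did not give those details.

```python
NUM_CONTROLLERS = 4    # the TilePro64 has four DDR2 controllers
STRIPE_BYTES = 4096    # assumed stripe size; the real value is not given
MEMORY_BYTES = 4 * 2**30  # assumed total memory, for illustration only

def controller_for(address, striped=True):
    """Return which DDR2 controller serves a physical address (hypothetical model)."""
    if striped:
        # Consecutive stripes rotate round-robin across the controllers.
        return (address // STRIPE_BYTES) % NUM_CONTROLLERS
    # Unstriped: each controller owns one contiguous quarter of memory.
    return address // (MEMORY_BYTES // NUM_CONTROLLERS)

# A 16KB buffer at address 0, touched one stripe at a time.
buffer_addresses = range(0, 4 * STRIPE_BYTES, STRIPE_BYTES)
striped_hits = {controller_for(a) for a in buffer_addresses}
unstriped_hits = {controller_for(a, striped=False) for a in buffer_addresses}
```

In this toy model the striped buffer spreads over all four controllers while the unstriped one hammers a single controller, which is where the extra bandwidth would come from.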
So far, so normal, but the TilePro is anything but. The magic is in the meshes that connect the tiles. Note the plural: there are now six meshes, up from five in the non-Pro. Each has its own function, so one type of traffic will not step on another.
Note the meshes
The new mesh is dedicated to cache coherency, and lets Tilera do some interesting things. There are instructions that allow you to add two numbers and post the result directly to an adjacent core in a single clock, something you couldn't dream about in a standard multi-core configuration.
Moving between cores takes one hop per core, and all routing is deterministic. If you need to go two hops away, it will take you two clocks. Six cores, six hops, six clocks, you get the idea. The only problem with this is that you have the potential for variable latencies if you are not careful about what code you run where.
To help with this, Tilera has something it calls DDC, or Dynamic Distributed Cache. Each core has an L1 and L2 cache, but it sees all the other cores on the die as a single L3 cache. Instead of going to memory, it checks to see if there is anything resident on chip first. Now you see why the coherency mesh is a really good thing.
Two features help with this: coherence domains and hash for home. Coherence domains are pretty simple. You can carve out a block of cores and assign them to be coherent with each other, so instead of one big L3, cores 32-47 share their cache, and 0-15 do as well.
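As a rough sketch of that hop arithmetic, assume the 64 tiles sit in an 8x8 grid numbered row by row, and that traffic travels along X first and then Y. Both are assumptions for illustration, not details Tilera supplied, and the function names are made up.

```python
GRID_WIDTH = 8  # assumed: 64 tiles laid out as an 8x8 grid

def tile_xy(tile_id, width=GRID_WIDTH):
    """Map a tile number to (x, y), assuming row-by-row numbering."""
    return tile_id % width, tile_id // width

def hop_count(src, dst, width=GRID_WIDTH):
    """Hops between two tiles when routing along X, then along Y."""
    sx, sy = tile_xy(src, width)
    dx, dy = tile_xy(dst, width)
    return abs(sx - dx) + abs(sy - dy)

def mesh_latency_clocks(src, dst, clocks_per_hop=1):
    """One clock per hop, as described above."""
    return hop_count(src, dst) * clocks_per_hop
```

Under these assumptions, neighbouring tiles 0 and 1 are one clock apart, while opposite corners 0 and 63 are 14 clocks apart, which is the variable latency you have to watch when deciding what code runs where.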
Cache coherency domains
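A toy model of coherence domains, using the tile ranges from the example above. The domain names, the table layout and the function names are all hypothetical, and this says nothing about how the hardware actually stores the assignment.

```python
# Hypothetical domain table: domain name -> tile numbers in that domain.
DOMAINS = {
    "A": range(0, 16),   # tiles 0-15
    "B": range(32, 48),  # tiles 32-47
}

def domain_of(tile_id, domains=DOMAINS):
    """Return the name of the domain a tile belongs to, or None."""
    for name, tiles in domains.items():
        if tile_id in tiles:
            return name
    return None

def share_cache(tile_a, tile_b, domains=DOMAINS):
    """True only if both tiles are assigned to the same domain."""
    domain_a = domain_of(tile_a, domains)
    return domain_a is not None and domain_a == domain_of(tile_b, domains)
```

So tiles 3 and 12 would pool their caches in this model, while tiles 3 and 40 would not.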
Hashing for home is a little more sophisticated, and solves the problem of one core having a critical piece of data that all the other cores need. What it does is hash the data and assign it to a core based on that hash value. This distributes the load, in theory equally, across all the cores in the coherency domain.
So, with this rather odd architecture, you might think that you need a PhD to program it. Tilera saw this little problem coming and built a lot of tools to make up for it. Everything is written in C/C++ using Eclipse-based tools. The tool set is called MDE 2.0, for Multicore Development Environment.
You can either write your code to the bare metal or use one of two Linux distros. The first is nothing special, the same old penguin-oriented code you know and love. The second is halfway between Linux and bare metal. It is called Zero Overhead Linux, and it tries very hard to stay out of the way. What it does is go more or less to sleep and guarantee no interrupts until you call it. Quite an interesting idea to have an OS there only when you really need it, and nothing when you don't.
MDE 2.0 is supposed to be compatible with anything written in older MDEs, and the new TilePro64s are socket compatible with the older Tile64s. They will even run the same compiled code without modification, making them a drop-in upgrade if you need a bit more speed from the crusty nine-month-old cable head end box in the closet.
In terms of performance, Tilera is claiming 221 BOPS (16-bit) and 166 BOPS (32-bit) for the 64-core version, and 144 and 54 respectively for the 36. Not bad for a chip made on a 90nm TSMC process. The first gen CPU was just certified for 15 simultaneous OFDM channels, and the Pros should do much better.
Tilera is going after three main markets with the new parts: networking, multimedia and wireless. With a single-unit price of $900 for the 64 and $500 for the 36, going down with volume, you can tell this is not a part for your home widget. It will be used in service provider applications and other big expensive boxes. You really don't need to do 15 OFDM channels in realtime at home.
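A minimal sketch of the idea, not Tilera's actual hash, which was not described. It assumes 64-byte cache lines and stands in a plain modulo on the line number for the real hash; the names are invented for the example.

```python
CACHE_LINE_BYTES = 64  # assumed line size, for illustration only

def home_tile(address, domain_tiles, line_bytes=CACHE_LINE_BYTES):
    """Pick the tile that acts as 'home' for the cache line holding address.

    Stand-in hash: cache line number modulo the number of tiles in the domain.
    """
    line_number = address // line_bytes
    return domain_tiles[line_number % len(domain_tiles)]

domain = list(range(32, 48))  # the 16-tile domain from the example above

# Count how many of 1024 consecutive cache lines land on each tile.
counts = {tile: 0 for tile in domain}
for line in range(1024):
    counts[home_tile(line * CACHE_LINE_BYTES, domain)] += 1
```

With this stand-in hash every tile in the domain ends up homing 64 of the 1024 lines, so no single core becomes the hot spot that everyone else queues up behind.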
There are 45 claimed customers for the first generation parts, only two of which Tilera could name. They are Napatech and Top Layer. Napatech has products on the market now, Top Layer will very soon. For a brand new architecture barely a year old, that is not bad at all.
In the end, Tilera has a really cool idea, and unlike most companies making really cool new architectures, they also have real clients. It isn't hard to see how this new paradigm can be quite powerful as long as you can wrap your head around the hardware. Seeing that there are real products out there using the chips, it looks like the tools they offer do the job promised, an encouraging sign. Hopefully, the TilePro will be the second of a long line of architectures that make you scratch your head and wonder. µ
That's some really neat stuff. I can see this being really great for simulations - hell of a lot faster than FPGAs. You could gang processors to create variable width vectors or set up sets of processors as virtual domains... very cool.
Once again Charles D. has written an interesting & fact-filled article. So moving Charles up to Msr. Charles, in line with Msr. Mike & Msr. Hale.

How'd I explain it? Simple. Say the computer was an LCD, & you could have an ordinary analog switch: no probs, but wild & crazy resistance variations, surges & heat. Its oscilloscope trace might be: /----____/\. Well, just something simple up & down, can't seem to find the keys for a dia.
However, a computer would turn the light on digitally, with thousands of connection steps, each independent, so there's preparation for fluctuation of the lamp, timing, dimmer, and less heat, both in preparing on & off mode & in power routing. So more cores means more preparation & more stability throughout. Each addition of cores & ability to respond to software is one more step in maintaining order & quality. Complexity is its most important part.
drashek
neat, but you can buy neither the chip nor the processor board, what's the point?
Hmm,

www.intellasys.net

An old idea, with new designs in the works, but it does 24 cores at full speed (can't say exactly, but similar) for the same power as one full-speed Tile core uses on average. Of course it is pretty lightweight in processing power, but it has potential with some minor modifications that have been suggested on their forums, and the comp.lang.forth Usenet forums.

Apparently 100 chips with 24 cores or so each will fit nicely on a little board, let alone 100 cores on a small chip. Pricing, don't know: $20? Dollars or cents in enough quantity?

At the moment it is aimed more at the lowest embedded micro-controller and FPGA markets, apparently with somebody already doing video and audio codec stuff in the background using the existing designs (whatever they are).

However, I think I read about Ambarella using an array of processors for their video codec, which uses much less power. But if GPUs can already perform complex processing and CPU functionality, why are we bothering? Why not just make the smallest ATI/Nvidia chip into a micro-controller (of course it is not the most energy efficient)?