The Inquirer-Home

PA Semi Power chips are full of eastern promise

Fries Turkish Delight
Wed Oct 26 2005, 11:21
PA SEMI set out to make a low power high performance chip by doing the right things for the right reasons, not for fashion. We think it's succeeded in most if not all of its goals.

But before we get into the silicon, let's start out with the carbon side of things, those three industry veterans, Dan Dobberpuhl, Jim Keller and Pete Bannon. They all have resumes that read like a roadmap to ground breaking products, any single one of which would be enough to make you sit up and take notice. PA Semi has three of these people.

Dobberpuhl was at Digital where he led development of such things as the LSI-11 and the µVAX. He then went on to the Alpha and StrongARM, both of which are Intel products now. One of these is still breathing, in fact it's quite a vibrant product, the other is legendary for the opposite reason. From there, he went and founded SiByte which did multi-core shared cache products while most of the industry was wondering why anyone would bother.

Jim Keller was another Alpha guy, and then went on from there to AMD. You may recognise some of his handiwork in the Opteron, AMD64 extensions and Hypertransport. He then headed to SiByte. Pete Bannon is no less storied, and up until recently had Intel Fellow at the top of his resume. He was also an Alpha guy from EV5 to EV7, then went on to Itanium work, but had no SiByte experience.

So, with enough heavyweights on board to choke a horse, what are they going to do? The experience in high end computer CPUs is hard to miss, so logically, you would think they are going to aim for the top. They are, sort of, but not by going after the top of the RISC market, they are going for the top of the embedded market.

The word embedded carries quite a stigma in the x86 world, they tend to look at embedded CPUs as the proverbial redheaded stepchild of the industry. Before you join them, try making a phone call, starting your car or running your dishwasher without one. PA Semi is aiming for the 'processing gigabytes of mission critical data a second' market more than the 'squeaky clean glassware' end of the industry, and it is a fiercely competitive market.

To do battle here, PA Semi is bringing a rather unique PowerPC based CPU to the market. It was designed from the ground up to be a low power and very high performance part, performance per watt was never far from anyone's lips during the design. It started from scratch and pulled in nearly the entire computer onto the die, then added a lot of special sauce and a whole shedload of accelerators.

To start out, PA Semi is a PowerPC architectural licensee, basically the most flexible license you can get from IBM for the architecture. What this means is that IBM did not give it a core layout to tweak and add to, it got a book of specs. From there, it is up to the engineers to do whatever they feel necessary to get a CPU out that follows that spec.

Very few companies have the willpower, much less the brainpower, to do this level of engineering. The majority of them just tweak the designs that are out there, adding a bit here and there, and call it their own. The architectural licence allows utter free rein at the cost of development resources. PA Semi took that challenge and ran with it, coming up with nothing less than a totally unique design, and called it the PWRficient family.

The first chip in the family is called the PA6T-1682M - a dual core PPC running at 2GHz. It has 2MB of L2 cache, 2 DDR2 controllers, and 8 PCIe controllers. For networking, there are 2 10 GbE controllers and 4 GbE controllers, all of which share 24 SERDES lanes with the PCIe. It also has an encryption engine, an iSCSI accelerator, a TCP/IP offload engine, and a RAID XOR engine. It is basically a system on a chip with a bunch of accelerator cards pulled in for good measure. All this uses a mere 13 watts typical.

[Image: The Virgo test chip]

There are also two advanced interconnect setups, one for the CPU and one for the I/O and accelerators. The CPU side is a crossbar setup called CONEXIUM that ties in the CPU cores, caches and memory controllers. The I/O interconnect is called ENVOI, and it connects the PCIe, Ethernet, offload engines and the SERDES lanes.

One of the main enablers of this extremely high level of integration was the decision to fab the chip at 65nm to start out with. In addition to the attention paid to controlling power use on every level, the process also helped on the power front. The overall product is the PWRficient family of chips, the PA6T-1682M is simply the first of many.

[Image: PA6T-1682M block diagram]

Let's look at the parts individually first, there are a lot of good things there. Then we can go on to the interconnects in more detail, and close with the things you can buy, the chips. There are advances at every stage and every combination of stages, nothing was simply done the old way.

The part that is debatably the heart of the chip is a PowerPC core called the PA6T, a full blown 64-bit PPC with an FPU, VMX extensions and hypervisor support. It fully conforms to the PowerPC 2.04 architecture spec, and can operate in both big and little-endian modes.

Going into a little more detail, it is a quad fetch, triple issue, fully out-of-order architecture. The PPC specification allows for a somewhat weak memory ordering scheme, but PA Semi goes one step further and makes it strong all around. They ran simulations and found that strong ordering did not cost them much if any performance, but weak ordering can have a down side.

That down side is if a programmer does not put in the correct barriers to guard against memory problems, ported code may blow up on a slightly different PPC core. When this happens, the customer blames PA Semi, not their own sloppy programmers, and generally makes life hard. The enforced strong memory ordering shuts down this problem before it happens. Instead of a ranting customer that can't port, you get a happy customer and maybe a 1% performance loss. This one looks like a good call, but can lead to premature baldness because the engineers tear their hair out during all of the added verification.

The core has a 64K 2-way associative Icache and another 64K for the Dcache, each of which can sustain 32GBps of read or write bandwidth. The Icache does 4 instructions per fetch so it will consume far less bandwidth than its potential. The core also has a 1024-entry TLB. The pipeline length is a rather nebulous subject, and can change depending on how you define words like 'is' and 'a'. The diagrams in the PA Semi presentation list a maximum 19 stages for some of the more esoteric FPU and VMX instructions, but most end by 16, and int finishes by 14. The number they like to quote, quite rightly, is the branch mispredict latency in clocks. So, if someone asks you, and you don't feel like going into the full debate, tell them 12 stages, otherwise get a cup of coffee and a comfortable chair.

The L2 cache on the PA6T-1682M is 2MB shared among two PA6T PPC cores. The really interesting bit is that the cores are not connected to the cache itself but to the CONEXIUM crossbar. This means it is addressed in a serial fashion, but can also be used as a cache by other parts of the system. It can be an I/O or DMA cache as well as a CPU cache, but each unit needs to wait its turn. Luckily, the interconnect it is on is up to the job, and the cache can pass up to 1G addresses per second, and it is pipelined so it can have multiple things in flight at once. It also can send and receive data in parallel.

This rather unique configuration leads to a bandwidth of 16GBps for reading plus 16GBps for write. I can't think of anything in the x86 world that approaches this kind of architecture, but seeing the brains behind the chip itself, I would assume they did it for a reason.

This all breathes through two DDR2 controllers that can support DDR2-1066 DIMMs with a combined 16GBps of bandwidth. They are again connected to the crossbar, and as such are accessible to any part in any order, just one at a time. The intelligence of the CONEXIUM crossbar, coupled with highly pipelined and deep queues on the controller itself, allows for some fairly high utilization numbers, more than we are used to in the PC world.

In addition to the low power mantra, the second most important thing was latency. They tried to wring latency out of every corner of the system, and seem to have succeeded. If you wonder why this matters so much, look at what an integrated memory controller did for the Opteron. Then go back and read Jim Keller's resume, what a coincidence! Apply this philosophy to just about every part of the CPU and you get an idea of what they are aiming for. AMD is pulling PCIe on board in a year or two, PA Semi will beat them to it by a lot.

The results are quite good, the latency, load to use, is a worst case 110 clocks or 55ns at 2GHz. It only gets better from there with L1 accesses at 4 clocks, L2 at 22, open pages at 90, and a remote L1 hit takes 30. When looking at these numbers, remember that the L2 and remote L1 numbers are across the crossbar, so if you cringed when you first read that, take a deep breath and relax.
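For those who prefer nanoseconds to clocks, the conversion is just cycles divided by frequency. A minimal Python sketch using only the figures quoted above; the dictionary and function names are made up for illustration:

```python
# Load-to-use latencies quoted in the article, in core clocks at 2GHz.
CORE_CLOCK_GHZ = 2.0

LATENCY_CLOCKS = {
    "L1 hit": 4,
    "L2 hit": 22,
    "remote L1 hit": 30,
    "open DRAM page": 90,
    "worst case": 110,
}


def clocks_to_ns(clocks, clock_ghz=CORE_CLOCK_GHZ):
    """Convert a cycle count to nanoseconds: one cycle lasts 1/GHz ns."""
    return clocks / clock_ghz


LATENCY_NS = {name: clocks_to_ns(c) for name, c in LATENCY_CLOCKS.items()}

if __name__ == "__main__":
    for name, ns in LATENCY_NS.items():
        print(f"{name}: {LATENCY_CLOCKS[name]} clocks = {ns:g}ns")
```

On those numbers a trip across the crossbar to the L2 costs 11ns, and the worst case comes out at the 55ns quoted.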

All loads and stores are fetched in-order, can be issued out of order, and are then re-ordered before retire. It all ties together, and everything going over the crossbar is done in the correct order in both directions.

That brings us to the CONEXIUM itself. It is a fully coherent (MOESI) ordered crossbar that connects the two PA6T cores, two memory controllers, the L2 and the IO bridge. Everything on the other side of the bridge is connected by the ENVOI interconnect, hence the need for a bridge. CONEXIUM also has an address bus running at half the core clock, or 1GHz for the current 2GHz core.

The address bus does just what it sounds like, it passes addresses back and forth, and plays a big part in enforcing the strong ordering of the chip. When asked if the bus had sufficient bandwidth to keep up with the crossbar below, PA Semi said it was 'way overkill for a two CPU system with two memory controllers', or to use a more engineering oriented parlance, it is pleasantly sufficient.

The crossbar below that is also pleasantly sufficient. It can be scaled up to go along with future expansion of the PWRficient architecture, and provide every port with a 16 byte wide full duplex connection. The switch itself can handle 1G transactions per second and pass 64GB of data.
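These figures hang together if you do the sums. The Python below is a back-of-the-envelope sketch, and it leans on two assumptions the article does not state outright: that each 16 byte port moves data once per 1GHz clock, and that the 64GB figure is per second.

```python
PORT_WIDTH_BYTES = 16               # per direction, from the article
TRANSFERS_PER_SEC = 1e9             # ASSUMPTION: one transfer per 1GHz clock
SWITCH_TRANSACTIONS_PER_SEC = 1e9   # from the article
SWITCH_BYTES_PER_SEC = 64e9         # ASSUMPTION: the 64GB figure is per second

# Bandwidth of one port in one direction, in GB per second.
per_port_gb_per_sec = PORT_WIDTH_BYTES * TRANSFERS_PER_SEC / 1e9

# Average payload per transaction if the switch runs flat out.
bytes_per_transaction = SWITCH_BYTES_PER_SEC / SWITCH_TRANSACTIONS_PER_SEC

if __name__ == "__main__":
    print(per_port_gb_per_sec, bytes_per_transaction)
```

Under those assumptions each port gets 16GB per second each way, which lines up with the 16GBps read plus 16GBps write quoted for the L2, and the switch moves 64 bytes per transaction.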

So the whole CONEXIUM package is the crossbar and the address bus and it can be enlarged or reduced as units are added and subtracted to the core. There are three things that can initiate a transaction across the interchange, the two cores and the IO bridge, presumably that will scale as you add cores. All devices can respond of course, it would be rather pointless if they could not.

What you end up with is a horrendously fast interchange that can fling data around from point to point without worrying much about stepping on someone else's proverbial toes. It is smart enough to arbitrate multiple requests for the same resource, and keeps the house in order, strongly.

The other half of the PA6T-1682M are the bits hanging off the other side of the bridge. They are the accelerators, I/O cache, DMA offload, PCIe controllers, 10GbE, GbE and the SERDES. The interconnect is called ENVOI, and they have dubbed it an 'intelligent' IO system. It needs to have a lot more abilities than the CONEXIUM switch because of the vastly more disparate collection of clients and their specific needs.

The first class of devices that hang off ENVOI are the offload engines. These are probably the most interesting part of the chip to me, everyone is talking about doing this soon, PA Semi is there now. If you have any doubts about the foresight of the PA Semi team, compare the list of accelerators to any other company's roadmaps. A year or two is an eternity in this business.

The first one is an encryption engine. It can do 3DES, AES, ARC4 and Kasumi (f8) for bulk encryption, MD5, SHA-1, SHA-256 and Kasumi (f9) for signatures. Toss in packet level encryption like IPSEC and SSL, and you have a fairly complete crypto setup. The PA6T-1682M is not only good at slinging data around, but it slings it around privately. It can sustain 10Gbps of bulk encryption and potentially do 3,000 public-key handshakes a second with support from the PA6Ts, the PPC VMX extensions help out here in no small way.

The next engine is a checksum accelerator. It can do generic CRC type checksums or more sophisticated ones like those found in TCP/IP. With a similar accelerator, it can do XOR calculations for RAID in hardware. Between the two engines, they can speed up TCP/IP and RAID processing. If you use them both, you are most of the way to iSCSI acceleration, and PA Semi gets you all the way there. These engines together mean you can do TCP/IP at wire speeds or iSCSI with low processor utilization.
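For anyone who has not met them, both jobs are simple but relentless, which is exactly why they are worth handing to hardware. The Python below only illustrates the arithmetic and is not PA Semi's implementation: the function names are invented, the checksum is the standard 16-bit ones' complement Internet checksum used by TCP/IP, and the parity is plain RAID-5 style XOR.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as used in IP, TCP and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def xor_parity(blocks):
    """XOR equal-length blocks together, byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


if __name__ == "__main__":
    stripes = [b"PWRf", b"icie", b"nt!!"]
    parity = xor_parity(stripes)
    # Lose the middle block, then rebuild it from the survivors plus parity.
    rebuilt = xor_parity([stripes[0], stripes[2], parity])
    print(rebuilt == stripes[1])
```

Every byte of every packet or stripe goes through loops like these, so doing them in dedicated logic frees the PA6T cores for real work.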

The next block is a DMA offloader. It does just what it says, it is specialized hardware to handle DMA. It allows you to go from one block to another, or to memory without CPU intervention. You can go IO to IO, Memory to Memory, or any combination thereof effectively for 'free' as far as CPU use is concerned. There are 64 receive channels, 20 transmit, and a 24KB buffer. It can pass 32GB/sec from all targets, PCIe, Ethernet and the offload engines, basically the entire chip.

Moving along, we get to another block that does exactly what it sounds like, the IO cache. It is basically a 128 line cache for the ENVOI system. It can write combine, supports prefetch and descriptor caching.

Moving along, we get to the PCIe controllers. There are 8 engines on the block, and they can host 1 to 16 lanes each, limited only by the number of SERDES lanes you want to use. They support host or endpoint partitioning so you can use the PA6T-1682M as either a CPU or as a coprocessor on an add-in card, it can be either end. There are two virtual channels and two priority levels supported.

There are two separate Ethernet controllers, one for 10GbE and one for GbE. The 10GbE controller supports two separate links which talk to the world through a XAUI interface. The GbE controller has four links and talks to the world through an SGMII interface. If you configure the PA6T-1682M for full network bandwidth, you can get 24Gbps out of it either over IPv4 or IPv6. Couple this with the offload engines, and you have a potential monster network box on your hands.

Probably the most interesting part of all this is the SERDES that is the real link to the outside world. SERDES stands for Serializer Deserializer, and it does just that. The one on the current chip has 24 lanes, and you can configure them any way you would like with PCIe, 10GbE or GbE, or mix and match them any way you would like. The GbE links take one SERDES lane, 10GbE takes 4, and PCIe takes one per 'x'.

This means you could have two 10GbE links and 16 PCIe lanes, or four GbE links and 20 PCIe lanes. How you divide the PCIe up is also totally granular, set it up the way you need. The only limit is the 24 SERDES lanes, but it isn't all that harsh a cap. Since the lanes can be set either in the bootROM or the BIOS, you can either set it up for the device it is in, or have it figure out the best settings based on its environment. During PCI discovery, the PA6T-1682M can sense that it needs 8 PCIe and 8 XAUI lanes, or some other combination, and then set itself. This may add a little time to the boot process, but it allows you to make cards out of the chip that go into varied environments with little or no handholding. Flexibility and clever BIOS coding pay off handsomely here.
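The lane budgeting is easy to check. A quick Python sketch using the per-link lane costs quoted above; the function name and dictionary layout are made up for illustration and are not PA Semi's configuration interface:

```python
TOTAL_SERDES_LANES = 24  # PA6T-1682M, per the article

# SERDES lanes consumed per link, per the article (PCIe: one per 'x').
LANES_PER_LINK = {"GbE": 1, "10GbE": 4, "PCIe": 1}

# Ethernet link limits on this chip, per the article.
MAX_LINKS = {"GbE": 4, "10GbE": 2}


def lanes_used(gbe_links=0, ten_gbe_links=0, pcie_lanes=0):
    """Return the SERDES lanes a configuration needs, or raise if invalid."""
    if min(gbe_links, ten_gbe_links, pcie_lanes) < 0:
        raise ValueError("link and lane counts cannot be negative")
    if gbe_links > MAX_LINKS["GbE"] or ten_gbe_links > MAX_LINKS["10GbE"]:
        raise ValueError("more Ethernet links than the chip provides")
    used = (gbe_links * LANES_PER_LINK["GbE"]
            + ten_gbe_links * LANES_PER_LINK["10GbE"]
            + pcie_lanes * LANES_PER_LINK["PCIe"])
    if used > TOTAL_SERDES_LANES:
        raise ValueError(
            f"{used} lanes requested, only {TOTAL_SERDES_LANES} available")
    return used


if __name__ == "__main__":
    print(lanes_used(ten_gbe_links=2, pcie_lanes=16))
    print(lanes_used(gbe_links=4, pcie_lanes=20))
```

Both examples in the text use exactly 24 lanes. Lighting up every Ethernet link, four GbE plus two 10GbE, takes 12 lanes and leaves the other 12 for PCIe.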

There is also a list of miscellaneous other bits that fall under the platform IO heading. These include a power controller, system controller and interrupt/GPIO controller all tied into the CONEXIUM crossbar. There is a boot bus, SMBus and UART controller also tied into the ENVOI side of things. Together they allow the machine to boot and perform basic housekeeping duties not really related to core functionality.

To tie all of this together, you have the ENVOI IO system itself. ENVOI provides a single centralized DMA model and arbitrates all the requests from all end points. It can allocate bandwidth by controlling per-channel buffer depth, modifying how arbitration and prefetching are handled, and also do some global address translation.

There are also some very specific and persnickety ordering rules for doing some protocol transfers. They are both timing and order related. TCP/IP is really loose about both, as long as you break up packets correctly. PCIe is probably a lot less so, especially if you are pushing sound or video data. A hiccup in either has some very noticeable results.

The reason ENVOI has to be 'smarter' than CONEXIUM is that these ordering issues all come into play at once. You can have some 'loose' and some 'sensitive' streams all vying for their shot at the wires. ENVOI has to decide who wins and who loses without violating any of the overall rules of the system. This level of attention is not needed on the CONEXIUM side. One PA Semi person, Mark Hayter, described the ordering rules as 'scary', and I agree, it hurts to think about all the potential cases, much less to test them.

There are two other pieces to take note of, the Transaction Trace Memory (TTM) and Peripheral Trace Memory (PTM). They are both debug features, the TTM keeps an eye on addresses while the PTM watches peripherals. Both are notoriously hard to debug at the high speeds supported by the PA6T-1682M, and without the two debuggers, life would be fairly miserable.

The TTM keeps a log of the address transactions used on the system but holds no raw data. The PTM does the same for almost all peripherals or allows you to trace raw SERDES data. Each one has 16K of memory to store information when things go the 'undocumented feature' route.

All of these parts are designed with a single overriding goal, low power. As we stated earlier, the first chip of the family running at 2GHz takes about 13W while processing 10Gbps of data, and never breaks 25W of power draw. A lot of this was done by designing the chip with power in mind before a mouse was laid to EDA software.

If you remember the breakthrough of the Intel PM CPUs, they turned off unused blocks to save power, and boy did it work. The next generation cores, the Merom family, take that to the next level and turn on bits only when needed. PA Semi looks at these crude tools and laughs. They have a very fine grained clock, and have over 15K gated clock domains on the entire die. When you have this level of control, blocks are, well, blocky.

On a macro level, they can vary the voltage between 0.6 and 1.1V, most current CPUs do this as well. The power controller can also vary the CPU frequency on the fly, but that is almost a given for a modern CPU. The memory controller also takes full advantage of the power savings built into DDR2, and can do a lot of things with memory to save yet more power. Similarly, unused IO lanes can be turned off to save power, and if the external device supports a method of power savings, the PA6T-1682M will most likely do so also.

On a much more macro level, there are separate power rails for the cores, IO and PADs. Since this is a product aimed at the embedded space, sometimes things like area are more important than power use. If this is true, you can run the entire chip off one power supply and suck down a little extra wattage. If you can spare a few extra mm, the power savings are there for the taking. Either way, like most of the chip, the final decisions are up to you.

Each of the pieces above forms a chunk of the PWRficient architecture. You have the cores, the accelerators, two different interconnects, memory controllers and cache. Each of these things can be varied, and since the architecture was designed with modularity in mind, you can make an awful lot of configurations in a short period of time. PA Semi claims they can tape out a part in as little as three months, the time saved on the back end was made possible through smart choices up front. The pieces were made modular, and the interconnects were either scalable or powerful enough to accept the largest possible configuration from day one.

On the core side, the architecture supports 1, 2, 4 or 8 PA6Ts and 1, 2, 4 or 8MB of L2 cache. You can also have 1, 2 or 4 memory controllers, but in a blatant flouting of symmetry, it will not allow for 8. The IO side is similarly flexible, you can have 4, 12, 24 or 32 SERDES lanes supporting PCIe, SGMII, and XAUI in just about any configuration you want. So I can add more numbers to an already overloaded paragraph, there are chips planned at 1, 1.5, 2 and 2.5GHz.

[Image: PA Semi star diagram]

This leaves you with so many options it is hard to count, and the short tape out period suddenly makes a lot more sense. If done in the traditional fashion, there is no way you could get more than one or two variants out in any sane amount of time. PA Semi hopes to get lots of variants out on the market, and is open to custom parts if the volume is there.
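'Hard to count' sounds like a challenge, so here is a rough count in Python. It assumes, purely for illustration, that every option listed above can be combined freely with every other, which PA Semi has not said; the real catalogue will be far smaller.

```python
from itertools import product

# Options quoted in the article for the PWRficient architecture.
CORES = (1, 2, 4, 8)
L2_MB = (1, 2, 4, 8)
MEMORY_CONTROLLERS = (1, 2, 4)
SERDES_LANES = (4, 12, 24, 32)
CLOCK_GHZ = (1.0, 1.5, 2.0, 2.5)

# ASSUMPTION: all five choices are independent of one another.
ALL_CONFIGS = list(product(CORES, L2_MB, MEMORY_CONTROLLERS,
                           SERDES_LANES, CLOCK_GHZ))

if __name__ == "__main__":
    print(len(ALL_CONFIGS))
```

That is 4 x 4 x 3 x 4 x 4, or 768 combinations, before you even start dividing up the SERDES lanes. The PA6T-1682M (two cores, 2MB of L2, two memory controllers, 24 lanes, 2GHz) is just one point in that space.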

The PA6T-1682M, the first of the PWRficient line, occupies the middle tier of the market. The family is defined by three sockets, Entry level 1-2 cores, Mid-Tier at 1-4 cores and High End with 4 or 8 cores. The reason for the overlap is the plethora of other things that can come into play, most notably the memory controllers and the SERDES lanes. Each one needs a set number of pins, and the rest is taken up by power and ground. The mid-tier socket has 1156 pins, but since it is an embedded product, that is not quite the same thing you are used to in an x86 CPU. It is more of a pin count and pin location layout than something you plug a chip in to by hand.

So, what does all this get you for your 13W typical, 25W max? The chips are quoted as being >1000 SPECint per core and >2000 SPECfp per core at 2GHz. This is a huge number, high end x86 cores score about 2000/2000 but consume around 10x the power. Add in all the other functionality and low board space requirements due to the integration, and you have a performance per watt monster.

The integration also pays off when you are looking to build something around a PWRficient CPU. In an example setup, you only need a handful of chips to make a system. You can get by with a USB controller, Ethernet Phys, a buffer for IDE drives and another for a CF socket. Add a boot flash chip, a bridge for PCI-X and a few DIMMs and you are done.

[Image: PA Semi example board layout]

The more you can pull into the chip, the less that goes on the board. This means easier design, lower component costs, and quicker time to market. If you want to go to higher socket counts, you can do that also, but in a sort of indirect fashion. The PA6T-1682M is only a single socket product, so if you slap 4 on a board, there will be no coherency between the chips.

This means it can be done electrically, but logically, it is a free for all. With clever PCIe memory mapping, you can fairly effectively make the chips not step on each other's toes much. You get the rest of the way to cooperation in software, but it will be added work. Then again, for an 8 core 16K SPECfp computer, it isn't all that bad of an investment.
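The 16K figure is simple multiplication, sketched in Python below. It treats per-core SPEC scores as if they add up across cores and chips, which is this article's shorthand rather than how SPEC results are actually reported, and the four-chip board is the hypothetical one described above.

```python
CHIPS_PER_BOARD = 4           # the hypothetical four-chip board
CORES_PER_CHIP = 2            # PA6T-1682M
SPECFP_PER_CORE = 2000        # quoted lower bound at 2GHz
TYPICAL_WATTS_PER_CHIP = 13   # quoted typical power
MAX_WATTS_PER_CHIP = 25       # quoted maximum power

total_cores = CHIPS_PER_BOARD * CORES_PER_CHIP

# ASSUMPTION: scores are summed naively across cores.
aggregate_specfp = total_cores * SPECFP_PER_CORE

# CPU power only; DIMMs and the rest of the board are not counted.
typical_board_watts = CHIPS_PER_BOARD * TYPICAL_WATTS_PER_CHIP
max_board_watts = CHIPS_PER_BOARD * MAX_WATTS_PER_CHIP

if __name__ == "__main__":
    print(total_cores, aggregate_specfp,
          typical_board_watts, max_board_watts)
```

That is 8 cores and a nominal 16,000 SPECfp from 52W typical, 100W flat out, for the four chips alone.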

So, in a nutshell, you have a lot of possibilities and a single product a few months out. The PA6T-1682M is a beachhead to a new architecture that potentially spans hundreds of products, but realistically PA Semi will only make a handful. The engineers who set the ground rules early on focused on power, modularity and latency, nothing new, nothing paradigm shifting, but may have ended up achieving both of those things. They can mix and match from a large toolbox in a short time, and end up with a fairly high performance part that sips power. This was all done not by reinventing the wheel, there is no need for that, they just implemented it right from the start.

When the first chips hit the market late next year, we will see if all the claims are true. Judging from the track record of the people involved, it is probably a safe bet that it will work out the way they claim. The only down side is that because the PA6T-1682M is aimed at the embedded market, you probably will never see one or know you are using it.

Time will tell if the PWRficient line will do well, the market has a rather strange way of deciding things like this. In the meantime, we will be watching the technical developments at PA Semi closely, they seem to be off to a flying start. µ
