This is the first part of a series of three articles getting to the nub of Nvidia's graphics chip woes. The series is the result of months of research conducted by diligent INQhack Charlie Demerjian, despite an in-box stuffed full of abuse. Part two can be found here and Part Three is here.
NVIDIA HAS RECENTLY been saying a lot about how its chips are not bad, and giving people reasons why the problem is contained. Unfortunately, these disingenuous half-truths don't stand up to an explanation of why this problem is happening.
The problem is extremely complex and defies a simple explanation. It involves multiple poor choices, multiple engineering failures, and likely a few bad accounting choices. This piece could also have been entitled: "More than you ever wanted to know about bumping, and then some: How not to do things". But we will simplify the science and technical details as much as possible to make it accessible, so some things may be oversimplified.
The defective parts appear to make up the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective, it is simply the failure rates of each line, with field reports on specific parts hitting up to 40 per cent early life failures. This is obviously not acceptable.
The end result of the failures is that bumps crack between the bump and the substrate on a chip, not on the bump to die side. When this happens to a signal bump, game over for the GPU or MCP. What is a bump, die and substrate? Why is it happening? That is a long and technical story.
A Via CN chip, note the die in the centre
First, let's start out with some terminology, illustrated here by the lovely and talented Via CN/Nano chip. As you can see, the total package is about the same area as a US quarter. The most important part is the black square at the centre, that is the die, or the silicon chip itself. The green fibreglass-like part around it is the substrate, a complex multi-layered organic material that routes signals from the pads on top to the pins on the bottom, and serves as an attachment point for the die and various passive components. Those are the little silver things around the edges.
The die itself looks a little rough around the edges, but in reality it is very very angular. It has four corners at 90 degree angles, this one being almost square. Some, like the Intel Atom for example, are much more rectangular. The blurry edges are due to a material called the underfill, it looks like glue seeping from the edges, and serves as mechanical support for the die to substrate bonds and a moisture barrier to protect the bumps.
Via CN about as thick as a quarter
The parts you don't see are the bumps, and they are the most critical part. This type of packaging is called flip-chip because the connectors between the die and the substrate are put on the bottom of the die, and it is flipped over onto the package. The connectors are called bumps, and they are literally little balls of solder. A typical chip that is a little more than a centimetre on a side might have over 1000 bumps on it, so spacing is incredibly small and tolerances amazingly tight.
As you can see, the package is about the same height as a quarter as well, so the vertical tolerances are also pretty slim. The bumps act like pins on a normal chip, they carry signals, power and ground to and from the die. They also are the primary attachment mechanism of the die to the substrate. The precision needed to put these things together should not be underestimated.
Those are the biggest players in our little drama, now let's move on to some basic physics and related science. Chips consume power, and in return they give you heat and a few electrons in the right places, occasionally they also give you a flash of light and smoke as well, but few chips do that twice. Heat is not an intended product, it is a consequence, and has to be carried away or bad things happen.
Modern chips consume electricity in an uneven manner, as different parts of the chip use power at different rates. Sometimes parts of the chip are never used at all for a given workload. If you have a modern GPU and don't game or are smart enough to not run Vista, you will likely never touch the transistors that do all the 3D work. Think about it this way, there are hot spots on the chip as well as cold spots, it is uneven and changing constantly.
A typical IR photograph of a multi-core CPU
Related to this is the fact that the chip uses electricity in a non-uniform manner. Parts that are heavily used pull much more current than idle parts, and once again, those parts change over time. Some bumps may pull a lot of Amps, others may pull very few, and this again changes over time and use. The bumps also have a limited current capacity each, too much and they melt or burn out, so there are far more than are strictly needed to supply the chip with power.
The idea is to make sure no one bump will ever reach the maximum current it can handle. This is done by putting in more power bumps on the die in places that use high power than are needed from an average current point of view. If things are done right, no single bump will ever exceed the maximum current it can deliver.
The defective Nvidia chips use a type of bump called high lead, and the company is now transitioning to a type called eutectic, see here and here. Eutectic materials have two important properties, they have a low melting point, and all components crystallize at the same temperature. This means they are easier to work with, and form a good solid bond. Eutectic bumps may have lead in them, or they may not, some are gold/tin, others are lead based, it depends on what characteristics you want, and how much you want to pay. It is a property, not a formula.
Most if not all substrates use eutectic pads to attach the bumps to as well. If you use a eutectic pad with a eutectic bump, you get a much better connection than you do if you use a high-lead bump with a eutectic pad. This is reflected in much higher yields, lower assembly costs, and a physically stronger connection as well. At this time, we have no good explanation as to why Nvidia chose to go the high-lead bump on eutectic pad route.
High-lead bumps have a much higher current capacity than eutectic bumps. When power is run through eutectic bumps, you also get an effect called electromigration. This means that some of the materials are essentially pushed around by the current, and you get voids in the bump. These voids lessen the capacity of the bump, and eventually they burn out.
The more current you run through a eutectic bump, the quicker the electromigration. If you keep the current to a reasonable level, the time it takes for this to happen will be so long it isn't worth worrying about. This is why chip vendors say that upping the voltage will shorten the lifespan of parts, it literally does cause them to burn out quicker.
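The link between current and electromigration lifetime is often approximated with Black's equation, in which the median time to failure falls off as a power of current density and rises exponentially as temperature drops. A minimal sketch, assuming a current exponent of 2 and an activation energy of 0.8 eV; both are illustrative placeholders, not measured values for any Nvidia part or any particular solder, and the function name is ours:

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV per kelvin


def relative_lifetime(current_ratio, temp_c, ref_temp_c,
                      exponent=2.0, activation_ev=0.8):
    """Lifetime relative to a reference operating point, per Black's equation.

    MTTF is taken as proportional to J**(-n) * exp(Ea / (k * T)).
    current_ratio is the new current divided by the reference current.
    exponent and activation_ev are assumed, illustrative values.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    current_term = current_ratio ** (-exponent)
    thermal_term = math.exp(
        activation_ev / BOLTZMANN_EV_PER_K * (1.0 / temp_k - 1.0 / ref_k)
    )
    return current_term * thermal_term


# Doubling the current through a bump at an unchanged 85 degrees C.
print(relative_lifetime(2.0, 85, 85))
```

Under these assumptions, doubling the current at the same temperature cuts the expected lifetime to a quarter, and running hotter shortens it further.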
On the good side, eutectic bumps are generally more flexible than high lead. This means they are a bit more forgiving to stress. Some forces that would fracture a high-lead bump may be absorbed by a eutectic one without problems.
Bumps overall are a multi-dimensional trade-off between cost, assembly yield, current capacity and mechanical resilience among other things. To call it a complex mess is being overly kind, package engineering is not for the faint of heart.
FROM BUMP properties, we move on to thermal expansion of materials, and that is another piece to the puzzle. Most materials expand as they warm up. If you have ever seen a mechanic trying to free a stuck bolt, they usually heat the nut with a blowtorch, this expands the nut and loosens it. The same thing happens with the die and substrate. When you turn on a chip, it heats and expands a little. This expansion is not much, but it is measurable. The substrate also heats and expands.
The problem is that the die gets hot, and heats the substrate secondarily. The silicon on the die has one rate of thermal expansion, the substrate has another, basically they get bigger at different rates. To complicate things further, remember the uneven and changing heating bit above? Parts of the die heat up and expand differently from other parts of the die. This changes quite quickly while things are in use.
The result? The bumps take a lot of stress, and it changes from second to second. This can be very accurately simulated, and you can engineer bump placement at points of lower thermal expansion and therefore lower stress. If you lose a power bump here and there, the chip will very likely survive, but if you lose a signal bump, game over. This is why bump placement is very important.
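To get a feel for the size of the expansion mismatch behind that stress, the growth of each material can be estimated as length times expansion coefficient times temperature rise. A rough sketch, assuming coefficients of about 2.6 ppm per degree C for silicon and 17 ppm per degree C for an organic substrate, a 10mm span, and a 60 degree rise in both parts; these are illustrative figures, not data for any specific chip, and the function names are ours:

```python
def expansion_um(length_mm, cte_ppm_per_c, rise_c):
    """Linear thermal expansion in micrometres: dL = L * alpha * dT."""
    length_um = length_mm * 1000.0
    return length_um * cte_ppm_per_c * 1e-6 * rise_c


def mismatch_um(length_mm, die_rise_c, substrate_rise_c,
                die_cte=2.6, substrate_cte=17.0):
    """Difference in growth between substrate and die over the same span.

    The default coefficients (ppm per degree C) are assumed, illustrative
    values for silicon and an organic substrate.
    """
    die = expansion_um(length_mm, die_cte, die_rise_c)
    substrate = expansion_um(length_mm, substrate_cte, substrate_rise_c)
    return substrate - die


# Assumed case: 10mm span, both parts 60 degrees C above room temperature.
print(f"{mismatch_um(10, 60, 60):.2f} um")
```

With these assumed numbers the substrate outgrows the die by roughly 8.6 microns across the span, and the bumps sitting between the two have to absorb that movement every time the chip heats and cools.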
Engineering what bumps go where is a very complex process, and is done basically when the chip is laid out, near the end of the development process. You don't do it on a whim, you don't make pretty patterns because they are cool, you do it scientifically to minimise the potential for damage.
Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending, and then it cools and bends back. Eventually this thermal cycling kills chips.
Once again, if you did your engineering right, this won't happen in any timeframe that matters to mere humans. If it takes ten years of constant on and off switching to make it happen, once a day power cycling won't matter in our lifetimes. Chip makers tend to engineer for timelines like the ten-year horizon, and are pretty safe in assuming a part will live for five years of casual use.
If you recall, high-lead bumps are stiffer than eutectic and more prone to stress fractures. The high-lead-to-eutectic substrate bond is also weaker than a eutectic-to-eutectic bond. What is happening to Nvidia is that the substrate to bump joint is cracking, and the chips die. High lead bumps are a poor choice to use in this application.
One other bit to bring into the mix is underfill. If things were as simple as heat leads to cracking, no chips would work for any length of time. Underfill not only protects the bumps from moisture and contamination, but it also provides mechanical support as well. It is designed to take some of the stress that the bumps take, making them live longer.
Underfill can range from rock hard to soft and squishy, it depends on your application. The harder the underfill, the more mechanical support it provides, and the less stress the bumps take. Simple enough.
That brings us to another material, the polyimide layer. When chips went to a low-K dielectric material, which is not the same as the high-K gate material, it proved a problem with packaging, bumps and underfill. The solution was to put a polyimide layer, sometimes called a stress layer, to cover the bottom of the chip. This prevents contamination and mechanical damage.
If you pick an underfill that is too soft, it doesn't provide you enough mechanical support for the bumps, they crack and your chip dies an early death. Pick one that is too hard and it rips the polyimide layer off. In the words of one packaging engineer talked to for this article, if you used too hard of an underfill, the chip "wouldn't survive the first heat cycle". The magic is in the middle, you have to pick a bowl of porridge, er, underfill, that is strong enough to provide the support you need, but not so strong as to rip layers off your chip. Like we said, package engineering is not for the faint of heart, but it can make baby bear happy.
That brings us to the billion dollar question: why not simply change bump types to eutectic if they are that much better, which they are, in some ways? The answer is in the current capacity, more specifically average current capacity. We mentioned this earlier, and the idea ties into the hot spots and functional units.
Take a hypothetical simple CPU that has integer and floating point units. If you are doing heavy integer work, the power bumps that supply that part of the chip will be loaded heavily and the FP bumps will not be doing much of anything at all. When FP load gets heavy, the opposite happens.
The layout of the bumps is designed so that neither set will be overloaded at peak times, and in fact won't get all that close to their maximum. To use completely made up numbers, take a bump that has a peak capacity of 1000mA, and for longevity you don't want to exceed 800mA, basically a 20 per cent safety margin.
If the chip's total current draw at TDP divided by the number of power bumps, i.e. the average current per bump, is 200mA, there are likely many bumps drawing 100mA and a few, under heavily loaded areas, that draw 600mA. This draw moves around with the work the chip is doing. Some may never break 100mA, others may be at 600mA for their entire lives. All are well below the 800mA long-life limit, much less the 1000mA max.
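The made-up numbers above can be turned into a quick check. This is a sketch only: the block names and per-bump draws are hypothetical, and the limits are the invented 1000mA peak and 800mA long-life figures from the example.

```python
PEAK_MA = 1000       # made-up peak capacity of a high-lead bump
LONG_LIFE_MA = 800   # peak minus the 20 per cent safety margin


def classify(draw_ma, long_life_ma=LONG_LIFE_MA, peak_ma=PEAK_MA):
    """Label a bump by how its current draw compares with the two limits."""
    if draw_ma > peak_ma:
        return "burns out"
    if draw_ma > long_life_ma:
        return "shortened life"
    return "ok"


# Hypothetical per-bump draws under three parts of the chip.
draws_ma = {"idle_block": 100, "average_block": 200, "hot_spot": 600}
report = {name: classify(ma) for name, ma in draws_ma.items()}
print(report)
```

Every bump in this toy layout comes back as "ok", which is the point of the over-provisioning: even the hot spot sits comfortably under the long-life limit.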
The problem with eutectic bumps is that they have a lower current capacity, and the closer you get to it, the worse the problem of electromigration becomes. Let's pick a hypothetical eutectic bump that has a peak capacity of 500mA and the same 20 per cent safety margin, 400mA max for long life. If Nvidia wants to swap in eutectic bumps for the high lead they are using, there is a slight problem, they are well over the current capacity of the new bumps.
If the chip actually powers up without letting the smoke out, the first time you fire up a massive game of Telengard, it will most assuredly go pop. In the rare case that the gods of luck are staring right at you and the thing doesn't fry immediately, electromigration will ensure it has the lifespan of a mayfly, basically worse than the current crop of defective Nvidia chips.
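The size of that gap can be sketched with the same invented figures: a hot spot drawing 600mA per bump, measured against a 400mA long-life limit for the hypothetical eutectic bump. The function name is ours, and the sketch assumes current divides evenly between the bumps feeding a region, which a real layout will not guarantee.

```python
import math


def bumps_needed(draw_per_bump_ma, existing_bumps, long_life_ma):
    """Minimum bump count that carries the same total current within the limit.

    Assumes the region's current is shared evenly across its bumps.
    """
    total_ma = draw_per_bump_ma * existing_bumps
    return math.ceil(total_ma / long_life_ma)


# A hypothetical hot spot fed by 10 bumps at 600mA each.
print(bumps_needed(600, 10, 800))  # high-lead long-life limit
print(bumps_needed(600, 10, 400))  # eutectic long-life limit
```

In this toy case the 10 existing bumps are more than enough at the high-lead limit, but the same region would need 15 eutectic bumps, so a straight swap leaves the hot spot short.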
What do you do? You can either cut the power used by the GPU way way down, ie, clock it at a point where no one would ever buy it, or rearrange where the bumps go. The rearrangement is not a trivial thing, and may require moving large parts of the chip around, basically a partial relayout. This is expensive, time consuming, and likely can't be done and validated in the time the chip is on sale for.
The other option is basically just as bad, you need a power plane or power grid on the die. This is a metal layer that distributes power across the die, and it means adding a layer to the chip. That means expense, slightly lower yield, and can have other detrimental effects to power draw and clocking.
All of these things can be dealt with if you see this coming when you start making the GPU. It is pretty painfully obvious that Nvidia didn't, otherwise they wouldn't have used high lead bumps and gotten into the hole that they are in. They have switched to eutectic bumps, but given the way it is being done, and the supplier grumbles we are hearing, it appears to be very poorly planned. It will be interesting to see the lifespan of these new parts. µ
Part Two - The problem of underfill is here.