This is the second part of a series of three articles getting to the nub of Nvidia's graphics chip woes. The series is the result of months of research conducted by diligent INQhack Charlie Demerjian, despite an in-box stuffed full of abuse. Part One can be found here and Part Three is here.
GETTING BACK to the underfill, this is probably the key to the problem. There is one more property of underfill called the glassification temperature (more formally, the glass transition temperature), Tg for short. Tg is not melting, it is more the temp at which the material goes soft and loses most of its structural rigidity. The underfill that Nvidia used, Namics 8439-1, is what's called a low Tg material, and the Hitachi 3730 has a higher Tg.
To be fair to Nvidia, about the time when the G84 and G86s were hitting the market, high Tg underfills were pretty rare and new to the market. Low Tg underfills, such as the Namics material that NV used, had been available for a while and were 'known'. The last thing you want to do is put a high risk part on a new, market-untested material, so it looks like they went with the safe choice, low Tg.
If Nvidia did their homework right, the Tg of the material should never be hit, the chip should always run below that temp, and the underfill should provide the mechanical support needed to keep the high lead bumps from fracturing. This is why you engineer, test, retest, simulate, pray a lot, and pick your materials very carefully.
Namics 8439-1 underfill temp vs strength curve
Here is the Tg curve for Namics 8439-1. Let us be the first to say there appears to be nothing, repeat, nothing wrong with this material, it does exactly what it says it does. It starts to lose strength at about 60C and by a little over 80C it has about one hundredth of the rigidity. Think going from hard plastic to jello. What temps do GPUs run at again? What is the Tj (transistor junction temperature) for them? Ooops. Big hundreds-of-millions-of-dollars ooopsie right here.
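To put rough numbers on that curve, here is a minimal sketch. It assumes the stiffness falls in a straight line on a log scale between the two points quoted above (about 60C and about 80C, a 100x drop). The real datasheet curve will have a different shape, and the function name and constants are made up for illustration.

```python
import math

# Assumed breakpoints, taken from the description of the curve above:
# full strength up to ~60C, roughly 100x softer by ~80C.
SOFTEN_START_C = 60.0
SOFTEN_END_C = 80.0
STIFFNESS_DROP = 100.0  # ratio of cold stiffness to soft stiffness


def relative_stiffness(temp_c: float) -> float:
    """Return stiffness relative to the cold value (1.0 = fully rigid).

    Assumption: the drop is a straight line on a log scale between the
    two breakpoints. This is an illustration, not the datasheet curve.
    """
    if temp_c <= SOFTEN_START_C:
        return 1.0
    if temp_c >= SOFTEN_END_C:
        return 1.0 / STIFFNESS_DROP
    fraction = (temp_c - SOFTEN_START_C) / (SOFTEN_END_C - SOFTEN_START_C)
    return 10.0 ** (-fraction * math.log10(STIFFNESS_DROP))


if __name__ == "__main__":
    for temp in (40, 60, 70, 80, 100):
        print(f"{temp:>3}C -> {relative_stiffness(temp):.3f} of cold stiffness")
```

Under that assumption, the halfway point of 70C already leaves only about a tenth of the cold stiffness, which is why a chip that merely brushes the transition zone is in trouble.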
So, the failure chain happens like this. NV for some unfathomable reason decides to design their chips for high lead bumps, something that was likely decided at the layout phase or before because the bump placement is closely tied to the floorplan. At this point, they are basically stuck with the bump type they chose for the life of the chip.
The next choice was the underfill material, and again, they chose the known low Tg part that had far less tolerance than the newer-to-the-market high Tg materials. It was a risk vs risk proposition, likely with a lot of cost differences as well. They chose wrong, very wrong. The stiffness of the Namics material might be perfect below the Tg, but once you hit it, it is almost like it isn't there, and the stress transfers to the bumps while they are hot and weak.
Fanbois will cry that their $.23 temp sensor is reading much lower temps than that, so there is no way this could be an issue. Well, the temp sensors on many cards are not on die, much less between the die and the substrate. They are also cheap and notoriously inaccurate. To top it off, they only measure average temp across the chip, not hot and cold spots. If you look at the IR photo in the previous part of this story, you can see that if you move the sensor from the right side to the left, you will get vastly differing readings. In this case, a real current chip, it will vary by as much as 30C depending on placement.
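The average-versus-hot-spot point is easy to show with a toy example. The grid below is entirely hypothetical, picked only to give the 30C spread mentioned above, and the 80C limit is an assumed stand-in for the point where the underfill has gone soft.

```python
# Hypothetical die temperature map in Celsius (rows x columns).
# The numbers are made up to show a 30C spread; they are not measurements.
DIE_TEMPS_C = [
    [62.0, 65.0, 70.0, 78.0],
    [63.0, 68.0, 80.0, 92.0],
    [62.0, 66.0, 74.0, 85.0],
]

UNDERFILL_SOFTEN_C = 80.0  # assumed point where the underfill has gone soft


def summarize(temps):
    """Return (average, hottest, coldest) over a 2D grid of temperatures."""
    flat = [t for row in temps for t in row]
    return sum(flat) / len(flat), max(flat), min(flat)


def cells_over_limit(temps, limit_c):
    """Count grid cells that are at or above the given limit."""
    return sum(1 for row in temps for t in row if t >= limit_c)


if __name__ == "__main__":
    average, hottest, coldest = summarize(DIE_TEMPS_C)
    print(f"average reading: {average:.1f}C")
    print(f"hottest spot:    {hottest:.1f}C")
    print(f"spread:          {hottest - coldest:.1f}C")
    print(f"cells at or over {UNDERFILL_SOFTEN_C:.0f}C: "
          f"{cells_over_limit(DIE_TEMPS_C, UNDERFILL_SOFTEN_C)}")
```

In this made-up map the average sits comfortably under the limit while three cells are at or over it, so a sensor that reports an average, or sits on the cool side, says everything is fine.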
Many people also don't realize that it is easier for heat to travel down through the pins, they are mini-heat pipes, than it is to cross the three thermal barriers (die -> thermal paste -> heat spreader -> thermal paste -> heatsink) to the heatsink. That means those little bumps take a huge thermal pounding, and are usually hotter than the surface of the heat spreader.
To make matters worse, the bumps that are under the hot spots get hotter still. Piling on the pain, they carry the most current, and the hotter things get, the more resistance they usually have, and the more heat they generate.
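That feedback loop can be sketched as a simple iteration: resistance rises with temperature, resistive heating rises with resistance, and the bump settles a bit hotter than it would without the loop. Every number below is a hypothetical placeholder, including the temperature coefficient (a typical order of magnitude for metals), and the one-bump thermal model is an assumption made for illustration.

```python
def settle_bump_temp(base_temp_c, current_a, r_ref_ohm, alpha_per_c,
                     theta_c_per_w, ref_temp_c=20.0, steps=50):
    """Iterate the heat/resistance loop for one bump until it settles.

    base_temp_c   : temperature the bump would sit at with no self-heating
    current_a     : current through the bump (hypothetical)
    r_ref_ohm     : bump resistance at ref_temp_c (hypothetical)
    alpha_per_c   : temperature coefficient of resistance (assumed)
    theta_c_per_w : temperature rise per watt dissipated (assumed)
    """
    temp_c = base_temp_c
    for _ in range(steps):
        # Resistance goes up as the bump heats up.
        resistance = r_ref_ohm * (1.0 + alpha_per_c * (temp_c - ref_temp_c))
        # More resistance at the same current means more heat.
        power_w = current_a ** 2 * resistance
        temp_c = base_temp_c + theta_c_per_w * power_w
    return temp_c


if __name__ == "__main__":
    no_feedback = settle_bump_temp(85.0, 0.5, 0.01, 0.0, 2000.0)
    with_feedback = settle_bump_temp(85.0, 0.5, 0.01, 0.004, 2000.0)
    print(f"without feedback: {no_feedback:.2f}C")
    print(f"with feedback:    {with_feedback:.2f}C")
```

With these placeholder values the loop converges, so the effect is an extra push rather than a runaway. The point is only the direction: the bumps carrying the most current in the hottest spots get the worst of it.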
Could it get worse? Of course it could. Remember thermal stress? The expansion is highest at the point, wait for it, that is hottest. That would be under the hot spots, and it puts the most stress on the bumps that are weakest.
This is why you have to pick your underfill very carefully, you have to relieve as much stress as you can from the bumps. Too little stiffness and they go snap, and the chip dies. Too much and you pull the polyimide layer off and the chip dies. Basically, you go as stiff as you dare, then test the hell out of it under the conditions your simulations tell you will be present. Test, test, test, test or dies die.
When the underfill hits its Tg and goes soft, it means you are at the hottest point on the die, the bumps that it is protecting are under the most heat, pulling the most current, and under the most thermal stress. If the underfill essentially turns to jello, it is very bad. If you compound that by using bumps that bond poorly to the substrate, it makes things worse. If those bumps are stiffer than the other option, it is worse yet.
Let's go down the checklist for Nvidia. High thermal load? Check. Unforgiving high lead bumps? Check. Eutectic pads? Check. Low Tg underfill? Check. Hot spots that exceed the underfill Tg? Check. If you are thinking this looks bad, you are right, expensive too.
If it were as simple as the underfill going soft, the parts would have never made it to market. It is much more complex than that. The problem with thermal stress is that it is somewhat cumulative, it weakens parts long before they actually break unless it is quite extreme.
An example of extreme thermal stress would be to take a glass cup, preferably non-tempered, and put it in the oven on max. Pull it out and drop it in a bucket of ice water, and voila, instant thermal stress demonstration. Wear eye protection. The thermal stress that the bumps see is much more like the fork example earlier, it gets weaker and weaker with each bend, until snap, black screen.
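The weaker-with-each-bend idea has a standard bookkeeping form in fatigue engineering: a Coffin-Manson style relation for how many cycles a joint survives at a given temperature swing, plus Miner's rule for adding up the damage from different swings. This is a generic sketch of those two textbook tools, not anything Nvidia is known to have used, and the coefficient and exponent are hypothetical placeholders rather than measured values for high lead bumps.

```python
def cycles_to_failure(delta_t_c, coefficient=1.0e7, exponent=2.0):
    """Coffin-Manson style estimate of cycles to failure.

    Bigger temperature swings mean fewer cycles before the joint cracks.
    The coefficient and exponent are hypothetical, for illustration only.
    """
    return coefficient * delta_t_c ** (-exponent)


def accumulated_damage(swings):
    """Miner's rule: sum cycles_done / cycles_to_failure over all swings.

    swings is a list of (delta_t_c, cycle_count) pairs. A total of 1.0
    or more means the joint is predicted to have failed.
    """
    return sum(count / cycles_to_failure(delta_t) for delta_t, count in swings)


if __name__ == "__main__":
    # Hypothetical usage history: gentle swings plus heavier gaming swings.
    history = [(40.0, 2000), (60.0, 1500)]
    damage = accumulated_damage(history)
    print(f"fraction of fatigue life used: {damage:.2f}")
    print("failed" if damage >= 1.0 else "still alive, but weaker")
```

The useful part is the shape of the answer: no single cycle breaks anything, the damage fraction just creeps toward 1.0, and larger swings eat the budget much faster than small ones.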
If you recall, the Nvidia parts are breaking at the bump to substrate connection. This is the weakest point in the chain, and it is where they made the worst possible materials choice. It is not really a surprise that it failed. It is simply shoddy engineering.
So, what can be done by Nvidia at this point? Well, changing to high Tg underfills is a start, as is changing to eutectic bumps. The high Tg underfill option has come down in risk substantially since the G84 and G86 parts were introduced, so that is a no-brainer, and guess what Nvidia did to the G86? And the G92 as well.
The problem of changing bump types is far thornier, but Nvidia is doing that as well. From the intelligence we have been able to gather, Nvidia has not made any power distribution changes to the parts, there is no power grid, no power plane, nor anything else to protect the eutectic bumps from electromigration. They may be able to keep them under their current capacity, but by how much?
This is emblematic of the 'pants are on fire' school of engineering, and reports from inside Nvidia confirm that they are in full panic mode over this snafu. With short time horizons to fix a massive batch of defective parts, reliability engineering usually draws the short straw. µ
Part Three: The cock-up, is here