The Inquirer-Home

Why Nvidia's duff chips are due to shoddy engineering

Part Two: The underfill
Monday, 1 September 2008, 21:46

This is the second part of a series of three articles getting to the nub of Nvidia's graphics chip woes. The series is the result of months of research conducted by diligent INQhack Charlie Demerjian, despite an in-box stuffed full of abuse. Part One can be found here and Part Three is here.

GETTING BACK to the underfill, this is probably the key to the problem. There is one more property of underfill called the glass transition temperature, Tg for short. Tg is not a melting point; it is the temperature at which the material goes soft and loses most of its structural rigidity. The underfill that Nvidia used, Namics 8439-1, is what's called a low Tg material, and the Hitachi 3730 has a higher Tg.

To be fair to Nvidia, about the time when the G84 and G86s were hitting the market, high Tg underfills were pretty rare and new to the market. Low Tg underfills, such as the Namics material that NV used, had been available for a while, and were 'known'. The last thing you want to do is put a high risk part on a new, market-untested material, so it looks like they went with the safe choice, low Tg.

If Nvidia did their homework right, the Tg of the material should never be hit, the chip should always run below that temp, and the underfill should provide the mechanical support needed to keep the high lead bumps from fracturing. This is why you engineer, test, retest, simulate, pray a lot, and pick your materials very carefully.

Namics 8439-1 underfill temp vs strength curve

Here is the Tg curve for Namics 8439-1. Let us be the first to say there appears to be nothing, repeat, nothing wrong with this material, it does exactly what it says it does. It starts to lose strength at about 60C and by a little over 80C it has 100 times less rigidity. Think going from hard plastic to jello. What temps do GPUs run at again? What is the Tj (transistor junction temperature) for them? Ooops. Big hundreds of millions of dollar ooopsie right here.
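To put rough numbers on that drop, here is a minimal sketch that interpolates stiffness between the two points described above. The 60C and 80C anchors and the factor of 100 are round figures taken from the description of the curve, not from the Namics datasheet, and the log-linear shape between them is an assumption made purely for illustration.

```python
import math

# Illustrative anchor points read loosely off the article's description,
# not taken from the Namics datasheet.
SOFTENING_START_C = 60.0   # stiffness begins to drop around here
SOFTENING_END_C = 80.0     # roughly 100x softer by this point
STIFFNESS_DROP = 100.0     # ratio of rigid to softened stiffness


def relative_stiffness(temp_c: float) -> float:
    """Return stiffness relative to the cold value (1.0 = fully rigid)."""
    if temp_c <= SOFTENING_START_C:
        return 1.0
    if temp_c >= SOFTENING_END_C:
        return 1.0 / STIFFNESS_DROP
    # Assumed shape: linear in log space, so each degree costs the same factor.
    fraction = (temp_c - SOFTENING_START_C) / (SOFTENING_END_C - SOFTENING_START_C)
    return 10 ** (-fraction * math.log10(STIFFNESS_DROP))


for temp in (50, 60, 70, 80, 90):
    print(f"{temp}C -> {relative_stiffness(temp):.3f} of cold stiffness")
```

Under these assumptions, the material is already down to about a tenth of its cold stiffness halfway through the transition, well before the 80C mark.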

So, the failure chain happens like this. NV for some unfathomable reason decides to design their chips for high lead bumps, something that was likely decided at the layout phase or before because the bump placement is closely tied to the floorplan. At this point, they are basically stuck with the bump type they chose for the life of the chip.

The next choice was the underfill material, and again, they chose the known low Tg part, which had far less tolerance than the newer to the market high Tg materials. It was a risk vs risk proposition, likely with a lot of cost differences as well. They chose wrong, very wrong. The stiffness of the Namics material might be perfect below the Tg, but once you hit it, it is almost like it isn't there, and the stress transfers to the bumps while they are hot and weak.

Fanbois will cry that their $.23 temp sensor is reading much lower temps than that, so there is no way this could be an issue. Well, the temp sensors on many cards are not on die, much less between the die and the substrate. They are also cheap and notoriously inaccurate. To top it off, they only measure average temp across the chip, not hot and cold spots. If you look at the IR photo in the previous part of this story, you can see that if you move the sensor from the right side to the left, you will get vastly differing readings. In this case, a real current chip, it will vary by as much as 30C depending on placement.
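The gap between an averaged reading and a hot spot is easy to show with a toy example. The temperature grid below is invented for illustration; only the roughly 30C spread echoes the figure quoted above, and the 80C threshold is the same round number used for the underfill softening point, not a measured value.

```python
# Hypothetical die temperature map in degrees C (rows x columns).
# The numbers are invented; only the ~30C spread echoes the article.
die_temps_c = [
    [62, 64, 70, 78],
    [60, 66, 76, 88],
    [58, 62, 72, 84],
]

TG_SOFT_C = 80.0  # assumed point where the underfill is roughly 100x softer


def summarize(temps):
    """Compare what an averaging sensor reports with the actual extremes."""
    cells = [t for row in temps for t in row]
    return {
        "average": sum(cells) / len(cells),
        "hottest": max(cells),
        "coolest": min(cells),
        "cells_past_soft_point": sum(1 for t in cells if t > TG_SOFT_C),
    }


stats = summarize(die_temps_c)
print(stats)
```

In this made-up map, the average sits comfortably below the threshold while two cells are past it, which is the situation the sensor argument misses.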

Many people also don't realize that it is easier for heat to travel down through the pins, which act as mini heat pipes, than it is to cross the three thermal barriers (die -> thermal paste -> heat spreader -> thermal paste -> heatsink) to the heatsink. That means those little bumps take a huge thermal pounding, and are usually hotter than the surface of the heat spreader.

To make matters worse, the bumps that are under the hot spots get hotter still. Piling on the pain, they carry the most current, and the hotter things get, the more resistance they usually have, and the more heat they generate.
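The feedback between temperature, resistance and heat can be sketched with the textbook linear resistance model and P = I^2 * R. Everything numeric here is a placeholder: the reference resistance, the current, and the temperature coefficient are hypothetical values chosen for illustration, not measurements of any Nvidia bump.

```python
def bump_resistance(r_ref_ohm, temp_c, alpha_per_c=0.004, ref_temp_c=25.0):
    """Linear temperature model: R = R_ref * (1 + alpha * (T - T_ref)).

    alpha_per_c is a placeholder coefficient, not a measured bump value.
    """
    return r_ref_ohm * (1.0 + alpha_per_c * (temp_c - ref_temp_c))


def joule_heat_w(current_a, r_ref_ohm, temp_c):
    """Power dissipated in the bump at a given temperature, P = I^2 * R."""
    return current_a ** 2 * bump_resistance(r_ref_ohm, temp_c)


# Hypothetical bump: 5 milliohm at 25C, carrying 0.2 A.
for temp in (25, 60, 100):
    print(f"{temp}C -> {joule_heat_w(0.2, 0.005, temp) * 1000:.3f} mW")
```

The same current produces more heat in the hotter bump, which is the direction of the loop described above; how strong the effect is in a real part depends on numbers this sketch does not have.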

Could it get worse? Of course it could. Remember thermal stress? The expansion is highest at the point, wait for it, that is hottest. That would be under the hot spots, and it puts the most stress on the bumps that are weakest.

This is why you have to pick your underfill very carefully, you have to relieve as much stress as you can from the bumps. Too little and they go snap, and the chip dies. Too much and you pull the polyimide layer off and the chip dies. Basically, you go as stiff as you dare, then test the hell out of it under the conditions your simulations tell you will be present. Test, test, test, test or dies die.

When the underfill glassifies, it means you are at the hottest point on the die, the bumps that it is protecting are under the most heat, pulling the most current, and under the most thermal stress. If the underfill essentially turns to jello, it is very bad. If you compound that by using bumps that bond poorly to the substrate, it makes things worse. If those bumps are stiffer than the other option, it is worse yet.

Let's go down the checklist for Nvidia. High thermal load? Check. Unforgiving high lead bumps? Check. Eutectic pads? Check. Low Tg underfill? Check. Hot spots that exceed the underfill Tg? Check. If you are thinking this looks bad, you are right, expensive too.

If it were as simple as the underfill glassifying, the parts would never have made it to market. It is much more complex than that. The problem with thermal stress is that the damage is somewhat additive: it weakens parts long before they actually break, unless it is quite extreme.

An example of extreme thermal stress would be to take a glass cup, preferably non-tempered, and put it in the oven on max. Pull it out and drop it in a bucket of ice water, and voila, instant thermal stress demonstration. Wear eye protection. The thermal stress that the bumps see is much more like the fork example earlier, it gets weaker and weaker with each bend, until snap, black screen.
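One common textbook way to express "weaker with each bend" is Miner's rule, which adds up the fraction of life consumed by each batch of cycles and predicts failure when the total reaches 1. This is a generic fatigue model brought in for illustration, not something the sources attribute to Nvidia's analysis, and the cycle counts below are entirely hypothetical.

```python
def miner_damage(cycle_blocks):
    """Sum n/N over (cycles_applied, cycles_to_failure) pairs (Miner's rule)."""
    return sum(applied / to_failure for applied, to_failure in cycle_blocks)


# Hypothetical usage history for one bump. Harsher thermal swings are
# given a much shorter assumed life (cycles_to_failure).
history = [
    (2000, 50000),  # mild swings: light web browsing
    (500, 2000),    # moderate swings: video playback
    (300, 400),     # severe swings: gaming, then suspend
]

damage = miner_damage(history)
print(f"accumulated damage: {damage:.2f}",
      "(failed)" if damage >= 1.0 else "(still alive)")
```

The point of the sketch is that no single block of cycles breaks the part; the few hundred severe swings do most of the work, and the mild ones push it over the edge.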

If you recall, the Nvidia parts are breaking at the bump to substrate connection. This is the weakest point in the chain, and it is where they made the worst possible materials choice. It is not really a surprise that it failed. It is simply shoddy engineering.

So, what can be done by Nvidia at this point? Well, changing to high Tg underfills is a start, as is changing to eutectic bumps. The high Tg underfill option has come down in risk substantially since the G84 and G86 parts were introduced, so that is a no-brainer, and guess what Nvidia did to the G86? And the G92 as well.

The problem of changing bump types is far thornier, but Nvidia is doing that as well. From the intelligence we have been able to gather, Nvidia has not made any power distribution changes to the parts: there is no power grid, no power plane, nothing to protect the eutectic bumps from electromigration. They may be able to keep them under their current capacity, but by how much?

This is emblematic of the 'pants are on fire' school of engineering, and reports from inside Nvidia confirm that they are in full panic mode over this snafu. With short time horizons to fix a massive batch of defective parts, reliability engineering usually draws the short straw. µ

Part Three: The cock-up, is here

Comments
The odds are

... at some point on the hybrid chips (high lead engineered dies with eutectic actual bumps) the current tolerance per bump will be exceeded and electromigration will mean more part failures but the question is how long it takes for eutectic failure cf high lead? Does it fall within warranty periods? Is that what matters to nVidia?

From what I have observed of nVidia through limited interaction over chipsets, their commercial conscience was not congruent with my priorities as a customer.

When Jen-Hsun Huang said recently nVidia was so dedicated to customers it was ready to spend $200 to fix a $20 part failure, it strikes me he could have been expressing a sincere change of heart, or then again...

posted by : Richard, 01 September 2008
Its Manufacturering error

If it were simply due to usage, water-cooled models would not have such problems. I believe the problems are built in right from the final (brand name) factory.

Meaning the engineering is probably OK; it's the board assembly of the part that starts the defect trail.

Song medoly: There Once was Lone Cowboy from Streets of Laraedo, All outfitted in Best of White Linen....Hang Down Your Head Tom Dooley, Hang Down Your Head & Cry....Poor GPU, Your Gonna.ds Die.d.

Each Unit fine example Until Next Butcher takes its presious cargo & blothces it. Here Assembly Heat Has Ruined what may be perfect design (except so many couldn't take assembly heat).Ahso, Except that Power of Great Ultee' blast past Known Thermal envelopes & make linen Schorched randomly, now out of sync with specs.
drashek

posted by : Doc_Tom, 01 September 2008
GTH

You ppl are such jerks. When a company is going through rough patch, you always try to make sure that you jump on top of it and sink it further by creating more negative publicity. NV has given us some great GPUs in the past few years, and 1 bad series can't take that away by your stupid reporter's stupid articles. 
Get a life and do some good to the world!

posted by : Alex, 01 September 2008
What about ATI?

So what material combo is ATI using on its parts? Those new Radeon 4800 cards run at 70-80 degrees, that's alarmingly high.

posted by : fastpunk, 01 September 2008
And...

All the king's horses and all the king's men,
Couldn't put Bumpty together again.

posted by : unknownjd, 01 September 2008
Remember the capacitors

Excellent journalism!
I think Nvidia is a great company. Though they have milked people for their chips too long - AMD isn't any better in that respect. I hope they do rebound & continue to provide excellent competition for Intel & AMD. I don't think this is a unique scenario, just look back at the faulty electrolytic capacitors that flooded the market recently. I still haven't noticed any significant numbers of laptops or discrete cards (though AMD seem to dominate the Foxconn PCs) failing.
Keep up the good work Mr C.

posted by : S, 02 September 2008
Why only NV has problems?

Hi,

I was wondering: if everybody has been using the same materials for years, and many other chips run at 70-80C (or more), how come only now, and only Nvidia, has so much trouble with dying chips?

Could somebody explain what is different in AMD/ATI or Intel chips that they don't get damaged by high temperature?

Thanks.

Lukasz

posted by : LukaszN, 02 September 2008
Good research

Well done article, good research, sound facts.

Poor engineers at NVidia. My guess is that this mess is not really their fault. Engineers know you should abide by the centuries old principles of sound engineering, as Charlie rightly points out in his articles.

Based on my long, long years of industrial experience, this stuff-up was likely brought about by clueless managerial types aiming to accelerate their career paths.

Of course time to market, production cost etc. are important factors. But in the end you can't get around proper engineering, or it comes back to haunt you.

May NVidia survive this disaster to fight another day, and may others learn from this.

posted by : Maarten, 02 September 2008
Brilliant!

Bravo Charlie, well done! That’s the kind of writing and research that used to define Charlie D. 

Excellent piece, parts one and two.

SPARKS

posted by : SPARKS, 02 September 2008
Let us know when you have proof

Another story about engineering with "sources inside nvidia panicking". But no evidence any of it is true.

Thanks for a good read. Let us know when you can prove it. Like say, a vid card maker going on record saying 40% of their cards are crapping out. Not more "sources at blah blah, say it's true".

So what. This line of stories is really old FUD.

posted by : The Jian, 02 September 2008
Strain, not Stress

Chucky's beginner mistake on this article is that he does not know the difference between stress and strain.

Stress is the pent up force, strain is the physical deformation that results when the force overcomes the resistance through which it built up.

Due to thermal expansion, the bumps are placed under stress. If that stress is greater than what the substrate can handle, the joint becomes strained, breaking the contact if strained enough.

When the underfill loses its rigidity due to high temperature, the substrate becomes less resistant to stressing force. The substrate and the now unrestrained bumps may deform (become strained).

The underfill does not relieve the bumps of stress as Chucky stated. It prevents the stress from becoming physical deformation (strain). 

I'm pretty sure it was high school physics where this is taught. You did graduate high school didn't you Chucky?

No matter, Chucky lost all credibility when he declared in a brutal rant that he would never write about or use Vista again ... but kept writing about and using Vista. 

If you believe Chucky's weak manipulation of coincidences as an explanation for the high failure rates of Nvidia GPUs, then I've got a lovely piece of bread to sell you that has the image of whatever false deity you worship on it that appeared through the divine action of my toaster.

posted by : Ken, 02 September 2008
I am never overclocking anything again!

I wonder will anyone else.

The part about concentrations of heat in specific areas, whilst the temperature readout is only the average, that's a bit of a concern.

Nice one Charlie, let's have another one.

posted by : interested_party, 02 September 2008
Logic and reason

Charlie has now given us more logic and reason as to why the NV parts had design flaws and why they are failing. For those of you who question why ATI parts are not failing, ATI simply designed theirs better. My Sapphire Radeon HD 2600PRO idles at 55C to 57C and under load in gaming, at up to 64C. No failure. Better design. ATI didn't set aside 200 Million to fix notebooks with defective GPU's. ATI desktop parts aren't failing either. Can't wait to see what part 3 has for you NV fan boys to betch about.....

posted by : Eric, 02 September 2008
The Real Ultimate Cause

The corporate need to remove costs from the business.

The corporate need to increase profits by reducing embedded cost of material.

The corporate need to reduce costs by reducing R&D expenditures.

The list goes on and on.


posted by : Doug Glass, 02 September 2008
Speakers & Monies.

Take audio speaker, 8 ohm model can run at 4 ohm will probably play OK yet, Not as well, just louder than when at correct 8 ohm. 
Reverse, 4 ohm speaker played at 8 ohm takes more power to reach acceptable volumne, More strain. However at near 0 ohm, entire amplifier blows out. Thats Not much variation in resistance to produce such near fata/fatal or variant results.

Here Nvidia May need huge write off to lessen burden of Agiena PhysX purchase. As Nvidia is Not that Stupid. I though Kens remarks where bit daft, as chuckie leaden hackett man, Not engineer. So why blame Msr. Demijerian?
drashek

posted by : OMG_II, 02 September 2008
Ignorance

you call yourself a journalist and you can't even sort the difference between "loses" and "looses"? you're an embarrassment to journalism, and you sound like a poorly educated angst ridden fan boy. I'm guessing that's because that is what you are. you should be writing user reviews for gamefaqs.

posted by : Drake Foster, 02 September 2008
1. Good Work. 2. what's worse is...

NVidia didn't simply put heat sinks on the too hot chips.

Or bigger heat sinks, if they had sinks on 'em.

I've read about people using IR imagers on their mobos and cards to discover ( without zapping 'em ) what chips need cooling, and go adding it themselves, for reliability.

The thermal limits are manoeverable if you remember all the tools available, ya?

posted by : Captain Obvious, 03 September 2008
Tg

I don't think you have quite understood glass transition temperatures Charlie...

"When the underfill glassifies, it means you are at the hottest point on the die"

If the prevailing temperature is greater than the material's glass transition, it will be rubbery. If the temperature is lower than the glass transition, the material will exist as a glass. So in the example (Temp >80 degrees C) the underfill will be rubbery not glassy.

Is glassifies even a word?

posted by : Chris, 03 September 2008
keyrings anyone?

You just know there is going to be a glut of Nvidia chips sunk into clear plastic to make up a charming little keyfob.

At least then Nvidia could recoup some of its lost earnings...even then I bet they could screw the manufacturing of that up too....

LMFAO...

What a shame...


posted by : 99flake, 09 September 2008