This is the third and final part of a series of three articles getting to the nub of Nvidia's graphics chip woes. The series is the result of months of research conducted by diligent INQhack Charlie Demerjian, despite an in-box stuffed full of abuse. Part One can be found here and Part Two is here.
SOURCES CLOSE to Dell say they knew about the problem a year ago, and HP is on record as being aware in November, so there has been about a year to characterise the problem, design a solution and test it. Multiple sources involved with package engineering tell us that this is not nearly enough time to do a proper test regime, much less long-term reliability studies.
This new package and materials set does not appear to have been nearly as carefully vetted as it should have been. It may work but, then again, it may not. If the lack of power distribution changes is accurate, we may very well be reading about Nvidia Defective Chipsgate II in a couple of years.
How widespread is the problem? We told you about G84s and G86s as well as G92s and G94s. From the materials side, it appears that all non-R and non-F lot-numbered parts made on the 65nm and 55nm processes are defective. The flaw is a downright idiotic choice of multiple materials coupled with poor chip design and inadequate testing. It is a case of errors compounding errors. They are all defective.
If this is the case, why aren't we seeing more defective desktop parts? That one is easy: thermal stress. Two components lead to a bump fracturing. The first is the amount of stress, that is, the hot-to-cold temperature delta. The second is the number of times the part is powered up and down, that is, the heat cycles. Glass cups in the oven would be the amount of stress; the bent fork would be the number of cycles.
Remember the Nvidia 8-K, where it announced that "...customer use patterns are contributing factors"? By customer use patterns, it is referring mainly to thermal cycles, but you could also credit it with meaning high temperatures while the GPU is being pushed hard in gaming and the like.
Desktop systems are usually turned on once a day or so. Some people leave them on for weeks at a time, others may turn them on and off a few times in a day. The average desktop probably has about one heat cycle a day.
Laptops, on the other hand, are woken up and put to sleep many times a day. If you take a typical student who wakes up, checks his email, goes to three classes, takes notes, goes to a coffee shop for a bit, goes home, watches a video or two, then goes to sleep, it is not hard to make a case for 10 or more power cycles a day. Every wake up/sleep or hibernate cycle is a heat cycle, so dozens are not out of the question.
The more cycles you put on it, and the more severe they are, the quicker these defective parts will die. A good way to look at it is to assign each critical bump an amount of stress it can take before it cracks. Let's call this number 100AU, for Arbitrary Units. If a power-on cycle is worth 4 AU, and a hardcore gaming session with the CPU overclocked to within 1MHz of it crashing is worth 15 AU, you can figure out when it should die. Remember, these are hypothetical numbers... the theory is the point.
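To make the arithmetic concrete, here is a minimal Python sketch of that budget. The 100 AU limit, the 4 AU power cycle and the 15 AU gaming session are the hypothetical figures above, and the one-cycle-a-day desktop and ten-cycle-a-day laptop come from the usage patterns already described. The function name and the one-gaming-session-a-day case are our own assumptions, added purely for illustration.

```python
# Toy model of bump fatigue using the article's hypothetical Arbitrary Units (AU).
# None of these numbers are measurements; they only illustrate the theory.

BUMP_BUDGET_AU = 100      # stress a critical bump can absorb before it cracks
POWER_CYCLE_AU = 4        # cost of one power-on (heat) cycle
GAMING_SESSION_AU = 15    # cost of one hard gaming session


def days_until_failure(power_cycles_per_day, gaming_sessions_per_day=0):
    """Days the stress budget lasts at a constant daily usage (hypothetical helper)."""
    daily_stress = (power_cycles_per_day * POWER_CYCLE_AU
                    + gaming_sessions_per_day * GAMING_SESSION_AU)
    if daily_stress <= 0:
        return float("inf")  # a part that is never stressed never wears out in this model
    return BUMP_BUDGET_AU / daily_stress


desktop = days_until_failure(power_cycles_per_day=1)    # 100 / 4
laptop = days_until_failure(power_cycles_per_day=10)    # 100 / 40
gamer = days_until_failure(power_cycles_per_day=1, gaming_sessions_per_day=1)  # 100 / 19

print(desktop, laptop, round(gamer, 2))  # prints: 25.0 2.5 5.26
```

The absolute figures mean nothing, but the ratio does: under these assumptions the laptop burns through its budget ten times faster than the desktop.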
When Dell, HP and others announce a BIOS 'fix', the reason it is so humorous is that all they are doing is lowering the amount of thermal stress on the chips when the fan would not normally be on. When the fan is going full tilt without the 'fix', the new 'updated thermal profiles' won't make a difference. When the fans are normally off or on low, the profiles will essentially lessen the stress from a four to a three. It is just there to allow the laptop to live through the warranty period so the companies don't have to pay for the fix. After that, if the defective chips burn out, it isn't their problem. The 'fix' doesn't fix anything at all.
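The same hypothetical Arbitrary Units show why the 'fix' only delays things. The sketch below assumes a fan-off or fan-low heat cycle costs 4 AU before the BIOS update and 3 AU after it, as described above, and that a full-tilt gaming session stays at 15 AU either way. The function name and the cycle mixes are illustrative assumptions, not anything Dell, HP or Nvidia have published.

```python
# Toy comparison of bump lifespan before and after the BIOS 'fix'.
# All costs are the article's hypothetical Arbitrary Units (AU), not measurements.

BUMP_BUDGET_AU = 100
LOW_FAN_CYCLE_AU = 4          # heat cycle with fans off or low, old thermal profile
LOW_FAN_CYCLE_FIXED_AU = 3    # same cycle with the updated thermal profile
FULL_LOAD_SESSION_AU = 15     # fans already at full tilt; the update changes nothing


def lifespan_gain(low_fan_cycles, full_load_sessions):
    """Ratio of lifespan after the update to lifespan before it (hypothetical helper)."""
    if low_fan_cycles <= 0 and full_load_sessions <= 0:
        raise ValueError("need at least one heat cycle or gaming session per day")
    before = low_fan_cycles * LOW_FAN_CYCLE_AU + full_load_sessions * FULL_LOAD_SESSION_AU
    after = low_fan_cycles * LOW_FAN_CYCLE_FIXED_AU + full_load_sessions * FULL_LOAD_SESSION_AU
    # Lifespan is budget / daily stress, so the budget cancels out of the ratio.
    return before / after


print(round(lifespan_gain(10, 0), 2))  # light use only
print(round(lifespan_gain(9, 1), 2))   # mostly light use, one gaming session
print(round(lifespan_gain(0, 1), 2))   # gaming only, fans always at full tilt
```

In this model the best case is a part that lasts a third longer, and a machine that spends its life with the fans screaming gains nothing. The budget is still finite either way.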
In the end, it comes down to Nvidia screwing up badly on package engineering and testing, then trying as best it can to bury the problem while passing the buck. It appears that every Nvidia 65nm and 55nm part with high lead bumps and/or low Tg underfill is defective. It is just a question of how defective they are, and when they will die.
As far as we are able to tell, contrary to Nvidia's vague statements blaming suppliers, there are no materials defects at work here. Every material it used lived up to the claimed specs, and every material it used would have done the job if kept within the advertised parameters. Nvidia's engineering failures put undue stress on the parts, and several failures compounded to make two generations of defective parts. The suppliers and subcontractors did exactly what they were told. Nvidia just told them to do the wrong thing.
When it started talking about this, Nvidia failed crisis management 101, and the coverup shows it doesn't care about consumers, just its bottom line. NV is doing exactly the wrong thing for the wrong reasons, and the lawyers circling with class action paperwork in hand are going to eat it alive.
The last time you had such a huge batch of defective GPUs, the company that did it swore up and down – just like Nvidia – that there was no problem despite forums filled with evidence to the contrary.
A few weeks later, they turned around and admitted there was a problem, and took a $1.1 billion charge, placating customers and fending off lawsuits.
You know that as the Xbox 360 Red Ring of Death.
I wonder why Nvidia can't be that smart? µ