Sat 22 Nov 2008

Edited by Paul Hales

Published by Incisive Media Investments Ltd.

Why Nvidia's chips are defective

Part One A long and complex story

This is the first part of a series of three articles getting to the nub of Nvidia's graphics chip woes. The series is the result of months of research conducted by diligent INQhack Charlie Demerjian, despite an in-box stuffed full of abuse. Part Two can be found here and Part Three is here.

NVIDIA HAS RECENTLY been saying a lot about how its chips are not bad, and giving people reasons why the problem is contained. Unfortunately, these disingenuous half-truths don't stand up to an explanation of why this problem is happening.

The problem is extremely complex and defies a simple explanation. It involves multiple poor choices, multiple engineering failures, and likely a few bad accounting choices. This piece could also have been entitled: "More than you ever wanted to know about bumping, and then some: How not to do things". But we will simplify the science and technical details as much as possible to make it accessible, so some things may be oversimplified.

The defective parts appear to make up the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether these parts are defective, it is simply what the failure rate of each line is, with field reports on specific parts hitting up to 40 per cent early life failures. This is obviously not acceptable.

The end result of the failures is that the joint cracks between the bump and the substrate, not on the bump-to-die side. When this happens to a signal bump, it is game over for the GPU or MCP. What are a bump, a die and a substrate? Why is it happening? That is a long and technical story.

A Via CN chip, note the die in the centre

First, let's start out with some terminology, illustrated here by the lovely and talented Via CN/Nano chip. As you can see, the total package is about the same area as a US quarter. The most important part is the black square at the centre, that is the die, or the silicon chip itself. The green fibreglass-like part around it is the substrate, a complex multi-layered organic material that routes signals from the pads on top to the pins on the bottom, and serves as an attachment point for the die and various passive components. Those are the little silver things around the edges.

The die itself looks a little rough around the edges, but in reality it is very angular. It has four corners at 90 degree angles, and this one is almost square. Some, like the Intel Atom for example, are much more rectangular. The blurry edges are due to a material called the underfill. It looks like glue seeping from the edges, and serves as mechanical support for the die-to-substrate bonds and as a moisture barrier to protect the bumps.

Via CN about as thick as a quarter

The parts you don't see are the bumps, and they are the most critical of all. This type of packaging is called flip-chip because the connectors between the die and the substrate are put on the bottom of the die, and it is flipped over onto the package. The connectors are called bumps, and they are literally little balls of solder. A typical chip that is a little more than a centimetre on a side might have over 1000 bumps on it, so spacing is incredibly small and tolerances amazingly tight.
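To put a rough number on that spacing, here is a back-of-the-envelope sketch. It assumes the bumps are spread in a uniform square grid over the whole die, which real layouts are not, and the 10mm side and 1000 bump count are simply the round figures from the paragraph above. The function name is made up for illustration.

```python
import math


def approx_bump_pitch_um(die_side_mm: float, bump_count: int) -> float:
    """Centre-to-centre spacing, in micrometres, if bump_count bumps
    were laid out in a uniform square grid over a square die.

    Assumption: uniform grid over the full die area. Real chips cluster
    bumps unevenly, so treat the result as an order-of-magnitude figure.
    """
    if die_side_mm <= 0 or bump_count <= 0:
        raise ValueError("die side and bump count must be positive")
    area_per_bump_mm2 = (die_side_mm ** 2) / bump_count
    return math.sqrt(area_per_bump_mm2) * 1000.0  # mm -> micrometres


if __name__ == "__main__":
    pitch = approx_bump_pitch_um(die_side_mm=10.0, bump_count=1000)
    print(f"approximate pitch: {pitch:.0f} micrometres")
```

With those round numbers the spacing comes out at roughly a third of a millimetre between bump centres, and denser parts will be tighter still.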

As you can see, the package is about the same height as a quarter as well, so the vertical tolerances are also pretty slim. The bumps act like pins on a normal chip, they carry signals, power and ground to and from the die. They also are the primary attachment mechanism of the die to the substrate. The precision needed to put these things together should not be underestimated.

Those are the biggest players in our little drama, now let's move on to some basic physics and related science. Chips consume power, and in return they give you heat and a few electrons in the right places, occasionally they also give you a flash of light and smoke as well, but few chips do that twice. Heat is not an intended product, it is a consequence, and has to be carried away or bad things happen.

Modern chips consume electricity in an uneven manner, as different parts of the chip use power at different rates. Sometimes parts of the chip are never used at all for a given workload. If you have a modern GPU and don't game or are smart enough to not run Vista, you will likely never touch the transistors that do all the 3D work. Think about it this way, there are hot spots on the chip as well as cold spots, it is uneven and changing constantly.

A typical IR photograph of a multi-core CPU

Related to this is the fact that the chip uses electricity in a non-uniform manner. Parts that are heavily used pull much more current than idle parts, and once again, those parts change over time. Some bumps may pull a lot of Amps, others may pull very few, and this again changes over time and use. The bumps also have a limited current capacity each, too much and they melt or burn out, so there are far more than are strictly needed to supply the chip with power.

The idea is to make sure no one bump will ever reach the maximum current it can handle. This is done by putting in more power bumps on the die in places that use high power than are needed from an average current point of view. If things are done right, no single bump will ever exceed the maximum current it can deliver.
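A minimal sketch of that sizing logic follows. Everything in it is illustrative: the region names, the current figures and the 50mA per-bump budget are assumptions picked for the example, not Nvidia data. The point is only that sizing by peak current calls for more bumps than sizing by average current would.

```python
def bumps_needed(current_ma: int, budget_per_bump_ma: int) -> int:
    """Smallest number of bumps that keeps every bump at or under its budget.

    Currents are whole milliamps so the ceiling division stays exact.
    """
    if current_ma < 0 or budget_per_bump_ma <= 0:
        raise ValueError("current must be >= 0 and budget must be > 0")
    return -(-current_ma // budget_per_bump_ma)  # integer ceiling division


# Assumed budget: half of a hypothetical 100 mA per-bump limit, for margin.
BUDGET_MA = 50

# Hypothetical regions of a die: (average draw, peak draw) in milliamps.
REGIONS = {
    "3d_shaders": (4000, 12000),
    "memory_io": (2000, 3000),
    "display": (400, 500),
}

if __name__ == "__main__":
    for name, (avg_ma, peak_ma) in REGIONS.items():
        by_avg = bumps_needed(avg_ma, BUDGET_MA)
        by_peak = bumps_needed(peak_ma, BUDGET_MA)
        print(f"{name}: {by_avg} bumps by average, {by_peak} bumps by peak")
```

The gap between the two counts is largest for the block whose load swings the most, which is where the extra power bumps go.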

The defective Nvidia chips use a type of bump called high lead, and Nvidia is now transitioning to a type called eutectic, see here and here. Eutectic materials have two important properties: they have a low melting point, and all components crystallize at the same temperature. This means they are easier to work with, and form a good solid bond. Eutectic bumps may have lead in them, or they may not. Some are gold/tin, others are lead based; it depends on what characteristics you want, and how much you want to pay. Eutectic is a property, not a formula.

Most if not all substrates use eutectic pads to attach the bumps to as well. If you use a eutectic pad with a eutectic bump, you get a much better connection than you do if you use a high-lead bump with a eutectic pad. This is reflected in much higher yields, lower assembly costs, and a physically stronger connection as well. At this time, we have no good explanation as to why Nvidia chose to go the high-lead bump on eutectic pad route.

High-lead bumps have a much higher current capacity than eutectic bumps. When power is run through eutectic bumps, you also get an effect called electromigration. This means that some of the materials are essentially pushed around by the current, and you get voids in the bump. These voids lessen the capacity of the bump, and eventually they burn out.

The more current you run through a eutectic bump, the quicker the electromigration. If you keep the current to a reasonable level, the time it takes for this to happen will be so long it isn't worth worrying about. This is why chip vendors say that upping the voltage will shorten the lifespan of parts, it literally does cause them to burn out quicker.
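One common empirical model for this trade-off is Black's equation, which puts the median time to failure at A × J^(-n) × exp(Ea / kT), where J is current density and T is absolute temperature. This article does not cite it, so treat the sketch below as an outside illustration. The exponent n = 2 and activation energy of 0.8 eV are assumed textbook-style values, not measurements of any Nvidia bump, and the function name is invented for the example.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_lifetime(current_ratio: float,
                      temp_c: float,
                      ref_temp_c: float,
                      n: float = 2.0,
                      activation_ev: float = 0.8) -> float:
    """Lifetime relative to a reference operating point, per Black's equation.

    current_ratio: new current density divided by the reference one.
    n and activation_ev are ASSUMED illustrative values; real figures
    depend on the solder alloy and have to be measured.
    """
    if current_ratio <= 0:
        raise ValueError("current_ratio must be positive")
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    current_term = current_ratio ** (-n)
    thermal_term = math.exp(
        (activation_ev / BOLTZMANN_EV_PER_K) * (1.0 / t - 1.0 / t_ref)
    )
    return current_term * thermal_term


if __name__ == "__main__":
    # 20 per cent more current at the same temperature.
    print(relative_lifetime(1.2, temp_c=100.0, ref_temp_c=100.0))
    # The same extra current, plus a 10 degree hotter bump.
    print(relative_lifetime(1.2, temp_c=110.0, ref_temp_c=100.0))
```

Under these assumptions extra current hurts twice: once through the current term, and again because the added heat raises the temperature term.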

On the good side, eutectic bumps are generally more flexible than high-lead ones. This means they are a bit more forgiving to stress. Some forces that would fracture a high-lead bump may be absorbed by a eutectic one without problems.

Bumps overall are a multi-dimensional trade-off between cost, assembly yield, current capacity and mechanical resilience among other things. To call it a complex mess is being overly kind, package engineering is not for the faint of heart.

Comments

Thanks!

Good article,
posted by : Kef, 01 September 2008

How about ATI?

I work at a UK based computer hardware supplier and we see roughly the same amount of RMAs for Nvidia and ATI cards. I don't doubt that Nvidia might have cut corners/made bad engineering choices, but is there any firm evidence that ATI are not doing the same thing themselves?

Looking around various hardware forums I see no evidence of mass failure of Nvidia cards. Laptops are obviously seeing a problem but I am not sure that the same thing can be said for desktop parts.
posted by : bumwizard, 01 September 2008

Very Informative

i like the way u wrote this article without being too sarcastic or critical of Nvidia, and instead sticking to explaining the facts.
posted by : wishingwell, 01 September 2008

Wait for it...

"whine/moan Charlie is biased against nVidia blah blah..."

... thank you, once again, for a clearly reasoned and well thought out look at the reality of the situation, which will inevitably be cited by ignorant fanbois as blatant bias.
posted by : Mike, 01 September 2008

wow!

wow!, I am surprised at the technical know-how of Charlie in packaging technology.

Has intel started giving prepared articles also to publish, in addition to lots of money to Inquirer???
posted by : rj, 01 September 2008

Makes sense

Great article Charlie!
I like such details!
posted by : Michi, 01 September 2008

Perplexing.

Meanwhile people discovered that furmark, which is used to test graphics cards, suddenly runs slow on ATI 4850/70 cards with the latest drivers, and then they found out that those drivers specifically recognise the furmark exe and slow it down, because apparently the 4850 overheats and according to some actually dies sometimes when running it.
In other words the ATI cards cannot run at 100% it seems, now when will we read about that on the inq? Perhaps in an article named 'why ATI cards are defective'?
And why didn't that bit of news show up yet on the inq? At least charlie has the excuse of sleeplessness and jetlag and distraction from the Intel IDF thing, but not everybody at the inq does surely.
posted by : W.-, 01 September 2008

The how and the why...

Well well....

Thanks Charlie for giving us the technical reasons for how and why NV 55nm and 65nm chips are failing. It should be pretty obvious by now to NV fanboys that these reports are based on facts and science, not "NV Hate". Bad chip designs from the beginning. It's all going to come out, no matter how NV tries to spin it. Watch and see.
posted by : Eric, 01 September 2008

Good work

Finally some detailed explanation as to why you've been ranting about Nvidia soo much.
posted by : Alex, 01 September 2008

Damaged by Heat.

This is one of better articles Charles has written on Complex subject. Simple area is that chip only does one of four things. It inverts voltage, it passes voltage, it converts voltage to only high or only low string. That's it, it depends on instruction to comparison unit for each blast.

How to keep chip stable to perform those simple functions that then go to output of memory or device is question. Chip isn't so new to need deep understanding of physics to make it work, yet it helps in knowing why it doesn't work.

You mention high voltage & high resistance, good. Both Culprits, Maybe in with Mike Magee. I don't know. Gags are only funny when things are right.

Third culprit in this story is unmentioned in detail, although mentioned. Beyond temperature of crystallization of solder, there's heat itself migrating to new transistors during Manufacture. That is reason Modern CPU has mechanical connection by grippers for Pins.

Maybe at such smaller scale, some critical transistors are being preburnt, forcing higher voltage problems, leakage El Maximo & just warmed up like toast, crispy edges of nodules of baked metal & substrate cyanide mixed. Making Voltage passage irregular due to Resistance variations throughout on transistor by transistor basis. Tin melts about 450F, Lead melts at 620F, So poor little Dralins' are Being Fried before warm mellow constant cooking. By being too close to solder point, or solder point too Hot. Or cooling of solder NOT deep enough Nor Strong enough. Say after few good hacks machinery holding metal contacts gets too hot & too Much heat passes into gpu transistors.

Reverand Tom States: Ashes to Ashes & Defectives to Trash.
drashek
posted by : Doc_Tom, 01 September 2008

yawns

The AMD advert on these pages speaks volumes.
posted by : John, 01 September 2008

I don't get it...

First of all, nice explanation of the tech obstacles and compromises of CPU/GPU engineering.

The problem is with all your hand waving and prancing around angrily I am still not seeing ANY significant failure rates in nVidia video cards.

I have over 100+ desktops all running high performance nVidia GPUs and many with nVidia mainboard chipsets all running hi res graphic workstations.

Where are the failures? Why am I not seeing increased failure rates?

This is the problem with your entire campaign against nVidia and their engineering choices. Engineering choices that every high current/temp chip manufacturer must face.

So listen Chuck, where's the meat?

CD Baric
posted by : CD Baric, 01 September 2008

Some of you guys are pretty dense

"But, but the AMD ad that is (or isn't in my case) at the top of the page means that the Inq is obviously biased".

Give me a break. The Inq is so, so kind to AMD and its Phenomenal failure. nVidia made a poor engineering decision. Period. They have a lot of bad chips out there, you're just seeing the laptop stuff die first because it's a much more stressful environment with a lot more heat cycling.

As for the 4850, that's more a case of trying to cool a card that should have had a dual slot cooler with a single slot cooler. That, should it become a major problem, could be remedied a hell of a lot easier than nVidia's current mea culpa.
posted by : Nate, 02 September 2008

Great Article...

Charlie, This article along with part two was great. Thanks for the technical explanation of what is going on. One question for the general public, though... Why not just use Liquid Cooling and lessen your chances of thermal failure?
posted by : John, 02 September 2008

nice... but...

Nice article, but please, when trying to give evidence of the size of something, could you use a non nation specific scale.

I have no idea what size a quarter is, except it's going to be in the coin size region. If it's the size of a UK 5p, or a UK £2 coin is unknown to me.

Could you try a universal scale, say a ruler. I know there are two scales for this, metric and imperial, luckily most rulers I have seen cater for this, showing both scales.

If the reporter responsible is in the US and unable to obtain such a multi-scale ruler, I will happily send him one!
posted by : Steve, 02 September 2008

Looking forward to Part 3

Without a lot of measuring and testing, one can't be sure Charlie is correct. However, the two verifiable facts - the $200 million USD charge and the PCNs - do fit his explanation nicely.

I'm not sure what joke drashek found - no resistance equals no heat, increasing resistance or voltage means increased heat. This isn't your "Warning: 10,000 Ohms" sniggering opportunity.

And these fanbois - classic case studies in deviant psych. Shouldn't you all be at your religion classes discussing how we should all think about more important issues?
posted by : Tam Lin, 02 September 2008

Come again?

I don't know if anyone else is noticing this, but if Charlie wrote this article, he must have slept at a Holiday Inn the night before. This writing is clearly not representative of his style or content. Not only that, but suddenly he's an engineer...?
It's a plausible flaw, but I don't see the credible evidence.
posted by : Zooterboy, 02 September 2008

Good Job

Great article. You stuck to the hard facts and not even the hardest NV fanboy can argue the truth. NV messed up, and now we know why. Now, if only they knew.....lol

Seriously though, I think Charlie here knows more about chips than the current Supervisor at NV. Go apply for a job and save them!
posted by : RichasB, 02 September 2008

How come nobody else is complaining?

I don't see any firms stating 40% failure rates on any cards (AMD or NV). Surely that would be HUGE news.

Nothing you've explained proves anything about NV's parts. Still just opinion crap that isn't provable. List sources at companies. Or better yet post their emails to you that are complaints about NV failures so we can read them for ourselves.

Ohh, that's right, you have no evidence, just more whining. Go back to handing out those flyers again...LOL. Get a few friends, your 3 person protest didn't work...ROFL. You're not the only one with connections in this business and NONE of them are writing this stuff or saying it. IF NV had 40% failures on their cards we'd all know, and on top of that they would have been taking a BILLION charge, not 196mil (confirmed as a ONE TIME ONLY charge during conference call - or did you miss that?). :)
posted by : The Jian, 02 September 2008

Don't people read?

Apparently a lot of people who went through the effort of commenting on this article failed to read the previous articles by chucky on this topic.

His last article claimed that more chips are defective due to the exact same reason as the previous chips ... that 'same' reason being related to the replacement of the solder even though the original reason was the very different relation to the replacement of the underfill material.

Chucky also has not acknowledged that Nvidia must use low lead materials to be RoHS compliant, simply to be allowed to sell Nvidia parts into some markets.

Chucky really has no idea if the bumps or underfill are causing the chip failures, he has just observed the blind correlation between the material switching and shipping of supposedly fixed products.

This is no different than the natural health product scams or the general scam of religion, in that they all use coincidence as proof of effectiveness (or defectiveness).

Readers should also be reminded that Chucky not only kept writing about Vista, but kept using Vista long long after his long brutal rant about how he would never use or write about Vista ever again.

Chucky is just trying to hide the shame of his Vista lies with the shame of pretending he knows anything at all about the high failure rates of recent Nvidia products.

It is quite apparent that Chucky's biggest experience with any lead containing product was the paint chips he gnawed off the walls of his childhood home.
posted by : Ken, 02 September 2008

Old iBooks

if you lot can remember as far back as 2003, there was a large scale replacement of the white G3/G4 iBook due to the GPU BGA bump detachment over time. I personally got Apple to change a 600MHz and a 1GHz.
posted by : Davey, 02 September 2008

Laptop / desktop

There's a good reason why desktop parts are not *yet* showing up in large quantities: they aren't as stressed as the laptop parts.

The desktop parts aren't cycled as often and are generally more effectively cooled. It doesn't make them any less likely to fail - it simply means that the NV desktop chips are on a longer fuse before the bomb detonates, that's all.
posted by : Jonathan, 02 September 2008

Twice resistance=Twice Heat

Heat Lowers Resistance, As light Bulb has very high resistance, yet when it approaches red hot, resistance goes down & final white hot tungsten has nearly NO resistance, few Ohms. Yet if item has 2 ohms resistance, then made to 4 ohm, twice heat is generated at 4 ohm than two in same circumstance. Maybe HIGH K gates are solution. Made for Heat.

Cyanide is Metal, yet if Manufacturing Process is flawed, resistance of Metal greatly increases or worse goes down to nil, allowing way too much current to pass.
I am just speculating that somewhere in RED Zone in photo, transistors have been corrupted in Manufacture & it takes but few to screw up transfer to internal chip memory & become much hindered by inherent defects. Gasified Metal recondensed into aminoliquid slime, changing Metal characteristics, at Gate. Over Blasting Next Gates with Flawed Signal.

Plug in Video memory was standard, then advances in manufacturing made it hardwired again. Funny that CPU isn't advanced back to hardwired, except 1,300 connection points is bit much to solder.
Perhaps snap in, grip in, slot chips have something to offer in this problem, to avoid manufacturing heat. Basic first thoughts, Tam.
drashek
posted by : OMG, 02 September 2008

Bump

Just had to share an "attaboy" for the more detail and less flame thrower.
posted by : Vinster, 02 September 2008

Ignorant Crank

Thanks for another worthless opinion piece! Ever gonna put some substance into these fluffy articles???
"Mac - I'm not a Mac! Why doncha turn your lights on, ya mo-ron"
posted by : Grunchy, 02 September 2008

OMG you ignorant

Heat raises resistance in most materials. True, there are certain materials, like semiconductors, with a negative temperature coefficient (NTC), but the vast majority are the opposite. Bill Hewlett (of Hewlett-Packard) first exploited tungsten's POSITIVE temperature coefficient to stabilize an oscillator by putting a light bulb in the feedback loop. That was in Hewlett-Packard's very first product: an audio oscillator. The rest is history.

The rest of your post is pretty much gibberish. High-K gates have lower leakage due to better oxide performance; they don't handle heat better, they dissipate less heat at idle, is all.
posted by : TalkinSense, 02 September 2008

Very scientific

A very long and detailed story but could be explained easier - bad choice of solder makes joints break early.



posted by : unfortunate, 03 September 2008

Good Reason for High Lead Bumps

Nice article, but let me add the good reason for high lead bumps. Eutectic solder has a much lower melting temperature, and that is the traditional alloy for attaching the package board to the mother board. You want high lead for attaching the die to the package so that those bumps don't melt when the package is attached to the mother board. Of course this is all getting thrown out the window in the face of RoHS compliance, so be ready for more problems in the future as well as higher costs.
posted by : John Mardinly, 03 September 2008

quick question

You mention 40% of specific parts have early failure rates. Which specific parts, and what is considered early failure? This would be good info to give everyone a baseline.
posted by : Bounty, 03 September 2008

simple question...

eutectic isn't the way to go then... neither is high-lead on eutectic...
why didn't they just switch to high-lead organic substrates? That seems like it would solve the current density, stress, and MTBF problems all at once.

Let me guess, RoHS considerations?
posted by : Ken Stein, 04 September 2008

Technical corrections-nVidia

Wanted to correct some statements on the nVidia article, at a risk of getting too technical.
1. High lead bumps (typically 95/5 Pb/Sn) are more ductile than "eutectic" (~ 63/37 Sn/Pb). Eutectic bumps are stronger and more resistant to fatigue ('broken fork' model) but this increased stiffness translates more stress to the GPU and can lead to delamination in the multi-layer stack of dielectric and copper that interconnects all the circuits on the GPU. Intel has moved to plating copper columns that are capped with a lead-free solder on recent 45 nm processors. The stiff copper column transfers a lot of stress to the column - processor interface, so Intel had to do a lot of engineering work to make this interface reliable.
2. I believe nVidia used eutectic on older technology GPUs, and changed to high lead flip chip bumps when they went to 90 nm and 65 nm technology nodes for the GPUs. This was done specifically to try to avoid cracking / delamination between the bump and the interconnect stack on the GPU. Bonding a high lead bump to the eutectic solder cladding on the substrate bond pad results in the molten eutectic solder dissolving some of the high lead bump, creating a tin-rich phase bonding the remaining unaffected high lead bump volume to the bond pad on the substrate. There is nothing wrong with this connection.
3. Flip chip bumps arrayed on a GPU on a pitch of ~180 - 200 micrometers are typically sized such that they can handle ~100 milliamps per bump with a junction temperature of ~100 Celsius for about 10 - 20 years of use.
4. One aspect that is not noted in the story is that the thermal management solution (heatsink, how the heat sink is pressed against the back side of the GPU, how much pressure is applied to the GPU back side, how the GPU package is supported on the board, etc.) has a tremendous impact on the integrity of the flip chip bump interconnect. The greater the pressure on the GPU, the greater the potential damage to the interfaces of the bump to the GPU and the bump to the substrate pad.
5. The stated risk of the polyIMIDE (not polyamide) coating on the GPU being torn from the die is not the risk imposed by a high Tg / high stiffness (high modulus). If the underfill is too stiff, it may protect the bumps from any form of degradation quite nicely, but it may succeed in causing the multilayer stack of insulating layers and copper to delaminate from the GPU, typically starting in the corners of the GPU. So it is correct that the underfill must be carefully chosen based upon the material properties of thermal expansion (above and below the Tg where it softens a lot), the modulus of elasticity (stiffness) over the intended temperature range of use, and Tg.
6. So since the thermal solutions in laptops and desktops are so different, while the normal (as supplied) junction temperatures can be expected to be about the same, the use of high pressure on the die backside, with a junction temperature higher than the Tg (underfill softens a lot), with a lot of power cycling, aggravated by electromigration (atoms diffuse in the direction of the electron flow), can lead to accelerated degradation of the flip chip bump bond interfaces.
7. Flux is used to help the flip chip solder bumps flow and bond well, and the residue of this flux is often not removed before the underfill is applied, or if the residue is removed with cleaning processes, some residue is commonly trapped between the solder connection and the green solder mask material on the surface of the substrate. This residue can melt at low temperatures and further aggravate the reliability of the interconnection.

Flip chip packaging is complicated....
posted by : Pkgg Engr, 09 September 2008