The Inquirer-Home

Why Nvidia's chips are defective

Part One A long and complex story
Monday, 1 September 2008, 20:00

This is the first part of a series of three articles getting to the nub of Nvidia's graphics chip woes. The series is the result of months of research conducted by diligent INQhack Charlie Demerjian, despite an in-box stuffed full of abuse. Part two can be found here and Part Three is here.

NVIDIA HAS RECENTLY been saying a lot about how its chips are not bad, and giving people reasons why the problem is contained. Unfortunately, these disingenuous half-truths don't stand up to an explanation of why this problem is happening.

The problem is extremely complex and defies a simple explanation. It involves multiple poor choices, multiple engineering failures, and likely a few bad accounting choices. This piece could also have been entitled: "More than you ever wanted to know about bumping, and then some: How not to do things". But we will simplify the science and technical details as much as possible to make it accessible, so some things may be oversimplified.

The defective parts appear to make up the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective, it is simply the failure rates of each line, with field reports on specific parts hitting up to 40 per cent early life failures. This is obviously not acceptable.

The end result of the failures is that bumps crack between the bump and the substrate on a chip, not on the bump to die side. When this happens to a signal bump, game over for the GPU or MCP. What is a bump, die and substrate? Why is it happening? That is a long and technical story.

A Via CN chip, note the die in the centre

First, let's start out with some terminology, illustrated here by the lovely and talented Via CN/Nano chip. As you can see, the total package is about the same area as a US quarter. The most important part is the black square at the centre, that is the die, or the silicon chip itself. The green fibreglass-like part around it is the substrate, a complex multi-layered organic material that routes signals from the pads on top to the pins on the bottom, and serves as an attachment point for the die and various passive components. Those are the little silver things around the edges.

The die itself looks a little rough around the edges, but in reality it is very very angular. It has four corners at 90 degree angles, this one being almost square. Some, like the Intel Atom for example, are much more rectangular. The blurry edges are due to a material called the underfill, it looks like glue seeping from the edges, and serves as mechanical support for the die to substrate bonds and a moisture barrier to protect the bumps.

Via CN about as thick as a quarter

The parts you don't see are the bumps, and they are the most critical part. This type of packaging is called flip-chip because the connectors between the die and the substrate are put on the bottom of the die, and it is flipped over onto the package. The connectors are called bumps, and they are literally little balls of solder. A typical chip that is a little more than a centimetre on a side might have over 1000 bumps on it, so spacing is incredibly small and tolerances amazingly tight.

As you can see, the package is about the same height as a quarter as well, so the vertical tolerances are also pretty slim. The bumps act like pins on a normal chip, they carry signals, power and ground to and from the die. They also are the primary attachment mechanism of the die to the substrate. The precision needed to put these things together should not be underestimated.

Those are the biggest players in our little drama, now let's move on to some basic physics and related science. Chips consume power, and in return they give you heat and a few electrons in the right places, occasionally they also give you a flash of light and smoke as well, but few chips do that twice. Heat is not an intended product, it is a consequence, and has to be carried away or bad things happen.

Modern chips consume electricity in an uneven manner, as different parts of the chip use power at different rates. Sometimes parts of the chip are never used at all for a given workload. If you have a modern GPU and don't game or are smart enough to not run Vista, you will likely never touch the transistors that do all the 3D work. Think about it this way, there are hot spots on the chip as well as cold spots, it is uneven and changing constantly.

A typical IR photograph of a multi-core CPU

Related to this is the fact that the chip uses electricity in a non-uniform manner. Parts that are heavily used pull much more current than idle parts, and once again, those parts change over time. Some bumps may pull a lot of Amps, others may pull very few, and this again changes over time and use. The bumps also have a limited current capacity each, too much and they melt or burn out, so there are far more than are strictly needed to supply the chip with power.

The idea is to make sure no one bump will ever reach the maximum current it can handle. This is done by putting in more power bumps on the die in places that use high power than are needed from an average current point of view. If things are done right, no single bump will ever exceed the maximum current it can deliver.

The Nvidia defective chips use a type of bump called high lead, and are now transitioning to a type called eutectic, see here and here. Eutectic materials have two important properties, they have a low melting point, and all components crystallize at the same temperature. This means they are easier to work with, and form a good solid bond. Eutectic bumps may have lead in them, or they may not, some are gold/tin, others are lead based, it depends on what characteristics you want, and how much you want to pay. It is a property, not a formula.

Most if not all substrates use eutectic pads to attach the bumps to as well. If you use a eutectic pad with a eutectic bump, you get a much better connection than you do if you use a high-lead bump with a eutectic pad. This is reflected in much higher yields, lower assembly costs, and a physically stronger connection as well. At this time, we have no good explanation as to why Nvidia chose to go the high-lead bump on eutectic pad route.

High-lead bumps have a much higher current capacity than eutectic bumps. When power is run through eutectic bumps, you also get an effect called electromigration. This means that some of the materials are essentially pushed around by the current, and you get voids in the bump. These voids lessen the capacity of the bump, and eventually they burn out.

The more current you run through a eutectic bump, the quicker the electromigration. If you keep the current to a reasonable level, the time it takes for this to happen will be so long it isn't worth worrying about. This is why chip vendors say that upping the voltage will shorten the lifespan of parts, it literally does cause them to burn out quicker.
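The link between current and lifespan can be put in rough numbers. The sketch below uses Black's equation, a standard empirical model for electromigration that this article does not name. The exponent and activation energy are illustrative assumptions, not measured values for any real bump, and the current ratio stands in for current density on the assumption that the bump geometry is unchanged.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def relative_lifetime(current_ratio, temp_c, ref_temp_c, n=2.0, activation_ev=0.8):
    """Expected life relative to a reference bump, per Black's equation.

    MTTF is proportional to J**-n * exp(Ea / (k * T)).
    n and activation_ev are assumed, illustrative values.
    """
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    current_term = current_ratio ** -n
    thermal_term = math.exp(activation_ev / BOLTZMANN_EV * (1.0 / t - 1.0 / t_ref))
    return current_term * thermal_term


# Doubling the current at the same temperature quarters the expected life when n = 2.
print(relative_lifetime(2.0, 100.0, 100.0))  # 0.25

# Running hotter at the same current also shortens life (result is below 1).
print(relative_lifetime(1.0, 110.0, 100.0))
```

With these assumed parameters, life falls with the square of the current, which is why a modest overvolt can cost a disproportionate share of a part's lifespan.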

On the good side, eutectic bumps are generally more flexible than high lead. This means they are a bit more forgiving to stress. Some forces that would fracture a lead bump may be absorbed by a eutectic one without problems.

Bumps overall are a multi-dimensional trade-off between cost, assembly yield, current capacity and mechanical resilience among other things. To call it a complex mess is being overly kind, package engineering is not for the faint of heart.

FROM BUMP properties, we move on to thermal expansion of materials, and that is another piece to the puzzle. Most materials expand as they warm up. If you have ever seen a mechanic trying to free a stuck bolt, they usually heat the nut with a blowtorch, this expands the nut and loosens it. The same thing happens with the die and substrate. When you turn on a chip, it heats and expands a little. This expansion is not much, but it is measurable. The substrate also heats and expands.

The problem is that the die gets hot, and heats the substrate secondarily. The silicon on the die has one rate of thermal expansion, the substrate has another, basically they get bigger at different rates. To complicate things further, remember the uneven and changing heating bit above? Parts of the die heat up and expand differently from other parts of the die. This changes quite quickly while things are in use.
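To get a feel for the size of the mismatch, here is a minimal sketch using the linear expansion formula dL = alpha * L * dT. The silicon figure is a typical textbook value. The substrate coefficient and the temperature rise are assumptions picked for illustration, and the sketch pretends both parts warm up evenly by the same amount, which the paragraph above says is not really the case.

```python
def expansion_um(length_mm, cte_ppm_per_c, delta_t_c):
    """Linear growth in micrometres: dL = alpha * L * dT."""
    return length_mm * 1000.0 * cte_ppm_per_c * 1e-6 * delta_t_c


DIE_EDGE_MM = 10.0    # roughly "a centimetre on a side"
SILICON_CTE = 2.6     # ppm per degree C, typical figure for silicon
SUBSTRATE_CTE = 17.0  # ppm per degree C, assumed for an organic substrate
DELTA_T_C = 60.0      # assumed warm-up from idle to full load

# Measure from the die centre, so a bump at the edge sits half an edge away.
half_edge_mm = DIE_EDGE_MM / 2
die_growth_um = expansion_um(half_edge_mm, SILICON_CTE, DELTA_T_C)
substrate_growth_um = expansion_um(half_edge_mm, SUBSTRATE_CTE, DELTA_T_C)
mismatch_um = substrate_growth_um - die_growth_um

print(f"die grows {die_growth_um:.2f} um, substrate grows {substrate_growth_um:.2f} um")
print(f"an edge bump has to absorb about {mismatch_um:.2f} um of shear")
```

Under these assumptions the mismatch at the die edge comes out at a little over four micrometres, which does not sound like much until you remember how small the bumps are and how often the cycle repeats.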

The result? The bumps take a lot of stress, and it changes from second to second. This can be very accurately simulated, and you can engineer bump placement at points of lower thermal expansion and therefore lower stress. If you lose a power bump here and there, the chip will very likely survive, but if you lose a signal bump, game over. This is why bump placement is very important.

Engineering what bumps go where is a very complex process, and is done basically when the chip is laid out, near the end of the development process. You don't do it on a whim, you don't make pretty patterns because they are cool, you do it scientifically to minimise the potential for damage.

Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending, and then it cools and bends back. Eventually this thermal cycling kills chips.

Once again, if you did your engineering right, this won't happen in any timeframe that matters to mere humans, if it takes ten years of on and off switching to make it happen, once a day power cycling won't matter in our lifetimes. Chip makers tend to engineer for timelines like the ten-year horizon, and are pretty safe in assuming it will live for five years of casual use.
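The arithmetic behind that kind of lifetime budget can be sketched as follows. The cycle budget is a made-up number, and the scaling with temperature swing uses a Coffin-Manson style power law with an assumed exponent of 2. Neither the model nor the exponent comes from this article or from any Nvidia data.

```python
def years_to_failure(cycles_to_failure, cycles_per_day):
    """How long a fixed budget of thermal cycles lasts at a given usage rate."""
    return cycles_to_failure / (cycles_per_day * 365.0)


def scaled_cycles(base_cycles, base_swing_c, new_swing_c, exponent=2.0):
    """Coffin-Manson style scaling: cycles to failure vary as swing ** -exponent.

    The exponent is an assumed, illustrative value.
    """
    return base_cycles * (base_swing_c / new_swing_c) ** exponent


BUDGET_CYCLES = 10 * 365  # hypothetical: ten years at one power cycle a day

print(years_to_failure(BUDGET_CYCLES, 1))        # 10.0
print(years_to_failure(BUDGET_CYCLES, 4))        # 2.5
print(scaled_cycles(BUDGET_CYCLES, 40.0, 80.0))  # 912.5
```

The point of the sketch is that both the number of cycles per day and the size of each temperature swing eat into the budget, and the swing does so faster than linearly.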

If you recall, high-lead bumps are stiffer than eutectic and more prone to stress fractures. The high-lead-to-eutectic substrate bond is also weaker than a eutectic-to-eutectic bond. What is happening to Nvidia is that the substrate to bump joint is cracking, and the chips die. High lead bumps are a poor choice to use in this application.

One other bit to bring into the mix is underfill. If things were as simple as heat leads to cracking, no chips would work for any length of time. Underfill not only protects the bumps from moisture and contamination, but it also provides mechanical support as well. It is designed to take some of the stress that the bumps take, making them live longer.

Underfill can range from rock hard to soft and squishy, it depends on your application. The harder the underfill, the more mechanical support it provides, and the less stress the bumps take. Simple enough.

That brings us to another material, the Polyamide layer. When chips went to a low-K dielectric material, which is not the same as the high-K gate material, it proved a problem with packaging, bumps and underfill. The solution was to put a polyamide layer, sometimes called a stress layer, to cover the bottom of the chip. This prevents contamination and mechanical damage.

If you pick an underfill that is too soft, it doesn't provide you enough mechanical support for the bumps, they crack and your chip dies an early death. Pick one that is too hard and it rips the polyimide layer off. In the words of one packaging engineer talked to for this article, if you used too hard of an underfill, the chip "wouldn't survive the first heat cycle". The magic is in the middle, you have to pick a bowl of porridge, er, underfill, that is strong enough to provide the support you need, but not so strong as to rip layers off your chip. Like we said, package engineering is not for the faint of heart, but it can make baby bear happy.

That brings us to the billion dollar question, why not simply change bump types to eutectic if they are that much better, which they are, in some ways. The answer is in the current capacity, more specifically average current capacity. We mentioned this earlier, and the idea ties into the hot spots and functional units.

Take a hypothetical simple CPU that has integer and floating point units. If you are doing heavy int. work, the power bumps that supply that part of the chip will be loaded heavily and the FP bumps will not be doing much of anything at all. When FP load gets heavy, the opposite happens.

The layout of the bumps is designed so that neither set will be overloaded at peak times, and in fact won't get all that close to their maximum. To use completely made up numbers, take a bump that has a peak capacity of 1000mA, and for longevity you don't want to exceed 800mA, basically a 20 per cent safety margin.

If the chip's total current divided by the number of power bumps, ie the average current per bump, is 200mA, there are likely many bumps drawing 100mA and a few under heavily loaded areas that draw 600mA. This draw moves around with the work the chip is doing. Some may never break 100mA, others may be at 600mA for their entire lives. All are well below the 800mA long-life limit, much less the 1000mA max.
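The made up numbers above are easy to check. This is a minimal sketch only: the area names in the dictionary are invented for illustration and the currents are the hypothetical figures from the text, not data for any real chip.

```python
PEAK_MA = 1000        # made-up peak capacity per bump, from the text
SAFETY_MARGIN = 0.20  # 20 per cent
LIMIT_MA = PEAK_MA * (1 - SAFETY_MARGIN)  # long-life limit, 800mA

# Hypothetical per-bump draws in mA; the area names are invented.
bump_draw_ma = {"idle_area": 100, "average_area": 200, "hot_area": 600}


def headroom(draw_ma, limit_ma):
    """Fraction of the long-life limit a bump still has in reserve."""
    return 1 - draw_ma / limit_ma


for name, draw in bump_draw_ma.items():
    print(f"{name}: {draw}mA, headroom {headroom(draw, LIMIT_MA):.0%}")
```

Even the busiest bump in this example keeps a quarter of its long-life limit in reserve.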

The problem with eutectic bumps is that they have a lower current capacity, and the closer you get to it, the worse the problem of electromigration becomes. Let's pick a hypothetical eutectic bump that has a peak capacity of 500mA and the same 20 per cent safety margin, 400mA max for long life. If Nvidia wants to swap in eutectic bumps for the high lead they are using, there is a slight problem, the busiest bumps would be well over the current capacity of the new bumps.
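A quick sketch shows why a straight swap fails, using only the hypothetical capacities from the text (1000mA peak for high lead, 500mA for eutectic, 20 per cent margin) and an invented list of per-bump draws.

```python
import math


def overloaded_bumps(draws_ma, peak_ma, safety_margin=0.20):
    """Return the draws that exceed the long-life limit for a given bump type."""
    limit = peak_ma * (1 - safety_margin)
    return [d for d in draws_ma if d > limit]


def bumps_needed(draw_ma, peak_ma, safety_margin=0.20):
    """How many bumps must share a load so each stays within the long-life limit."""
    limit = peak_ma * (1 - safety_margin)
    return math.ceil(draw_ma / limit)


draws = [100, 100, 200, 200, 600, 600]  # hypothetical per-bump currents in mA

print(overloaded_bumps(draws, 1000))  # []
print(overloaded_bumps(draws, 500))   # [600, 600]
print(bumps_needed(600, 500))         # 2
```

With the same layout, the 600mA bumps exceed not just the 400mA long-life figure but the 500mA peak of the hypothetical eutectic bump, so each hot spot needs at least twice as many bumps sharing the load.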

If the chip actually powers up without letting the smoke out, the first time you fire up a massive game of Telengard, it will most assuredly go pop. In the rare case that the gods of luck are staring right at you and the thing doesn't fry immediately, electromigration will ensure it has the lifespan of a mayfly, basically worse than the current crop of defective Nvidia chips.

What do you do? You can either cut the power used by the GPU way way down, ie, clock it at a point where no one would ever buy it, or rearrange where the bumps go. The rearrangement is not a trivial thing, and may require moving large parts of the chip around, basically a partial relayout. This is expensive, time consuming, and likely can't be done and validated in the time the chip is on sale for.

The other option is basically just as bad, you need a power plane or power grid on the die. This is a metal layer that distributes power across the die, and it means adding a layer to the chip. That means expense, slightly lower yield, and can have other detrimental effects to power draw and clocking.

All of these things can be dealt with if you see this coming when you start making the GPU. It is pretty painfully obvious that Nvidia didn't, otherwise they wouldn't have used high lead bumps and gotten into the hole that they are in. They have switched to eutectic bumps, but given the way it is being done, and the supplier grumbles we are hearing, it appears to be very poorly planned. It will be interesting to see the lifespan of these new parts. µ

Part Two - The problem of underfill is here.

Comments
Thanks!

Good article,

posted by : Kef, 01 September 2008
How about ATI?

I work at a UK based computer hardware supplier and we see roughly the same amount of RMA's for Nvidia and ATI cards. I don't doubt that Nvidia might have cut corners/made bad engineering choices, but is there any firm evidence that ATI are not doing the same thing themselves?

Looking around various hardware forums i see no evidence of mass failure of Nvidia cards. Laptops are obviously seeing a problem but i am not sure that the same thing can be said for desktop parts.

posted by : bumwizard, 01 September 2008
Very Informative

i like the way u wrote this article without being too sarcastic or critical of Nvidia, and instead sticking to explaining the facts.

posted by : wishingwell, 01 September 2008
Wait for it...

"whine/moan Charlie is biased against nVidia blah blah..."

... thank you, once again, for a clearly reasoned and well thought out look at the reality of the situation, which will inevitably be cited by ignorant fanbois as blatant bias.

posted by : Mike, 01 September 2008
wow!

wow!, I am surprised at the technical know-how of Charlie in packaging technology.

Has intel started giving prepared articles also to publish, in addition to lots of money to Inquirer???

posted by : rj, 01 September 2008
Makes sense

Great article Charlie!
I like such details!

posted by : Michi, 01 September 2008
Perplexing.

Meanwhile people discovered that furmark, which is used to test graphics cards, suddenly runs slow on ATI 5850/70 cards with the latest drivers, and then they found out that those drivers specifically recognise the furmark exe and slow it down, because apparently the 4850 overheats and according to some actually dies sometimes when running it.
In other words the ATI cards cannot run at 100% it seems, now when will we read about that on the inq? Perhaps in an article named 'why ATI cards are defective'?
And why didn't that bit of news show up yet on the inq? At least charlie has the excuse of sleeplessness and jetlag and distraction from the Intel IDF thing, but not everybody at the inq does surely.

posted by : W.-, 01 September 2008
The how and the why...

Well well....

Thanks Charlie for giving us the technical reasons for how and why NV 55nm and 65nm chips are failing. It should be pretty obvious by now to NV fanboys that these reports are based on facts and science, not "NV Hate". Bad chip designs from the beginning. It's all going to come out, no matter how NV tries to spin it. Watch and see.

posted by : Eric, 01 September 2008
Good work

Finally some detailed explanation as to why you've been ranting about Nvidia soo much.

posted by : Alex, 01 September 2008
Damaged by Heat.

This is one of better articles Charles has written on Complex subject. Simple area is that chip only does one of four things. It inverts voltage, it passes voltage, it converts voltage to only high or only low string. thats it, it depends on instruction to comparrison unit for each blast.

How to keep chip stable to perform those simple functions that then go to output of memory or device is question. Chip isn't so new to need deep understanding of physics to make it work, yet it helps in knowing why it dosn't work.

You mention high voltage & high resistance, good. Both Culprits, Maybe in with Mike Magee. I don't know. Gags are only funny when things are right.

Third culprit in this story is unmentioned in detail, although mentioned. Beyond temperature of cystalization of solder, theres heat itself migrating to new transistors during Manufacturer. that is reason Modern CPU has mechanical connection by grippers for Pins.

Maybe at such smaller scale, some critical transistors are being preburnt, forcing higher voltage problems, leakage El Maximo & just warmed up like toast, crispy edges of nodules of baked metal & substrate cyanide mixed. Making Voltage passage irregular due to Resistance variations thruout on transistor by transistor basis. Tin melts about 490F, Lead melts at 620F, So poor little Dralins' are Being Fried before warm mellow constant cooking. By being too close to solder point, or solder point too Hot.Or cooling of solder NOT deep enough Nor Strong enough. Say after few good hacks machinery holding metal contacts gets too hot & too Much heat passses into gpu transistors. 

Reverand Tom States: Ashes to Ashes & Defectives to Trash.
drashek

posted by : Doc_Tom, 01 September 2008
yawns

The AMD advert on these pages speaks volumes.

posted by : John, 01 September 2008
I don't get it...

First of all, nice explanation of the tech obstacles and compromises of CPU/GPU engineering.

The problem is with all your hand waving and prancing around angrily I am still not seeing ANY significant failure rates in nVidia video cards.

I have over 100+ desktops all running high performance nVidia GPUs and many with nVidia mainboard chipsets all running hi res graphic workstations.

Where are the failures? Why am I not seeing increased failure rates?

This is the problem with your entire campaign against nVidia and their engineering choices. Engineering choices that every high current/temp chip manufacturer must face.

So listen Chuck, where's the meat?

CD Baric

posted by : CD Baric, 02 September 2008
Some of you guys are pretty dense

"But, but the AMD ad that is (or isn't in my case) at the top of the page means that the Inq is obviously biased".

Give me a break. The Inq is so, so kind to AMD and its Phenomenal failure. nVidia made a poor engineering decision. Period. They have a lot of bad chips out there, you're just seeing the laptop stuff die first because it's a much more stressful environment with a lot more heat cycling.

As for the 4850, that's more a case of trying to cool a card that should have had a dual slot cooler with a single slot cooler. That, should it become a major problem, could be remedied a hell of a lot easier than nVidia's current mea culpa.

posted by : Nate, 02 September 2008
Great Article...

Charlie, This article along with part two was great. Thanks for the technical explanation of what is going on. One question for the general public, though... Why not just use Liquid Cooling and lessen your chances of thermal failure?

posted by : John, 02 September 2008
nice... but...

Nice article, but please, when trying to give evidence of the size of something, could you use a non nation specific scale.

I have no idea what size a quarter is, except it's going to be in the coin size region. If it's the size of a UK 5p, or a UK £2 coin is unknown to me.

Could you try a universal scale, say a ruler. I know there are two scales for this, metric and imperial, luckily most rulers I have seen cater for this, showing both scales.

If the reporter responsible is in the US and unable to obtain such a multi-scale ruler, I will happily send him one!

posted by : Steve, 02 September 2008
Looking forward to Part 3

Without a lot of measuring and testing, one can't be sure Charlie is correct. However, the two verifiable facts - the $200 million USD charge and the PCNs - do fit his explanation nicely.

I'm not sure what joke drashek found - no resistance equals no heat, increasing resistance or voltage means increased heat. This isn't your "Warning: 10,000 Ohms" sniggering opportunity.

And these fanbois - classic case studies in deviant psych. Shouldn't you all be at your religion classes discussing how we should all think about more important issues?

posted by : Tam Lin, 02 September 2008
Come again?

I don't know if anyone else is noticing this, but if Charlie wrote this article, he must have slept at a Holiday Inn the night before. This writing is clearly not representative of his style or content. Not only that, but suddenly he's an engineer...?
It's a plausible flaw, but I don't see the credible evidence.

posted by : Zooterboy, 02 September 2008
Good Job

Great article. You stuck to the hard facts and not even the hardest NV fanboy can argue the truth. NV messed up, and now we know why. Now, if only they knew.....lol

Seriously though, I think Charlie here knows more about chips than the current Supervisor at NV. Go apply for a job and save them!

posted by : RichasB, 02 September 2008
How come nobody else is complaining?

I don't see any firms stating 40% failure rates on any cards (AMD or NV). Surely that would be HUGE news.

Nothing you've explained proves anything about NV's parts. Still just opinion crap that isn't provable. List sources at companies. Or better yet post their emails to you that are complaints about NV failures so we can read them for ourselves.

Ohh, that's right, you have no evidence, just more whining. Go back to handing out those flyers again...LOL. Get a few friends, your 3 person protest didn't work...ROFL. You're not the only one with connections in this business and NONE of them are writing this stuff or saying it. IF NV had 40% failures on their cards we'd all know, and on top of that they would have been taking a BILLION charge, not 196mil (confirmed as a ONE TIME ONLY charge during conference call - or did you miss that?). :)

posted by : The Jian, 02 September 2008
Don't people read?

Apparently a lot of people who went through the effort of commenting on this article failed to read the previous articles by chucky on this topic.

His last article claimed that more chips are defective due to the exact same reason as the previous chips ... that 'same' reason being related to the replacement of the solder even though the original reason was the very different relation to the replacement of the underfill material.

Chucky also has not acknowledged that Nvidia must use low lead materials to be RoHS compliant, simply to be allowed to sell Nvidia parts into some markets.

Chucky really has no idea if the bumps or underfill are causing the chip failures, he has just observed the blind correlation between the material switching and shipping of supposedly fixed products. 

This is no different than the natural health product scams or the general scam of religion, in that they all use coincidence as proof of effectiveness (or defectiveness).

Readers should also be reminded that Chucky not only kept writing about Vista, but kept using Vista long long after his long brutal rant about how he would never use or write about Vista ever again.

Chucky is just trying to hide the shame of his Vista lies with the shame of pretending he knows anything at all about the high failure rates of recent Nvidia products.

It is quite apparent that Chucky's biggest experience with any lead containing product was the paint chips he gnawed off the walls of his childhood home.

posted by : Ken, 02 September 2008
Old iBooks

if you lot can remember as far back as 2003, there was a large scale replacement of the white G3/G4 iBook due to the GPU BGA bump detachment over time. I personally got Apple to change a 600MHz and a 1GHz.

posted by : Davey, 02 September 2008
Laptop / desktop

There's a good reason why desktop parts are not *yet* showing up in large quantities: they aren't as stressed as the laptop parts.

The desktop parts aren't cycled as often and are generally more effectively cooled. It doesn't make them any less likely to fail - it simply means that the NV desktop chips are on a longer fuse before the bomb detonates, that's all.

posted by : Jonathan, 02 September 2008
Twice resistance=Twice Heat

Heat Lowers Resistance, As light Bulb has very high resistance, yet when it approachs red hot, resistance goes down & final white hot tungstun has nearly NO resistance, few Ohms. Yet if item has 2 ohms resistance, then made to 4 ohm, twice heat is generated at 4 ohm than two in same circumstance. Maybe HIGH K gates are solution. Made for Heat.

Cyandie is Metal, yet if Manufacturing Process is flawed, resistance of Metal greatly increases or worse goes down to nil, allowing way too much current to pass. 
I am just speculating that somewhere in RED Zone in photo, transistors have been corrupted in Manufacturer & it takes but few to screw up transfer to internal chip memory & become much hindered by inherent defects.Gasified Metal recondensed into aminoliquid slime, changing Metal characteristics, at Gate. Over Blasting Next Gates with Flawed Signal.

Plug in Video memory was standard, then advances in manufacturing made it hardwired again. Funny that CPU isn't advanced back to hardwired, except 1,300 connection points is bit much to solder. 
Perhaps snap in, grip in, slot chips have something to offer in this problem, to avoid manufacturering heat.Basic first thoughts, Tam.
drashek

posted by : OMG, 02 September 2008
Bump

Just had to share an "attaboy" for the more detail and less flame thrower.

posted by : Vinster, 02 September 2008
Ignorant Crank

Thanks for another worthless opinion piece! Ever gonna put some substance into these fluffy articles???
"Mac - I'm not a Mac! Why doncha turn your lights on, ya mo-ron"

posted by : Grunchy, 02 September 2008
OMG you ignorant

Heat raises resistance in most materials. True, there are certain materials, like semiconductors, with a negative temperature coefficient (NTC), but the vast majority are the opposite. Bill Hewlett (of Hewlett-Packard) first exploited tungsten's POSITIVE temperature coefficient to stabilize an oscillator by putting a light bulb in the feedback loop. That was in Hewlett-Packard's very first product: an audio oscillator. The rest is history.

The rest of your post is pretty much gibberish. High-K gates have lower leakage due to better oxide performance; they don't handle heat better, they dissipate less heat at idle, is all.

posted by : TalkinSense, 02 September 2008
Very scientific

A very long and detailed story but could be explained easier - bad choice of solder make joints break early.




posted by : unfortunate, 03 September 2008
Good Reason for High Lead Bumps

Nice article, but let me add the good reason for high lead bumps. Eutectic solder has a much lower melting temperature, and that is the traditional alloy for attaching the package board to the mother board. You want high lead for attaching the die to the package so that those bumps don't melt when the package is attached to the mother board. Of course this is all getting thrown out the window in the face of Rohs compliance, so be ready for more problems in the future as well as higher costs.

posted by : John Mardinly, 03 September 2008
quick question

You mention 40% of specific parts have early failure rates. Which specific parts, and what is considered early failure? This would be good info to give everyone a baseline.

posted by : Bounty, 03 September 2008
simple question...

eutectic isn't the way to go then... neither is high-lead on eutectic...
why didn't they just switch to high-lead organic substrates? That seems like it would solve the current density, stress, and MTBF problems all at once.

Let me guess, RoHS considerations?

posted by : Ken Stein, 04 September 2008
Technical corrections-nVidia

Wanted to correct some statements on the nVidia article, at a risk of getting too technical.
1. High lead bumps (typically 95/5 Pb/Sn) are more ductile than "eutectic" (~ 63/37 Sn/Pb). Eutectic bumps are stronger and more resistant to fatigue ('broken fork' model) but this increased stiffness translates more stress to the GPU and can lead to delamination in the multi-layer stack of dielectric and copper that interconnects all the circuits on the GPU. Intel has moved to plating copper columns that are capped with a lead-free solder on recent 45 nm processors. The stiff copper column transfers a lot of stress to the column - processor interface, so Intel had to do a lot of engineering work to make this interface reliable.
2. I believe nVidia used eutectic on older technology GPUs, and changed to high lead flip chip bumps when they went to 90 nm and 65 nm technology nodes for the GPUs. This was done specifically to try to avoid cracking / delamination between the bump and the interconnect stack on the GPU. Bonding a high lead bump to the eutectic solder cladding on the substrate bond pad results in the molten eutectic solder dissolving some of the high lead bump, creating a tin-rich phase bonding the remaining unaffected high lead bump volume to the bond pad on the substrate. There is nothing wrong with this connection.
3. Flip chip bumps arrayed on a GPU at a pitch of ~180 - 200 micrometers are typically sized such that they can handle ~100 milli-amps per bump with a junction temperature of ~100 Celsius for about 10 - 20 years of use.
4. One aspect that is not noted in the story is that the thermal management solution (heatsink, how the heat sink is pressed against the back side of the GPU, how much pressure is applied to the GPU back side, how the GPU package is supported on the board, etc.) has a tremendous impact on the integrity of the flip chip bump interconnect. The greater the pressure on the GPU, the greater the potential damage to the interfaces of the bump to the GPU and the bump to the substrate pad.
5. The stated risk of the polyIMIDE (not polyamide) coating on the GPU being torn from the die is not the risk imposed by a high Tg / high stiffness (high modulus). If the underfill is too stiff, it may protect the bumps from any form of degradation quite nicely, but it may succeed in causing the multilayer stack of insulating layers and copper to delaminate from the GPU, typically starting in the corners of the GPU. So it is correct that the underfill must be carefully chosen based upon the material properties of thermal expansion (above and below the Tg where it softens a lot), the modulus of elasticity (stiffness) over the intended temperature range of use, and Tg.
6. So, because the thermal solutions in laptops and desktops are so different, even though the normal (as supplied) junction temperatures can be expected to be about the same, the combination of high pressure on the die backside, a junction temperature higher than the Tg (where the underfill softens a lot), and a lot of power cycling, aggravated by electromigration (atoms diffuse in the direction of the electron flow), can lead to accelerated degradation of the flip chip bump bond interfaces.
7. Flux is used to help the flip chip solder bumps flow and bond well, and the residue of this flux is often not removed before the underfill is applied. Even if the residue is removed with cleaning processes, some is commonly trapped between the solder connection and the green solder mask material on the surface of the substrate. This residue can melt at low temperatures and further aggravate the reliability of the interconnection.

Flip chip packaging is complicated....
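
As a rough illustration of the figures in point 3 above, the short Python sketch below turns the quoted pitch (~180 - 200 micrometers) and per-bump rating (~100 milli-amps) into a bump density and an upper-bound current per square millimetre. The function names are hypothetical, and the sketch assumes a full square grid in which every bump carries its rated current, which the comment does not claim; real dies reserve many bumps for signals, so the result is an upper bound only.

```python
# Back-of-envelope arithmetic for the bump figures quoted in point 3.
# Assumptions (not from the comment): a full square grid of bumps, and
# every bump carrying its rated current. Treat results as upper bounds.


def bumps_per_mm2(pitch_um: float) -> float:
    """Bump density for a square grid with the given pitch in micrometres."""
    pitch_mm = pitch_um / 1000.0
    return 1.0 / (pitch_mm * pitch_mm)


def max_current_per_mm2(pitch_um: float, amps_per_bump: float = 0.1) -> float:
    """Upper-bound current (amps) per square millimetre of bump array."""
    return bumps_per_mm2(pitch_um) * amps_per_bump


if __name__ == "__main__":
    for pitch in (180.0, 200.0):
        print(f"{pitch:.0f} um pitch: "
              f"{bumps_per_mm2(pitch):.1f} bumps/mm^2, "
              f"{max_current_per_mm2(pitch):.2f} A/mm^2 at most")
```

At a 200 micrometre pitch this works out to about 25 bumps per square millimetre, so tightening the pitch to 180 micrometres raises the density by roughly a quarter.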

posted by : Pkgg Engr, 09 September 2008
Useful website for all those affected

If your HP laptop is showing signs of the above-mentioned defect, there is a website that has a lot of useful information to help.

www.hplies.com

posted by : Angela, 14 April 2009
Just check the Hp forums, thousands of Nvidia GPU Laptops DIE!!

Just check the Hp forums, thousands of Nvidia GPU Laptops DIE after about 1 year of usage.
Just search the net for the words "HP defective GPU".

http://forums11.itrc.hp.com/service/forums/bizsupport/questionanswer.do?admit=109447626+1245333842556+28353475&threadId=1274587

posted by : Jhonny, 18 June 2009
if Nvidia is so bad, then what does that make the other graphic cards?

Well, after reading this really long article about how lousy Nvidia is, all I can say is that I tried about every brand of video card out there before switching to Nvidia (and sticking with it, for about 7 years now), and I have never had a software compatibility issue with an Nvidia video card.

Wish I could say the same for the other brands of video cards. Quite a few of them are really bad about this, such as Raedon (don't know if I spelled that right).

A few of the more modern PC games I have now are really graphics-intensive, and I had to upgrade from an Nvidia GeForce 5200 to an Nvidia GeForce fx 9100, but as long as I have made sure that I had the required or better video card for the software I get, I have never had any such issues with an Nvidia card.

Other brands I have tried in the past were always buggy with some software or another, even though the software was within the capabilities of the video card.

The graphics either didn't run smoothly, no matter how you tweaked the settings, or the software needed to be patched for that specific card, or the software just plain didn't run at all, even though it was supposed to with the video card in question.

So my question is, if Nvidia makes such bad video cards, defects and all, then what does that make all of these other brands? Something much worse than just 'bad', I suppose.

posted by : Misato, 25 October 2009