Nvidia plays the meltdown blame game

Comment: Story doesn't mesh with reality
Mon Jul 07 2008, 17:32

NVIDIA'S STOCK TOOK a long overdue beating the other day, more because Wall Street is collectively horrified that it has been lied to than because of any fundamentals that are public. That said, the 8K keeps up the firm's tradition of honesty and integrity.

The root of the problem is, so far, HP notebooks, but likely others. You can see the HP page here, and at least one lawsuit about the same thing here. No mention of this in the Nvidia statement though. Why would they? If you look at what Nvidia says, it isn't their fault, it is those damn suppliers.

The official line is: "While we have not been able to determine a root cause for these failures, testing suggests a weak material set of die/package combination, system thermal management designs, and customer use patterns are contributing factors". Parsing that, you see that they are blaming fabs and packaging suppliers first, OEMs second, and those damn users third, but they have no fault here, NV can do no wrong.

This is really dangerous for three reasons: they are annoying suppliers, annoying OEMs and annoying users. Last we checked, they need all three to remain in business.

The weak die/packaging excuse doesn't wash at all. Nvidia is blaming TSMC behind the scenes, trashing them pretty hard through 'unofficial' channels to deflect blame. They are likely to be doing the same to packaging suppliers as well, and others. The reason this doesn't wash is that there are only a handful of suppliers in each of these fields.

If the suppliers had a problem with Nvidia's parts, they would have problems with other companies' parts too. ATI, Altera and dozens of others would have chips crapping out left and right, especially designs that are meant to run 24/7 like embedded parts. You would see an industry rife with failures and warnings, like the bad caps problem of a few years ago.

You simply aren't seeing that. Period. No warning from others, no recalls, no TSMC warnings, no nothing. This is a sham to deflect blame from Nvidia, they don't want to dent their shiny image, much less slow down the 'can of whoop-ass' opening. I am calling bullshit on the supplier-blaming problem.

Suppliers are a problem for Nvidia though, at least they are now. Trashing your suppliers like this is a dangerous thing to do, Nvidia needs them more than they need Nvidia. Can you imagine the scene at the next TSMC planning meeting where they are discussing who gets what allocation on the next tight process, and how much they pay?

TSMC Planner 1: How many wafers do we allocate for Nvidia a month?
TSMC Planner 2: The 40nm process is looking tight at first, do you agree?
TSMC Planner 1: Yeah, really tight.
TSMC Planner 2: Remember that time when NV was calling us [male rooster euphemism][oral suction euphemism]s to anyone who would listen? Wasn't that a fun time.
TSMC Planner 1: So 4 then?
TSMC Planner 2: 4K? That seems high.
TSMC Planner 1: No, 4.

Blaming your suppliers publicly is bad. When it isn't their fault, it is worse. Doing so in the sleazy backhanded ways that Nvidia knows so well is tantamount to corporate suicide. Suppliers will find a way to make you pay, and they will get the knife in somehow. Nvidia being bossy and arrogant only makes the situation more enjoyable for them. Look for this PR blunder to have massive long-term effects that manifest themselves in dropped margins, critical parts shortages, and missed deadlines. Bad move #1.

Bad move #2 is blaming the OEMs, which is done with the subtle phrase "system thermal management designs" in the 8K. This is engineering code for, "we didn't do anything wrong, those nitwits at HP did". It works like this: Nvidia makes a part, and it has a variety of constraints it is meant to be used within, things like power draw and minimum and maximum temperature.

NV specs these things, and HP makes a notebook to the specs that NV gives them, a process that happens long before the chips come out of the fabs in any decent volume. If the chips are within the promised specs, things go well. If they are not, there are some tweaks you can make, but if they are too far out of spec, you are basically screwed.

Now this assumes both sides are honest, and people are trying to solve problems, not deflect blame. Nvidia is really good at the latter, bad at the former. They also can't make a chip that isn't a blast furnace. Most of their recent woes, including the massively delayed current round of MCPs, are down to out of control thermals, just like the last round.

How do you fix a systemic design problem in silicon on a time scale that doesn't sink an entire season's notebook sales? Easy, you fudge the spec sheet. If you have a TDP of 20W for a part, and it is coming in at 25W from the fab, you can lower the speed or change what TDP means. If you promised HP a chipset that has an 800FSB and it can only hit 667, well, that is problematic. If you give them a chipset with a 20W TDP, and the definition of TDP changed between the last generation and this one, well, "that is how we do it now".

If it is HP incompetence as Nvidia is stating, then it would simply be a case of a line or two of notebooks that went bad. HP system engineering is one of the very best in the industry, period, subject to management whims. This is not to say they can't screw up, they most definitely can, but it is pretty rare on anything major. HP does seem to have QC process engineering down well.

Does this mean they are perfect? No, not even close. Have they screwed up on a notebook? Sure, probably several here and there over the past few years. If you look at the HP page, once again here, you will see there are 24 models affected. I can believe there are one, two, maybe four screwups, but 24 model lines all with the same problem? All with cooling related failures? All with cooling related video failures? All with cooling related video failures on Nvidia parts?

What NV is doing is smearing the good name of HP and its engineers here. There is no way in hell that HP totally botched every Nvidia based notebook for a generation in the same way. Not a chance. This is once again a smear job, and it will once again come back to bite Nvidia in the bottom line, give it time. Companies like this have long memories. The only thing you can say from this is that it is not HP's fault.

Well, actually, you can say more. If HP specced cooling for a theoretical 20W, and the Nvidia chip puts out more than 20W, what happens is you get more heat in the system than you can get rid of, and temperatures slowly climb. It will either keep climbing, or level off, but likely it is out of the thermal bounds set by Nvidia. The system will get really hot or simply crash.
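To put rough numbers on that, here is a minimal sketch of a lumped thermal model in Python. Only the 20W and 25W figures come from the example earlier in this piece. The ambient temperature, the thermal resistance and heat capacity of the cooler, and the function names are all hypothetical values picked for illustration, not anything from an Nvidia or HP spec sheet.

```python
# Lumped thermal model: one heat source, one cooler, one temperature.
# All parameter values below are illustrative assumptions, not measured data.

AMBIENT_C = 35.0           # assumed air temperature inside the chassis
R_THETA_C_PER_W = 2.5      # assumed thermal resistance of the cooling solution
C_THERMAL_J_PER_C = 20.0   # assumed heat capacity of die plus heatsink


def steady_state_temp(power_w):
    """Temperature at which heat removed equals heat produced."""
    return AMBIENT_C + power_w * R_THETA_C_PER_W


def simulate_temp(power_w, seconds=600, dt=1.0):
    """Step the temperature forward in time with simple Euler integration."""
    temp = AMBIENT_C
    for _ in range(int(seconds / dt)):
        # The hotter the part is relative to ambient, the more heat the cooler sheds.
        heat_removed_w = (temp - AMBIENT_C) / R_THETA_C_PER_W
        temp += (power_w - heat_removed_w) / C_THERMAL_J_PER_C * dt
    return temp


if __name__ == "__main__":
    for power in (20.0, 25.0):
        print(power, steady_state_temp(power), round(simulate_temp(power), 1))
```

With these made-up numbers, a cooler that holds a 20W part at 85°C lets a 25W part settle at 97.5°C. The temperature does level off, but at a point the cooling was never designed for, which is the scenario described above.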

The problem? This puts them out of the thermal tolerances for the packaging. That is OK for short periods, but repeatedly staying above the limits causes the packaging material to degrade prematurely. Worse yet, repeated heating and cooling caused by the laptops heating up and then crashing, then being left off for a bit to cool and 'work again', is horrible for the packaging. This is how solder joints and bumps crack, and substrates warp. Couple that with weakened materials from overheating, and you have dead GPUs.

This is hugely unlikely to be a HP problem, or a substrate problem. It is most likely a bad engineering design decision that Nvidia tried to sweep under the rug. Sometimes it works, other times it doesn't. This time is an 'other', and companies like TSMC and HP don't like being publicly crucified for Nvidia's screwups. They really don't like it.

The third bad move is 'customer use patterns': so, it isn't our fault, it is those crazy kids! A Scooby Doo villain couldn't have said it better after a failed whoop-ass attempt. From the look of things, the customers are doing things like turning laptops on and off, something likely unanticipated by Nvidia product planners. I mean who does that?

Blaming customers would be bad move number three, but I doubt most of them will realise it is Nvidia's fault, they will blame HP or the host of other OEMs that haven't been named yet. Either way, if you take bad move #2 into account, if I were an OEM, I would tell everyone calling in for warranty support unequivocally that it is Nvidia's fault for supplying bum chips. In this case, it wouldn't be deflecting blame.

In any case, the 'crazy kids' blame game is pointless and will only hurt Nvidia if people hear it. They likely won't, but there is no upside unless they think analysts are several steps dumber than a slow sheep.

In the end, the whole thing can be summed up by bad engineering, covering your ass, and hoping it blows over. Nvidia corporate messaging is pretty much incompetent, more driven by the fact that they are pawns of people higher up the food chain than anything else, and they only have one tool, a hammer.

When something goes wrong, they don't know how to solve problems, only hit things. This situation was dealt with by surprising Wall Street with a collective kick in the hedge funds. There was no explanation, no softening of the blow, and no word to the press, just a 'Surprise, we are tanking' governmental form, followed by stonewalling and finger pointing at blameless people.

Botched doesn't begin to describe this response, but it is a good start. They utterly flunked Crisis Management 101. Given the last sentence of the 8K, "There can be no assurance that we will not discover defects in other MCP or GPU products," this is far from over. In fact, we know it is; there are many more lines and products affected.

Now that you know about how the Nvidia parts failed leading to the massive loss, plummeting stock, and management fast-talking, what everyone wants to figure out is where the buck stops. That is not a simple question, but several industry insiders have told us the same story, it all depends on who got burned, and how big they are.

The one we know about is HP, here and here, but it is far from over. Nvidia is chiming in now because it is very likely they are footing the bill for the class action settlement, or at least a very large chunk of it. When they gave the prescient advice that "There can be no assurance that we will not discover defects in other MCP or GPU products", they weren't joking. This problem hasn't cropped up in desktop parts yet, but it most assuredly will. We are getting reports of other afflicted items, but it is premature to name them.

So, basically, Nvidia totally screwed up, and is blaming everyone but the one company they should, itself. The OEMs know it, consumers know it, suppliers know it, and since the "OMFG, our hair is on fire" performance of last week, just about the entire world knows about it. Everyone who has one of these parts will be seeking restitution, just watch the bills mount now that word has spread.

But that brings up the costs and payments. Nvidia took a $150-200 million hit initially over this, but what does that cover? Looking at Dell's web site, going from an integrated GPU to an external Nvidia GPU is either a $50 or $130 upgrade, maybe more on a low volume gaming part. That is what Dell sells the module for, plus profit and overhead. The chips that Nvidia sells, minus GDDR memory, construction etc, are probably in the $10-40 range.

If you look at that, there are three million or so parts affected, and they can likely be fixed by swapping out a PCIe card. With chipsets, well, things get interesting: they are soldered to the mobo, as are many CPUs, especially in thinner notebooks. In this case, the replacement means a new mobo at minimum, possibly with a CPU thrown in for good measure.
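For a sense of scale, here is a quick back-of-the-envelope calculation in Python using only the figures quoted above. It assumes, purely for illustration, that the whole charge is spread evenly over every affected part. The real failure rate is unknown, and the variable and function names are made up for this sketch.

```python
# Figures quoted in the article; the division is the only thing added here.
CHARGE_LOW_USD = 150_000_000
CHARGE_HIGH_USD = 200_000_000
PARTS_AFFECTED = 3_000_000        # "three million or so"
CHIP_PRICE_RANGE_USD = (10, 40)   # the article's guess at what Nvidia charges


def budget_per_part(charge_usd, parts):
    """Money available per part if the charge is spread evenly (an assumption)."""
    return charge_usd / parts


low = budget_per_part(CHARGE_LOW_USD, PARTS_AFFECTED)
high = budget_per_part(CHARGE_HIGH_USD, PARTS_AFFECTED)
print(f"${low:.2f} to ${high:.2f} per affected part")
```

That works out to $50.00 to $66.67 per part. On these assumptions it covers the chip at the guessed $10-40, but it does not leave much for a new mobo once the part is soldered down.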

Then there is the cost of fielding the support call, not a trivial matter for a dead notebook. Add to that shipping the part back to the depot, labour to replace the mobo, and shipping it out again. Added staffing to handle the returns of large portions of 24 notebook lines adds to the bill as well.

That leads to intangibles like customer ill will, lost productivity, and the odd executive who gets a bum laptop for their kids. You can't put a dollar value on these, but they do have an effect; much of Dell's current woes are due to treating customers like dirt three to five years ago.

So, once again, who pays for all of these costs? That is an unequivocal "it depends". Depends on how contracts are written, how much leverage the OEMs have, and how much good will Nvidia has built up.

On one side, you have Dell, one-time masters of the supply chain, and squeezers of every penny they can get. Industry insiders tell us that Dell will be billing Nvidia for everything, from bad GPUs and mobos to replacement costs, help desk, lawyers, and every truck roll needed to fix something in the field. If Nvidia wriggles out of paying for something, they will pay for it in other ways.

HP is a little more flexible, but since Nvidia has been effectively blaming their engineering for it, I can see how they would lean a bit more toward the "right royal bastard" side of things. They are close to Dell in what they will charge, but may let some minor things slide.

As you move down the food chain to smaller players, mobo makers, Tier 2 computer makers, and even little shops, NV will disclaim more and more. Asus and Gigabyte will likely not get everything covered, not even close. Smaller board makers might get credit for the cost of MCPs and GPUs.

Unhappiness will abound. They will all get their pound of flesh, it may just take a bit of time. Lawsuits seem to have forced disclosure, and NV is still trying to spin, minimize the downside, and point fingers. This, however, is far from over. Look for desktops to be affected as well as discrete GPUs before this is over, most of them use the same ICs as the mobile parts.

There seem to be two currently-affected products, the low-end and the mid-range parts of the last generation. Depending on the failure rate, Nvidia could be looking to eat the majority of a generation's products plus the cost of things they were soldered to, and the tech school dropout used to screw new parts in.

This will be very ugly before it is done, very very ugly. Finger pointing early on and the blame game will only harden resolve on the other side, and add to costs. There go their cash reserves, we guess. It couldn't come at a worse time. Then again, doing everything wrong does have a cost. µ

 
