WHEN WE TOLD YOU about the 'bad bumps' in the Apple Macbook Pro 15-inch models the other day, we expected it to end there.
But as luck would have it, Nvidia pointed us to a much deeper problem that not only affects at least some of the Macbook Pro notebooks, but likely every other high Temperature of Glassification (Tg) underfill chip Nvidia makes.
To understand this article, you really need to understand the problem, so please read the technical three part series (Part 1, Part 2 and Part 3) explaining what the problem is and where it occurs.
Nvidia's current problem stems from its half-hearted response to its earlier problem by only changing the underfill. Nvidia said that's what it did, both near the end of our initial Macbook article and in a later Cnet article here.
In that, Nvidia's Mike Hara said, "Intel has shipped hundreds of millions of chipsets that use the same material-set combo. We're using virtually the same materials that Intel uses in its chipsets." Note the word 'virtually'. The problem with this statement - other than his analogy being misleading and not addressing Nvidia's chip design problem - is that 'virtually' in this case means Nvidia missed a key coating component in its revised chip engineering design. It is NOT the same material-set technology as Intel, AMD, ATI and everyone else we talked with uses. Unfortunately for Nvidia, the coating material it left out is critical for the life of the chip.
Before we break out the electron microscope again, we feel the need to point out some of the things that Nvidia managed not to talk about in its purported explanation of the fix. It is sad to have to point this out, but underfill does not crack, bumps do. The bumps that cracked did so for a long chain of reasons that are explained in my earlier three-part article linked above.
Nvidia changed one of the steps in the chain, and seemingly none of the others. This might change the frequency of the bumps cracking, for either good or bad, or it might not. It might also introduce a new and much more serious failure mode, and that is what we believe Nvidia did.
Underfill is basically a glue that surrounds the bumps, keeps them from getting contaminated, and keeps them moisture free. It also provides significant mechanical support for the chip that is crucial for enabling it to withstand structural stresses, which are primarily caused by repeated heating and cooling cycles during operation.
Two properties of underfill matter here: Tg and stiffness. Tg is the Temperature of Glassification, which means the temperature at which it loses all stiffness. Instead of thinking about it melting, think about it turning to jelly. Stiffness is how hard it is before it reaches that temperature.
One unusual property of underfill is that its Tg is related to its stiffness. If you want it to glassify at a higher temperature, it will be stiffer to start with. Lower Tg, softer initial stiffness. When making a chip, you have to balance between making the underfill so elastic that it effectively does nothing and so hard that it rips the chip apart on first power up. If you do things right, you make it as stiff as you can, but not too stiff. If the underfill is too soft, it won't provide enough structural support to relieve the strain on the bumps; too hard, and it will damage the underside of the chip itself.
Let's move back to how a chip is made. You all know about a silicon wafer - it is a 300mm silicon disc that you essentially draw pretty patterns on. Modern chips have multiple layers of metal that make transistors drawn on the silicon, and on top of each other. You can see some of this in the microphotographs below.
Modern chips have multiple metal layers; eight is pretty common for devices like CPUs and GPUs. To prevent the layers from shorting each other out, there is a layer of insulation deposited between them - this is called the passivation layer. The resulting chip is a relatively thick hunk of silicon with a 16-layer or so sandwich on top that goes metal/passivation/metal/passivation and so on. It ends up looking like a Roman aqueduct in a cross-sectional view.
An Intel 90nm CPU sliced
In a very simplified explanation, the more insulating you make the passivation layer, the faster the chip can work. This means low-K materials like Black Diamond are really useful, but they are also very fragile. You might have eight of these layers, and they have holes punched through to allow communication between the layers. The structure isn't all that strong to begin with, and the holes don't help. On top of the sandwich, you have an outer coating, usually Silicon Nitride (SiN), which is basically a hard ceramic shell that protects things.
Remember, these devices are called flip-chips because when they come out of the fab, they are flipped over, and the bumps go on what was the top. This is then covered with underfill and soldered to the substrate, the green fiberglass thing that most people think of as a 'chip'. The former top during fabrication is then the bottom after packaging, and the underfill touches the substrate and the SiN layer.
Because the SiN layer is pretty stiff, any strain on it will be transferred into the layers of the chip itself fairly directly. If there is too much strain, the layers of the chip peel apart and you have what is called catastrophic inter-layer delamination, and that kills the chip even deader than cracked bumps.
This means you have to change the passivation material to a stronger substance to take the stress. Unfortunately, the passivation layer isn't just an option you can readily change out on an already designed chip. Different choices in the passivation layer have cascading effects in the chip design and manufacturing process. This is complicated by the fact that there aren't that many viable choices to begin with. What you end up with is a limit on the stiffness of the underfill. This is why Nvidia didn't just crank up the underfill Tg a year ago - it has very serious consequences, most of them fatal to the device, and there are limited underfill options for a given passivation layer material.
A good analogy is a light bulb and a steel plate - light bulbs are fragile, steel plates are not. If you hit a light bulb with a hammer, you get lots of little pieces, but a steel plate will shrug it off. If you put a steel plate on top of a light bulb, carefully, and hit it with a hammer, you will not damage the plate, but the bulb will shatter just as if you hit it directly. This is very similar to how the strain within a chip assembly gets transferred, and the chip is basically a multi-layer light bulb and steel plate sandwich.
Luckily for chipmakers, there is a third option that allows you to have a fairly stiff underfill and not tear things apart. It is called a polyimide layer (PI), and it is a relatively thick - we are talking µm here - coating that you put on top of the last passivation layer. The PI layer is kind of rubbery. It absorbs some of the strain so the passivation layers don't have to, and it also distributes it over a wider area.
In essence, the PI layer simply protects the chip more. This allows you to use a stiffer underfill and not tear things apart. Notice I said stiffer, not solid steel. If you go too far with a stiffer underfill, you will transfer too much strain, and the chip will still die. The PI layer gives you a bit more leeway, taking more stress off the bumps, but you still have to choose very carefully and test the results to an amazingly high degree.
In the Cnet article, Hara said Nvidia changed the underfill, and we will assume that he meant Nvidia stiffened it, not made it softer. Softening it would only increase the problems they had with bump cracking, and while we may not hold Nvidia engineering in all that high regard, we can't assume they are abjectly stupid. So, Nvidia changed the underfill to a more 'robust' version, and didn't change anything else. We actually believe this story, mainly based on the parts we have dissected.
All is well, right? Ride off to the coffee shop in the sunset with your new Macbook happily working, Nvidia chips not dying in large numbers. However, there is only one tiny problem with that ending.
The Problem In Pictures
Remember when we said that Nvidia engineering wasn't abjectly stupid? Scratch that. Remember when we said we were going to break out the electron microscope? It's time. Remember the part about the PI layer being necessary for stiffer underfills? Guess what?
A test chip with a SiN layer
What you are seeing is the top of the bump, where it contacts the chip. The round light grey part on the bottom is the bump, the darker grey on the top is the silicon itself. The spotty stuff above the top yellow line is the transistor and passivation layer sandwich - the aqueduct - and the dark grey area on the right is the underfill.
This chip, a materials test part, has no PI layer, just a SiN coating. You can see that the SiN coating is not even 2µm thick - it is the dark line that crops the top of the bump and ends at the pad on the chip.
For those of you who have been paying attention, you may notice some clumping in the bump material - it is eutectic, not high lead, and the clumping is a result of enthalpy. This is a thermal test chip, not a production part, used for heat cycle testing. That is why the bump material clumped: repeated heat cycles.
A test chip with a PI layer
This next one has the same major components, but you will notice the SiN layer is much thicker, 5 or more times, almost 10µm. That is because it not only has the SiN layer, but it also has a PI layer to absorb stress. This chip is also a test vehicle, and has eutectic bumps and a higher Tg underfill. We can conclude from this that a typical PI layer is 5µm or more thick, and a SiN layer is visibly thinner. Things may change depending on the fab, materials used, and intended use, but the rough thicknesses won't change much.
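For readers who want the thickness reasoning spelled out, here is a minimal sketch of the subtraction involved. The function name and the sample measurements are hypothetical; the only figures taken from the text are the rough ones above (a SiN coating of about 2µm at most, and a combined coating of almost 10µm on the PI test chip). Real cross-section measurements would vary by fab and process.

```python
# Rough figure from the article: a SiN coating is "not even 2µm" thick.
SIN_THICKNESS_UM = 2.0


def estimated_pi_thickness_um(total_coating_um: float,
                              sin_um: float = SIN_THICKNESS_UM) -> float:
    """Estimate the PI layer thickness as whatever coating is left over
    once the SiN layer is subtracted. Never returns a negative value."""
    return max(total_coating_um - sin_um, 0.0)


# Hypothetical readings approximating the two test chips described above.
chip_with_pi = estimated_pi_thickness_um(10.0)   # "almost 10µm" total coating
chip_sin_only = estimated_pi_thickness_um(1.8)   # coating under 2µm

print(chip_with_pi)    # 8.0
print(chip_sin_only)   # 0.0
```

An estimate of 5µm or more is consistent with a PI layer being present; an estimate near zero suggests the coating is SiN alone.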
The bump from a Macbook Pro 15-inch 9600 GPU
Last up, we have a close-up of the bump from the Macbook Pro's G96/9600 GPU. It is a high lead bump with, according to Nvidia, a higher Tg underfill. This means that the SiN layer should be under 2µm thick. Check, it is. Then the PI layer should be another 5+µm or so. Che.... Hey, wait a minute, there is no PI layer! No, really, it is not there.
Yeah, you are thinking right, Nvidia simply forgot the one critical layer to make its much vaunted, and on the surface correct, high Tg underfill work. To that, all we can say is that it does indeed seem so. If anyone has a better explanation, and several packaging engineers I talked with did not, feel free to chime in, my email is at the top of every article.
What this looks like is that Nvidia traded a bump cracking problem for an inter-layer delamination problem. Both lead to a term that semiconductor people call catastrophic failure, something you don't need an engineering degree to understand.
According to multiple packaging people contacted about this story, all of whom want to remain anonymous, this is a much worse problem than bump cracking. Phrases like "abject stupidity" and "how the [fsck] did they miss that" were tossed around, but still, they did.
In these conversations, several scenarios were put forward to explain it. None of them posit that it won't be a problem, they all say that it will, they were simply grasping at straws to say how Nvidia missed this one.
The first scenario theorizes that Nvidia had a bunch of high lead wafers sitting in inventory. When it first learned about the problem, it stopped bumping the chips because it knew where the problem lay, just not why. When the engineers got the go-ahead to restart the line with high Tg underfill, they had to use up a few months worth of wafers. Because a PI layer can't be applied after the wafer is fabbed, they were stuck, so they crossed their fingers and hoped someone like me wouldn't notice. I did, and if everything we hear is true, Macbook Pro owners and a lot of others will eventually notice as well.
The next theory is slightly more plausible - that Nvidia didn't have time to properly test. A heat cycle test of packaging material takes about three months to do, and you can't really rush it. If the first new parts started rolling out of the fab on July 1, 2008, the first day of Q3, and it takes about three months to set up and qualify a new fab process, that means the fab had to go into production setup on the first day of Q2.
Subtract out a further three months to thermal stress test the solution and Nvidia had to have started that around the first day of Q1/08, meaning that its engineers would have had to flip the switch on testing with a New Year's hangover. If the bump cracking problem was discovered in the fall of 2007, maybe even late summer, there was only one quarter to figure out what the problem was, research alternatives, and make test structures. There could not have been time for a second round of tests unless Nvidia knew about the problem far in advance of what HP and Dell admitted to.
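The back-of-the-envelope timeline above can be checked with a few lines of date arithmetic. This is only a sketch of the reasoning in the text: the July 1, 2008 start date and the two three-month intervals come from the paragraphs above, while the helper function and variable names are made up for illustration.

```python
from datetime import date


def months_before(d: date, months: int) -> date:
    """Return the first day of the month that falls `months` months before d."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)


# Figures from the article.
first_parts_out = date(2008, 7, 1)   # first day of Q3 2008
FAB_QUALIFICATION_MONTHS = 3         # set up and qualify a new fab process
THERMAL_TEST_MONTHS = 3              # heat cycle test of packaging material

fab_setup_start = months_before(first_parts_out, FAB_QUALIFICATION_MONTHS)
thermal_test_start = months_before(fab_setup_start, THERMAL_TEST_MONTHS)

print(fab_setup_start)      # 2008-04-01, first day of Q2
print(thermal_test_start)   # 2008-01-01, first day of Q1
```

Working backwards this way leaves only the last quarter of 2007 for diagnosis, research and building test structures, which is the squeeze described above.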
The most likely way this would have played out is that Nvidia tested the structures, and none worked out well. Its engineers gritted their teeth and took the most promising option, no PI. The other scenario is that Nvidia didn't figure it out early, and was rushed to come out with a 'fix' because Jen-Hsun had to file an 8-K and let the public know. Not having an answer and a fix in hand would not have been compatible with executive egos, so the engineers came up with an answer, but they couldn't definitively say that it would work.
In either case, the length of testing time required is probably what bit them. It is a long and intricate process to stress test chips like this correctly. Nvidia has shown with the initial bad bumps problem that it botched that across multiple generations, so why should we give them the benefit of the doubt this time? The more interesting question is, when did it know what?
Next up, we have the long shot scenario, that Nvidia packaging engineers, if they actually have them rather than outsourcing everything, simply missed an entire branch of science. They all took a class on semiconductor engineering, but they all slept through that day. And didn't read the book.
One last thing to toss into the mix, cost. The PI layer is expensive, it adds about $50 to the cost of a wafer. Wafers from TSMC on a high end process cost about $3,000 to $5,000 depending on a lot of details. Adding the PI layer increases the cost of silicon by a noticeable amount, and adds to the defect rate.
For cards that sell to big OEMs for $30 or so, silicon can't be more than a few dollars of the total. Adding 25 cents to the cost of a chip is a big deal, it can mean the difference between profit and loss for the entire run. One engineer suggested that Nvidia might have shot down the PI layer on cost grounds, but we don't buy that. They weren't that desperate, were they?
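The cost figures above hang together if you run the numbers. The sketch below uses only the article's rough figures ($50 per wafer for the PI layer, $3,000 to $5,000 per wafer, 25 cents per chip); the dies-per-wafer count is not stated anywhere in the text and is simply what those figures imply. Yield losses are ignored, and the names are hypothetical.

```python
# Rough figures from the article, all in US dollars.
PI_COST_PER_WAFER = 50.0
WAFER_COST_LOW = 3000.0
WAFER_COST_HIGH = 5000.0
PI_COST_PER_CHIP = 0.25

# Implied by the figures above, not stated in the article; ignores yield.
implied_dies_per_wafer = PI_COST_PER_WAFER / PI_COST_PER_CHIP


def pi_overhead_percent(wafer_cost: float) -> float:
    """PI layer cost as a percentage of the bare wafer cost."""
    return 100.0 * PI_COST_PER_WAFER / wafer_cost


print(implied_dies_per_wafer)                          # 200.0
print(round(pi_overhead_percent(WAFER_COST_LOW), 2))   # 1.67
print(round(pi_overhead_percent(WAFER_COST_HIGH), 2))  # 1.0
```

So on these assumptions the PI layer adds roughly 1 to 1.7 percent to the wafer bill, before counting any extra defects it introduces.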
What does this mean? Unlike what Nvidia has been implying, we have never stated that the 'bad bumps' in the Macbook Pro 15-inch would cause a failure. We simply stated that it is using the same material that caused failures in the older Macbooks, several HP and Dell lines, and likely many more that Nvidia has not admitted to publicly. The consumer has a right to know this about the products they are buying, and Nvidia steadfastly refuses to tell them.
This time, we see a potentially much more serious problem, and no doubt it will be explained away with pseudo-science and sound bites. Tame journalists and bloggers won't bother to question the science, won't understand it, and will take the easy, canned explanation at face value. No problem will ever be admitted to, and the problems that Macbook and other computer owners encounter will be something else, a rare anomaly, a one-off, trust them. Really. Apple did.
Once again, this is not saying that the Macbooks will fail, or that the one you have will fail. We are simply stating that, according to all the packaging experts we talked with, none of them could come up with a scenario where this is not a massive problem. Once again, time will tell.
In the best of half-hearted PR speak, the Nvidia rebuttal (see Cnet link above) claims my initial investigation of the 'bad bumps' was "already flawed." Nvidia won't say how my analysis was flawed, but it tosses that out in an attempt to tarnish the evidence. It also won't say what parts are affected, so there is no way to tell for sure. If I am so wrong, why cover it up?
As for all high lead bumps being bad, that is simply not true, not once did I say that. I stated that given a chain of engineering failures, bad choices, and inadequate testing, these parts are failing. There is a long chain of events that causes the failures. Read the three part technical explanation linked above for more.
Nvidia is claiming that it changed the underfill material, and had Dawn sprinkle a little green fairy dust on them, and all is better. Every engineer I talked with disagrees. It is clear that they missed a critical step in making these chips, so changing a single step in the chain will very possibly make matters worse.
If you look at what the higher Tg underfill does, it moves strain off the bumps, and puts it on the SiN layer, which transfers it to the fragile passivation layer. Nowhere has Hara said that Nvidia attempted to reduce the strain that causes the failures in the first place, much less accomplished that goal. In fact, he admits the opposite, unless I misinterpret the statement, "What we did was, we just simply went to a more robust underfill." This is a band-aid, applied by a fairy, sprinkled with pixie dust. Sadly, it does not appear to be a thoroughly engineered fix.
Hara said, "The material set (combination of underfill and bump) that is being used is similar to the material set that has been shipped in 100's of millions of chipsets by the world's largest semiconductor company (Intel)." In saying that, Nvidia was right, it is similar. Similar is NOT the same, and the devil truly is in the details. He is right that every semiconductor manufacturer that uses a high Tg underfill uses a similar recipe, but all of them that I talked with, every single one, also uses a PI layer. Period.
The Man Behind The Curtain
Last up, Nvidia is strongly hinting, like in this Gizmodo article, that there are some mysterious, nefarious forces behind my reporting, and that electron microscopes are hard to come by. The implication is that I couldn't pull The Big Picture Book of Science out of a paper bag with a map, flashlight and guide dog.
It may be true that I am not up on the latest techniques at the cutting edge of electron microscopy, but my years of college - going from chemical engineering, to chemistry, to biology, to genetics - weren't a total waste. Reading the output from a spectrograph isn't that tough when you have been holed up in a lab doing similar work with related devices for years.
That brings up the crack about electron microscope scarcity. They really aren't that uncommon, it's just that Hara probably doesn't know where to look for one. I live quite close to the University of Minnesota, and last time I attended courses there many years ago, there were lots of them sitting around, some better than others.
Every major semiconductor design house has at least one electron microscope, likely many. They are indispensable research tools. How many does Nvidia own? I don't have a clue, but stories like this don't seem to imply that they are all that uncommon. In fact, I have seen dozens in tours of companies around the valley. In defense of Mike, he is an investor relations executive, and the SEMs at Nvidia are probably on a floor without executive washrooms.
Hara blames Nvidia's competitors for being behind the story, and that is quite plausible on the surface. Really, Nvidia is cuddly, nice and honest, right? So who wouldn't like them? I mean, Nvidia openly declared war on Intel. It goes out of its way to antagonize AMD, treats the press like dirt, and plays its partners off against each other. A better question would be, at this point, who actually likes Nvidia? If you answer Joel Turnipseed, the guy in Iowa who lost all short term memory in a car accident in 2004, you might have the one.
One other thing that Hara doesn't appear to realize is that there are a few dozen teardown houses within an hour's drive of his office. Companies like Nvidia use them all the time when they want plausible deniability, a 'second opinion', or to dodge some trade secret laws. In fact, most semi companies use them regularly.
Some of them are public, others less so. A quick search for 'chip reverse engineering' should net you a dozen or so in very little time. To quote a friend from a large semi house, "The good ones don't have names."
What they do have, however, is a lot of expensive equipment, like the electron microscopes that are so craftily hidden at Nvidia headquarters. They also know how to use them well. One last thing, their business is quite 'peaky' - when a new chip comes out, they may tear it down, or tear down a few, and make a report. These reports sell for a lot of money, and that tides them over until a new part is released. In between busy times, some of them sit around bored, throwing darts at pictures of their former employers, while some stay busy 24/7. It simply depends.
What it comes down to in the end is that there is simply no shortage of companies, large and small, public and shadowy, that do teardown work. It really isn't all that hard. There is also no shortage of companies that dislike Nvidia - when a company sets out to piss everyone off, it often succeeds. The list of capable organizations with motives is not short, in fact it is very long.
Then again, it was my idea to begin with. When a company responds to an easy direct question with dodgy doublespeak, or answers another seemingly related question instead, alarm bells go off. Having solid information about the chips before you ask the question aids immensely in analyzing the PR/IR output. The bells went off this time, and the digging started. Several 'mad scientists' liked the idea, and agreed to help out as time permitted. It took two months, but the results were worth it.
In the end, what you have once again seems to be a massive engineering failure. This could, but not necessarily will, lead to inter-layer delamination failures. The Macbook Pro 15" GPU undoubtedly has the problem, and it is very likely that every Nvidia chip with high lead bumps and high Tg underfill does as well. We are still analyzing the eutectic bump parts, and will follow up with a report if we discover anything conclusive.
Nvidia is still stonewalling the first problem, and likely won't admit to this one unless they are forced by law to file an 8-K once again. Remember, the last admission was not voluntary. Once again, we will state the obvious: Nvidia has to come clean over this, admit what models are affected by the bump cracking, what computers the chips went into, and what chips are affected by this latest missing layer. Then the customer can decide. µ
Note: Apple was again called twice prior to publication and informed that there is a potential problem. Instead of calling us back to tell us that they knew about the issue, and had dealt with it, or would stand by their customers, Apple simply ignored us once again. Because of this, we award Apple the Steve Jobs Memorial Turtleneck for Pride and Arrogance (SJMTPA) for turning an opportunity to respond positively to this situation into mud. Own goal guys, zero for six!