The Inquirer

Why Nvidia's duff chips are due to shoddy engineering

Part Two The underfill

  • Charlie Demerjian
  • 01 September 2008

This is the second part of a series of three articles getting to the nub of Nvidia's graphics chip woes. The series is the result of months of research conducted by diligent INQhack Charlie Demerjian, despite an in-box stuffed full of abuse. Part One can be found here and Part Three is here.

GETTING BACK to the underfill, this is probably the key to the problem. There is one more property of underfill called the glass transition temperature, Tg for short. Tg is not a melting point, it is more the temp at which the material goes soft and loses most of its structural rigidity. The underfill that Nvidia used, Namics 8439-1, is what's called a low Tg material, and the Hitachi 3730 has a higher Tg.

To be fair to Nvidia, about the time when the G84 and G86s were hitting the market, high Tg underfills were pretty rare and new to the market. Low Tg underfills, such as the Namics material that NV used, had been available for a while and were 'known'. The last thing you want to do is put a high risk part on a new and market-untested material, so it looks like they went with the safe choice, low Tg.

If Nvidia did their homework right, the Tg of the material should never be hit, the chip should always run below that temp, and the underfill should provide the mechanical support needed to keep the high lead bumps from fracturing. This is why you engineer, test, retest, simulate, pray a lot, and pick your materials very carefully.

Namics 8439-1 underfill temp vs strength curve

Here is the Tg curve for Namics 8439-1. Let us be the first to say there appears to be nothing, repeat, nothing wrong with this material; it does exactly what it says it does. It starts to lose strength at about 60C and by a little over 80C it has 100 times less rigidity. Think going from hard plastic to jello. What temps do GPUs run at again? What is the Tj (junction temperature) for them? Ooops. Big hundreds-of-millions-of-dollars ooopsie right here.
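To put rough numbers on that curve, here is a minimal Python sketch. It assumes the drop between the two knee points is a straight line on a log scale, and it pins "a little over 80C" at 82C; both are illustrative guesses, not values from the Namics datasheet. The function and constant names are made up for this example.

```python
import math

# Figures taken from the description of the curve above, not from a datasheet.
SOFTEN_START_C = 60.0   # "starts to lose strength at about 60C"
SOFTEN_END_C = 82.0     # "a little over 80C" (82C is an assumption)
STIFFNESS_DROP = 100.0  # "100 times less rigidity"


def relative_stiffness(temp_c: float) -> float:
    """Return stiffness relative to the cold value (1.0 = fully rigid).

    Assumes a straight line on a log scale between the two knee points.
    This is a simplification for illustration, not the real material curve.
    """
    if temp_c <= SOFTEN_START_C:
        return 1.0
    if temp_c >= SOFTEN_END_C:
        return 1.0 / STIFFNESS_DROP
    fraction = (temp_c - SOFTEN_START_C) / (SOFTEN_END_C - SOFTEN_START_C)
    return 10.0 ** (-fraction * math.log10(STIFFNESS_DROP))


if __name__ == "__main__":
    for temp in (40, 60, 71, 80, 90):
        print(f"{temp:3d}C -> {relative_stiffness(temp):.3f} of cold stiffness")
```

Under these assumptions the underfill has already lost about 90 per cent of its stiffness halfway through the transition, which is well inside the range a hot GPU can reach.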

So, the failure chain happens like this. NV for some unfathomable reason decides to design their chips for high lead bumps, something that was likely decided at the layout phase or before because the bump placement is closely tied to the floorplan. At this point, they are basically stuck with the bump type they chose for the life of the chip.

The next choice was the underfill material, and again, they chose the known low Tg part, which had far less tolerance than the newer-to-market high Tg materials. It was a risk vs risk proposition, likely with a lot of cost differences as well. They chose wrong, very wrong. The stiffness of the Namics material might be perfect below the Tg, but once you hit it, it is almost like it isn't there, and the stress transfers to the bumps while they are hot and weak.

Fanbois will cry that their $0.23 temp sensor is reading much lower temps than that, so there is no way this could be an issue. Well, the temp sensors on many cards are not on die, much less between the die and the substrate. They are also cheap and notoriously inaccurate. To top it off, they only measure average temp across the chip, not hot and cold spots. If you look at the IR photo in the previous part of this story, you can see that if you move the sensor from the right side to the left, you will get vastly differing readings. In this case, a real current chip, it will vary by as much as 30C depending on placement.
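The gap between an averaged reading and a hot spot is easy to show with a toy example. The temperature map below is entirely hypothetical, built only so that the spread across the die is the 30C mentioned above, and the 80C softening threshold is an assumption drawn from the curve description, not a measured figure.

```python
from statistics import mean

# Assumed softening point, taken loosely from "a little over 80C" above.
UNDERFILL_SOFTENING_C = 80.0

# Hypothetical die temperature map in C (cool on the left, hot on the right).
# These are made-up values, not readings from any real chip.
DIE_MAP_C = [
    [62.0, 64.0, 70.0, 78.0],
    [63.0, 66.0, 75.0, 88.0],
    [62.0, 65.0, 74.0, 91.0],
    [61.0, 63.0, 69.0, 80.0],
]


def summarize(die_map):
    """Return the average, hottest and coldest cell of a temperature map."""
    cells = [temp for row in die_map for temp in row]
    return {"average": mean(cells), "hottest": max(cells), "coldest": min(cells)}


if __name__ == "__main__":
    stats = summarize(DIE_MAP_C)
    print(f"sensor-style average: {stats['average']:.1f}C")
    print(f"hottest spot:         {stats['hottest']:.1f}C")
    print(f"spread across die:    {stats['hottest'] - stats['coldest']:.1f}C")
    print("hot spot above softening point:",
          stats["hottest"] > UNDERFILL_SOFTENING_C)
```

With this made-up map the average sits comfortably below the threshold while the hottest cell is well over it, which is exactly the situation an averaged sensor cannot see.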

Many people also don't realize that it is easier for heat to travel down through the pins (they are mini heat pipes) than it is to cross the three thermal barriers on the way up to the heatsink (die -> thermal paste -> heat spreader -> thermal paste -> heatsink). That means those little bumps take a huge thermal pounding, and are usually hotter than the surface of the heat spreader.

To make matters worse, the bumps that are under the hot spots get hotter still. Piling on the pain, they carry the most current, and the hotter things get, the more resistance they usually have, and the more heat they generate.
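That feedback loop can be sketched with the textbook linear model for metal resistance, R(T) = R_ref * (1 + alpha * (T - T_ref)), and Joule heating, P = I^2 * R. Every number below is a placeholder chosen for illustration: the reference resistance, the current per bump and the temperature coefficient are assumptions, not measurements of any Nvidia part.

```python
def bump_resistance_ohm(temp_c, r_ref_ohm=0.010, temp_ref_c=25.0,
                        alpha_per_c=0.004):
    """Linear resistance model for a metal conductor.

    r_ref_ohm and alpha_per_c are hypothetical placeholder values.
    """
    return r_ref_ohm * (1.0 + alpha_per_c * (temp_c - temp_ref_c))


def joule_heat_w(current_a, temp_c):
    """Heat generated in one bump at a fixed current (P = I^2 * R)."""
    return current_a ** 2 * bump_resistance_ohm(temp_c)


if __name__ == "__main__":
    current = 0.5  # amps per bump, an assumed figure
    for temp in (25.0, 60.0, 100.0):
        milliwatts = joule_heat_w(current, temp) * 1000.0
        print(f"{temp:5.1f}C -> {milliwatts:.3f} mW of self-heating")
```

The absolute numbers mean little, but the direction is the point: at the same current, a hotter bump dissipates more heat than a cooler one.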

Could it get worse? Of course it could. Remember thermal stress? The expansion is highest at the point, wait for it, that is hottest. That would be under the hot spots, and it puts the most stress on the bumps that are weakest.

This is why you have to pick your underfill very carefully, you have to relieve as much stress as you can from the bumps. Too little and they go snap, and the chip dies. Too much and you pull the polyimide layer off and the chip dies. Basically, you go as stiff as you dare, then test the hell out of it under the conditions your simulations tell you will be present. Test, test, test, test or dies die.

When the underfill hits its Tg, it means you are at the hottest point on the die, where the bumps it is protecting are under the most heat, pulling the most current, and under the most thermal stress. If the underfill essentially turns to jello, it is very bad. If you compound that by using bumps that bond poorly to the substrate, it makes things worse. If those bumps are stiffer than the other option, it is worse yet.

Let's go down the checklist for Nvidia. High thermal load? Check. Unforgiving high lead bumps? Check. Eutectic pads? Check. Low Tg underfill? Check. Hot spots that exceed the underfill Tg? Check. If you are thinking this looks bad, you are right, expensive too.

If it were just as simple as the underfill going soft, the parts would never have made it to market. It is much more complex than that. The problem with thermal stress is that it is somewhat cumulative: unless it is quite extreme, it weakens parts long before they actually break.

An example of extreme thermal stress would be to take a glass cup, preferably non-tempered, and put it in the oven on max. Pull it out and drop it in a bucket of ice water, and voila, instant thermal stress demonstration. Wear eye protection. The thermal stress that the bumps see is much more like the fork example earlier, it gets weaker and weaker with each bend, until snap, black screen.
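The 'weaker with each bend' idea is what reliability engineers usually model as cumulative fatigue damage. The sketch below uses two standard textbook tools that are not named in this article: a Coffin-Manson style power law for cycles to failure, and Miner's rule for adding up damage. The coefficient and exponent are invented for illustration and say nothing about the actual bumps.

```python
def cycles_to_failure(delta_t_c, coefficient=1.0e7, exponent=2.0):
    """Coffin-Manson style estimate: bigger temperature swings, fewer cycles.

    coefficient and exponent are hypothetical, chosen for round numbers.
    """
    return coefficient / delta_t_c ** exponent


def accumulated_damage(cycle_counts):
    """Miner's rule: sum each swing's share of its cycles-to-failure budget.

    cycle_counts maps a temperature swing in C to the number of cycles seen.
    A total of 1.0 or more means the joint is predicted to have failed.
    """
    return sum(count / cycles_to_failure(delta_t)
               for delta_t, count in cycle_counts.items())


if __name__ == "__main__":
    # Hypothetical history: many mild swings plus fewer harsh ones.
    history = {50.0: 2000, 100.0: 500}
    damage = accumulated_damage(history)
    print(f"accumulated damage: {damage:.2f}")
    print("predicted failure" if damage >= 1.0 else "still alive")
```

No single cycle in that history breaks anything, yet the total reaches the failure threshold, which is the fork-bending picture in arithmetic form.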

If you recall, the Nvidia parts are breaking at the bump to substrate connection. This is the weakest point in the chain, and it is where they made the worst possible materials choice. It is not really a surprise that it failed. It is simply shoddy engineering.

So, what can be done by Nvidia at this point? Well, changing to high Tg underfills is a start, as is changing to eutectic bumps. The high Tg underfill option has come down in risk substantially since the G84 and G86 parts were introduced, so that is a no-brainer, and guess what Nvidia did to the G86? And the G92 as well.

The problem of changing bump types is far thornier, but Nvidia is doing that as well. From the intelligence we have been able to gather, Nvidia has not made any power distribution changes to the parts, there is no power grid, no power plane, or no anything to protect the eutectic bumps from electromigration. They may be able to keep them under their current capacity, but by how much?

This is emblematic of the 'pants are on fire' school of engineering, and reports from inside Nvidia confirm that they are in full panic mode over this snafu. With short time horizons to fix a massive batch of defective parts, reliability engineering usually draws the short straw. µ

Part Three: The cock-up, is here

© Incisive Media Investments Limited 2015

© Incisive Business Media (IP) Limited, Published by Incisive Business Media Limited, Haymarket House, 28-29 Haymarket, London SW1Y 4RX, are companies registered in England and Wales with company registration numbers 9177174 & 9178013

Digital publisher of the year 2010, 2013 & 2016
