DESIGNER OF HOT GRAPHICS CHIPS Nvidia might have an unlikely competitor thanks to the mess it made with Fermi.
Fermi chips on Nvidia's high performance computing (HPC) Tesla boards have disappointed, not just in terms of performance but more importantly with the firm's continued inability to rein in power consumption. The ill-advised design choices it made have, as The INQUIRER reported first, left some of its former loyal supporters such as Silicon Graphics International (SGI) looking elsewhere for viable alternatives.
For HPC vendors the Fermi problem isn't just its lower than expected performance but the reality of data centres, where Nvidia's Fermi based Tesla cards limit the amount of hardware a customer can put in a rack before running out of cooling capacity. The Fermi architecture was designed to extend Nvidia's rapid expansion in the HPC market, so the news that the Green Goblin had to not only cut back on the number of 'Cuda' cores but increase the thermal design power (TDP) as well was a double whammy.
As Nvidia got too clever for its own good, it forgot what had made its GPGPUs so popular. SGI's senior director of server product marketing Bill Mannel told The INQUIRER that even though it had invested heavily in Field Programmable Gate Array (FPGA) through its RASC programme, when surveying alternatives, including the Cell architecture, Nvidia's Cuda represented the best fit for the firm at the time. In some ways SGI's decision was the correct one, as Mannel says that many of the other options have since "fallen away". Now there's a chance that Nvidia will join that list.
To understand why power draw and cooling are so important in a graphics chip that's destined for HPC environments, one must look at the mindset behind running equipment that is designed for remote administration. Those who have spent any time working in a data centre will attest to the fact that they are hostile places to work.
Servers and cooling equipment create an oppressive cacophony of noise and when combined with the low ambient temperature, which Fermi does so well to raise, even the simplest of tasks can become tedious. Sending an engineer can be especially costly given the growth in modularised data centres, allowing for equipment to be dumped in the middle of nowhere to make use of lower data centre costs.
For this reason, vendors such as SGI design servers so that the majority of maintenance can be done remotely, which leaves hardware failure as the primary reason to ever send engineers down to the data centre. However, that's expensive. Therefore when Mannel admits reliability is affected by "hot cards" such as Fermi, it's hardly surprising that SGI is thinking of jumping ship.
Thanks to its acquisition of Cray back in 1996, SGI has access to several exotic HPC cooling approaches. The same Cray engineers who designed some of the supercomputing icons from the 70s and 80s, according to Mannel, are still on the payroll. Coming from a company that was known for over-engineering its products, those engineers, whom Mannel calls "a conservative bunch" when it comes to pushing the thermal design envelope, are wary of Fermi, and rightly so. The engineers lighten up when it comes to 'cloud computing' clusters, presumably because a limited number of failures can be masked through the abstraction of quantity.
The aura surrounding Fermi cards is enough to instil fear into systems vendors. SGI not only has to put "an additional amount of work" into testing systems but even those which do not ship with a Fermi board have to be designed "with the knowledge that a [Fermi] Tesla board may be added". Given that Mannel has already said that "hot cards" can lead to a "worse failure profile", how long can Nvidia expect vendors to go the extra mile to design, implement and service boards that give them so much trouble at every stage of their lifecycle?
The problem for Nvidia is not simply its inability to curb Fermi's heat generation. Cuda, while capable of increasing performance by orders of magnitude, simply isn't the most 'natural' of coding ideologies, so it requires a high investment of time, and consequently money, to convert models and code from serial to parallelised programming.
Though Cuda is an improvement on FPGA in this sense, Mannel claims that SGI's team of "code profilers" still have to work hard with clients to produce "speed ups" in code. Refreshingly honest, Mannel spoke of an SGI customer who had spent a year optimising code only for it to eventually produce merely a doubling of speed once in production.
While the impressive performance gains touted by Nvidia can occur in theory, the reality is somewhat different. Mannel's team found the average speed up across the board is a modest 7x. Given the work SGI and others like it have to do in order to "engage in parallelism", having the bonus of a GPGPU that has a thirst for power that would put an African tin pot dictator to shame doesn't help matters at all. The hindrance is so great that, according to Mannel, SGI is seeing a surge in what can only be described as the antithesis of Tesla, which the firm calls "microservers".
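One common way to reason about speed ups like these is Amdahl's law, which caps the overall gain by the share of run time that cannot be parallelised. The sketch below is an illustration only, not SGI's profiling method: the function names are hypothetical and the 512-worker figure is an arbitrary assumption, not a Fermi specification.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Overall speed up when `parallel_fraction` of the run time
    is spread evenly across `n_workers` and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


def implied_parallel_fraction(speedup):
    """Smallest parallel fraction that could yield `speedup`,
    taken in the limit of unlimited workers."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Speed ups mentioned in the article: 2x for one customer, 7x on average.
    for observed in (2.0, 7.0):
        fraction = implied_parallel_fraction(observed)
        print(f"{observed:.0f}x needs at least {fraction:.1%} parallel run time")

    # Hypothetical worker count, chosen only to show the ceiling effect.
    print(f"6/7 parallel on 512 workers: {amdahl_speedup(6 / 7, 512):.2f}x")
```

Read this way, a 2x result means at least half of the run time was parallelised, while a 7x average needs roughly six sevenths or more. Adding workers cannot push the speed up beyond what the remaining serial portion allows.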
Flogged by SGI under its Molecule brand, servers based on Intel's low power Atom chip provide a popular strategy for a number of SGI customers. Mannel says that some of its big customers have opted to try its Atom processor and claims that for "Apache and search workloads" the servers "did well" and provided "very good price/performance".
SGI claims to fit in the order of 10,000 Atom cores in a single 19-inch server rack and according to Mannel the firm's engineers are finding ways of increasing the density of Atom motherboards in what it calls a "tray", with each tray representing one rack unit (1U), the height of a 5.25" drive bay. With Intel marketing the Atom for "media consumption" devices such as netbooks, Mannel said that the chipmaker is "not particularly appreciative" of its use of Atom chips outside of netbooks.
Perhaps it is a sign of how it views Nvidia's GPGPUs that when asked how Intel is coping against Nvidia's onslaught in the HPC arena, Mannel merely said that both Intel and AMD "are still doing fine".
Sadly, while Nvidia CEO Jen-Hsun Huang continues to take pot-shots at Intel, he would do better by listening and understanding what his customers want. The news that SGI is looking at alternatives to Fermi is, in itself, far from surprising given Nvidia's dismal track record with the chip. What is surprising, though, is how long it has taken for an HPC vendor to come out and say that it's had enough.
Nvidia should, rightly, be credited for bringing the GPGPU market to wider attention. However, its Fermi folly has meant that the company that popularised a method of computation which holds great promise is facing a two-way threat from both AMD and Intel.
The scale of Fermi's failure can be judged by the fact that Nvidia's customers are seeing a surge in sales of servers based on a chip Intel doesn't even deem fit for the data centre. Ironically, thanks to its success with Cuda, Nvidia could see Fermi become an even bigger failure for it than NV30. µ
The quality of the reporting at the Inq continues to decline. This article expounds nothing beyond the author's deep ignorance of HPC.
In case you haven't noticed, the HPC world moves on far slower timescales than the latest gamers. We don't update our chips every six months; the cycles for validation, testing and running are years long. We also appreciate the difficulty of making a chip like Fermi, and a few months aren't going to kill anything. CUDA is a new architecture, and while it's difficult, it's less awkward than programming for Cell, which hasn't taken off as much as it could have.
But using Atom for HPC? You've got to be joking. Even Intel's old Pentiums (look at them in the ULV segment) offer much better performance per watt. And the interconnect bandwidth (which is ENORMOUS within the GPU) would have to be replaced with ethernet-style connections between the Atom processors, which are orders of magnitude slower.
No one is discussing using Atoms in any serious way for HPC calculations. Webservers, perhaps, but the idea that Fermi will be replaced with Atoms is utterly ridiculous. The "journalists" here ought to learn something before spewing yet another moronic diatribe against NVIDIA. Just because you hate them doesn't mean you know anything.
TNC is right.
To save face, Nvidia might have to cannibalize potential Fermi sales and perhaps position Tegra to be used in the areas that its customers may want to use Atoms.
Although Tegra is not an x86 based processor... from what I have read, performance and energy consumption are far better than at least the initial Atom processors. I don't know how much that has changed, but if Nvidia makes Tegra an option within its HPC strategy, it might save a little face and maybe patch some holes to give it time to heal.
IMHO Fermi will benefit Nvidia in the long run, when the manufacturing process shrink finally catches up to Nvidia's ambition. When will that be? - Who knows! Maybe Fermi on 25nm process might see some good results.
Too bad all that thermal energy cannot be reused - to heat homes - ha! Okay bad joke... sorry.
I find your criticisms very much in keeping with the tone of the original piece - uninformed and hostile. Of course I did read the article, it wasn't very long, particularly for something that was supposed to be "in depth". The only reference to a non-HTPC/netbook application of the Atom processor was to the SGI product aimed at the cloud computing segment. This was it. This was the sum total of the evidence presented that Nvidia was "losing on the hpc front". I repeat, the SGI Molecule servers are not designed for HPC applications, nor are they marketed for this use.
That the article is hostile toward Nvidia is evident by the tone: "designer of hot graphic chips", "mess it made with Fermi", "disappointed", "inability", "ill-advised", "Green Goblin", "double whammy", "too clever for its own good", "African tin pot dictator", etc. This was not an analysis, this was a rant.
I love these wannabe Nick/Lawrence trolls. Too many commas? Oh my! Let's call the journalism police. They have no right to get poor Gordon and tnc's panties in a bunch like that! BAD journalist, BAD! No pudding for either of you!
Gordon, try informing yourself on a subject before posting. You'll look like less of an ass that way.
tnc, if you'd READ the article, instead of just jumping to the bottom to post your retarded claims, you would have known both of your comments are wrong.
Now ladies. put away your keyboards. fix your makeup and try not to look so eager to follow poor Nick and Lawrence around. Or keep it up. @Phil and I could always use a laugh.
Nvidia's strategy of overclocking chips so they break both power and TDP envelopes and lead to failures reminds me of Pentium 4 Prescott CPUs. I have seen loads of notebooks fail when the Nvidia chips inside failed due to overheating. Also I love fanless HPC solutions where there is no failure due to fans and no ever-growing noise from the fans. Maybe time to run graphics in emulation on ARM based CPUs with a Transmeta-like solution. While others like Intel are producing multicore low power solutions, Nvidia is still engaged in high power single core solutions which only look good on desktops with water cooling or other extreme cooling solutions. Fermi or Tegra, they all will fail due to hot chips :-)
which is why all these vendors are rushing to introduce products with Nvidia GPUs. No one uses Atom servers in HPC and no one uses a GPU to run a web server - these are two different market segments.
Another uninformed rant from El Inq.
Intel has put a latency defect in Atom's DNA to keep its bread-and-butter Xeon business running.
--------------
http://www.extremetech.com/article2/0,2845,2362982,00.asp
The researchers dubbed the flaw "query latency," and noted that the Atom's performance suffered when activity suddenly spiked. A sudden threefold burst of search queries resulted in the Atom failing to provide a search result 22.4 percent of the time before a "cutoff latency," the time in which the search engine aggregator would give up and move on to another microprocessor. Complex searches slow the Atom even further, they concluded, while the Xeon remained unaffected.
That meant that, to maintain that critical quality-of-service level, each Atom server has to be overprovisioned, with additional chips added to minimize the load on each individual processor. To compensate for user spikes and search complexity, 28.6 percent more Atoms would need to be added to the data center, trimming the Atom's cost and power advantage by 25 and 12 percent, respectively.
The Atom's advantages were also mitigated by the fixed power requirements of a data center. To match the throughput of a datacenter full of Xeons, the researchers found that seven times as many Atoms would be required, increasing the total power consumption to three times that of the Xeon and well above a typical power budget.
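The quoted figures can be turned into a quick back-of-the-envelope check. The sketch below is not from the researchers' study; it only rearranges the ratios cited above (seven times as many Atoms, three times the total power, 28.6 percent overprovisioning). The function names and the 1,000-unit fleet size are made up for illustration.

```python
# Ratios as quoted in the excerpt above.
ATOM_COUNT_RATIO = 7.0    # Atoms needed per Xeon for equal throughput
TOTAL_POWER_RATIO = 3.0   # Atom fleet power relative to the Xeon fleet
OVERPROVISION = 0.286     # extra Atoms needed to absorb query spikes


def per_unit_power_ratio(count_ratio, total_power_ratio):
    """Power drawn by one Atom relative to one Xeon, implied by fleet totals."""
    return total_power_ratio / count_ratio


def per_unit_throughput_ratio(count_ratio):
    """Throughput of one Atom relative to one Xeon at equal fleet throughput."""
    return 1.0 / count_ratio


def throughput_per_watt_ratio(count_ratio, total_power_ratio):
    """Atom throughput per watt relative to Xeon under the same figures."""
    return per_unit_throughput_ratio(count_ratio) / per_unit_power_ratio(
        count_ratio, total_power_ratio
    )


def overprovisioned_count(base_count, overprovision):
    """Number of Atoms after adding the spike headroom, rounded to whole units."""
    return round(base_count * (1.0 + overprovision))


if __name__ == "__main__":
    power = per_unit_power_ratio(ATOM_COUNT_RATIO, TOTAL_POWER_RATIO)
    efficiency = throughput_per_watt_ratio(ATOM_COUNT_RATIO, TOTAL_POWER_RATIO)
    print(f"One Atom draws about {power:.0%} of a Xeon's power")
    print(f"Atom throughput per watt is about {efficiency:.0%} of a Xeon's")
    # Hypothetical fleet of 1,000 Atoms before overprovisioning.
    print(f"Fleet after headroom: {overprovisioned_count(1000, OVERPROVISION)}")
```

On these numbers each Atom draws roughly 43 percent of a Xeon's power but delivers about a seventh of its throughput, which is where the threefold power bill comes from.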
1: The article is a comment article, hence it is the journalist's opinion on the subject. It's not meant to be written from a neutral standpoint; it's an opinion.
2: The article, which you clearly failed to grasp the point of, isn't about Nvidia graphics cards but their GPGPU cards that are being sold to HPC centres.
3: If you had any idea what you are talking about you'd see that the article isn't really very biased at all. Namely the cards being 6 months late, way over their power budget, underperforming in clock speeds, and even the flagship model having to have cores fused off and suffering extensive overheating.
This is all common knowledge to anyone following the subject and shouldn't have to be spelt out every time someone reads an article on it for the benefit of the ignorant like yourself.
As for the commas, who, gives, a, toss, mate?
I love these "comment" articles. Rather than starting from a neutral standpoint and looking at the facts to support a conclusion they always start with a conclusion for which the author digs up whatever "facts" they think support it, ignoring anything else. Ferret and his Lawrence Latif sock puppet are by far the worst offenders for this.
I have done no research into the latest gen nVidia cards so I have no opinion on the issue either way, nor am I especially interested at the moment. If I was interested in a new graphics card, I would not rely on the opinions of anybody who writes articles for the Inq to decide which one to buy. With utter morons like Ferret and Latif allowed to post, how trustworthy can their colleagues be? Maybe that's a bit unfair to the other Inq writers, but it's always the most vocal members of a group that shape your view of it, and Ferret and Latif are very vocal indeed.
Oh, and pro tip for you, Mr Latif, if that is indeed your name, if you use, four, commas, in a sentence, you're, DOING IT WRONG! Jesus Christ in a brothel, you'd think somebody with pretensions of being a journalist would have a basic grip of high school level grammar! You're not a journalist mate, you're a blogger with an overinflated opinion of himself.
After a flop like this, Nvidia's next GPU is probably going to pwn. It'll have to, or else they're going to go the way of the American economy... down, down, down.
I'll say it again, as I said in my post yesterday concerning Nvidia's release of its mobile Fermi and the fact that it is HUGE and HOT, this is a great opportunity for AMD, especially concerning their Fusion APU.
Imagine a rack with...say...4,096 4 core Fusions each with a DirectX 11+ and OpenGL 4+ GPU in CPU silicon directly tied together with a MUCH LOWER TDP than could ever be DREAMED of with a rack of 4,096 CPU's AND 4,096 Nvidia GPUs.
Now imagine the same rack after ex-Ageia founder, ex-PhysX inventor, ex-Nvidia CUDA head Manju Hegde gets a hold of AMD's Fusion Program (which he has just been named head of BTW)....
Oh, yeah...the tide has turned I believe...we will see.
Oh..for the record, I'm not an AMD fanboi. I have one old IBM Thinkpad T42 with an ATI Radeon Mobility 9600. The rest of my computers have Nvidia's and all my Brainstorm boxes for 3D graphics have Nvidia Quadro 5500fx's (soon to be replaced with 5800fx's).
They sure are nice on bitter cold days.