DESIGNER OF HOT GRAPHICS CHIPS Nvidia might have an unlikely competitor thanks to the mess it made with Fermi.
Fermi chips on Nvidia's high performance computing (HPC) Tesla boards have disappointed, not just in terms of performance but more importantly with the firm's continued inability to rein in power consumption. The ill-advised design choices it made have, as The INQUIRER reported first, left some of its former loyal supporters such as Silicon Graphics International (SGI) looking elsewhere for viable alternatives.
For HPC vendors, the problem with Fermi isn't just its lower-than-expected performance but the reality of data centres, where Nvidia's Fermi-based Tesla cards limit the amount of hardware a customer can put in a rack before running out of cooling capacity. The Fermi architecture was designed to extend Nvidia's rapid expansion in the HPC market, so the news that the Green Goblin had to not only cut back on the number of Cuda cores but increase the thermal design power (TDP) as well was a double whammy.
As Nvidia got too clever for its own good, it forgot what had made its GPGPUs so popular. SGI's senior director of server product marketing Bill Mannel told The INQUIRER that even though SGI had invested heavily in Field Programmable Gate Arrays (FPGAs) through its RASC programme, when it surveyed the alternatives, including the Cell architecture, Nvidia's Cuda represented the best fit for the firm at the time. In some ways SGI's decision was the correct one, as Mannel says that many of the other options have since "fallen away". Now there's a chance that Nvidia will join that list.
To understand why power draw and cooling are so important in a graphics chip that's destined for HPC environments, one must look at the mindset behind running equipment that is designed for remote administration. Those who have spent any time working in a data centre will attest to the fact that they are hostile places to work.
Servers and cooling equipment create an oppressive cacophony of noise and, when that is combined with the low ambient temperature, which Fermi does so well to raise, even the simplest of tasks can become tedious. Sending an engineer can be especially costly given the growth in modularised data centres, which allow equipment to be dumped in the middle of nowhere to take advantage of lower running costs.
For this reason, vendors such as SGI design servers so that the majority of maintenance can be done remotely, which leaves hardware failure as the primary reason to ever send engineers down to the data centre. However, that's expensive. Therefore when Mannel admits reliability is affected by "hot cards" such as Fermi, it's hardly surprising that SGI is thinking of jumping ship.
Thanks to its acquisition of Cray back in 1996, SGI has access to several exotic HPC cooling approaches. The same Cray engineers who designed some of the supercomputing icons from the 70s and 80s, according to Mannel, are still on the payroll. Coming from a company that was known for over-engineering its products, those engineers, whom Mannel calls "a conservative bunch" when it comes to pushing the thermal design envelope, are wary of Fermi, and rightly so. The engineers lighten up when it comes to 'cloud computing' clusters, presumably because a limited number of failures can be masked by the sheer quantity of machines.
The aura surrounding Fermi cards is enough to instil fear into systems vendors. SGI not only has to put "an additional amount of work" into testing systems, but even those that do not ship with a Fermi board have to be designed "with the knowledge that a [Fermi] Tesla board may be added". Given that Mannel has already said that "hot cards" can lead to a "worse failure profile", how long can Nvidia expect vendors to go the extra mile to design, implement and service boards that give them so much trouble at every stage of their lifecycle?
The problem for Nvidia is not simply its inability to curb Fermi's heat generation. Cuda, while capable of increasing performance by orders of magnitude, simply isn't the most 'natural' of coding ideologies: converting models and code from sequential to parallelised programming requires a high investment of time and, consequently, money.
Though Cuda is an improvement on FPGA in this sense, Mannel claims that SGI's team of "code profilers" still have to work hard with clients to produce "speed ups" in code. Refreshingly honest, Mannel spoke of an SGI customer who had spent a year optimising code only for it to eventually produce merely a doubling of speed once in production.
While the impressive performance gains touted by Nvidia can occur in theory, the reality is somewhat different. Mannel's team found the average speed-up across the board is a modest 7x. Given the work SGI and others like it have to do in order to "engage in parallelism", having the bonus of a GPGPU that has a thirst for power that would put an African tin-pot dictator to shame doesn't help matters at all. The hindrance is so great that, according to Mannel, SGI is seeing a surge in demand for what can only be described as the antithesis of Tesla, which the firm calls "microservers".
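One way to read these figures, and this is our gloss, not something Mannel said, is through Amdahl's law: if only part of a program's runtime can be moved onto the GPU, the part left behind caps the overall gain however fast the card is. The sketch below is a minimal illustration in Python; the function names are hypothetical and the 100x accelerator figure is an assumption picked purely for illustration, while the 2x and 7x values are the ones quoted above.

```python
def amdahl_speedup(parallel_fraction: float, accel_factor: float) -> float:
    """Overall speed-up when `parallel_fraction` of the original runtime
    is accelerated by `accel_factor` and the remainder runs unchanged."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accel_factor <= 0.0:
        raise ValueError("accel_factor must be positive")
    serial_part = 1.0 - parallel_fraction
    return 1.0 / (serial_part + parallel_fraction / accel_factor)


def min_parallel_fraction(overall_speedup: float) -> float:
    """Smallest share of runtime that must be accelerated to reach
    `overall_speedup`, even with an infinitely fast accelerator."""
    if overall_speedup < 1.0:
        raise ValueError("overall_speedup must be at least 1")
    return 1.0 - 1.0 / overall_speedup


# Assumed, illustrative accelerator: 100x faster on the code it can run.
half_parallel = amdahl_speedup(0.5, 100.0)

# Shares of runtime implied by the 2x and 7x figures quoted in the article.
share_for_2x = min_parallel_fraction(2.0)
share_for_7x = min_parallel_fraction(7.0)

print(f"50% parallel code on a 100x card: {half_parallel:.2f}x overall")
print(f"2x overall needs at least {share_for_2x:.0%} of runtime accelerated")
print(f"7x overall needs at least {share_for_7x:.0%} of runtime accelerated")
```

On this reading, a code base in which only half the runtime parallelises can never do better than 2x, which would make a year's optimisation for a doubling of speed less surprising than it first sounds.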
Flogged by SGI under its Molecule brand, servers based on Intel's low power Atom chip provide a popular strategy for a number of SGI customers. Mannel says that some of its big customers have opted to try its Atom processor and claims that for "Apache and search workloads" the servers "did well" and provided "very good price/performance".
SGI claims to fit in the order of 10,000 Atom cores in a single 19-inch server rack and, according to Mannel, the firm's engineers are finding ways of increasing the density of Atom motherboards in what it calls a "tray", with each tray occupying one rack unit (1U, or 1.75 inches), roughly the height of a 5.25" drive bay. With Intel marketing the Atom for "media consumption" devices such as netbooks, Mannel said that the chipmaker is "not particularly appreciative" of SGI's use of Atom chips outside of netbooks.
Perhaps it is a sign of how it views Nvidia's GPGPUs that when asked how Intel is coping against Nvidia's onslaught in the HPC arena, Mannel merely said that both Intel and AMD "are still doing fine".
Sadly, while Nvidia CEO Jen-Hsun Huang continues to take pot-shots at Intel, he would do better by listening and understanding what his customers want. The news that SGI is looking at alternatives to Fermi is, in itself, far from surprising given Nvidia's dismal track record with the chip. What is surprising, though, is how long it has taken for an HPC vendor to come out and say that it's had enough.
Nvidia should, rightly, be credited for bringing the GPGPU market to wider attention. However, its Fermi folly has meant that the company that popularised a method of computation which holds great promise is facing a two-way threat from both AMD and Intel.
The scale of Fermi's failure can be judged by the fact that Nvidia's customers are seeing a surge in sales of servers based on a chip Intel doesn't even deem fit for the data centre. Ironically, thanks to its success with Cuda, Nvidia could see Fermi become an even bigger failure for it than NV30. µ