The INQUIRER

Oak Ridge is fixing interconnect bugs in Titan cluster

Hasn't passed acceptance testing
Wed Feb 20 2013, 09:57
OAK RIDGE NATIONAL LABORATORY (ORNL) has admitted that its Titan high performance computing (HPC) cluster has yet to pass acceptance testing.

ORNL's Titan HPC cluster sits at the top of the prestigious Top 500 list. However, the cluster, which pairs AMD Opteron CPUs with top-of-the-line Nvidia Tesla K20X GPGPU accelerator cards, has some glitches in its interconnects.

ORNL said the problems that have held up acceptance lie in the interconnect fabric between the cluster's CPU and GPU components.

Jeff Nichols, chief of ORNL's scientific computing division, said the lab is working with Cray to sort out the bugs. He told local rag Knoxville News, "We've found a few bugs that have held us back, and we're doing some repair work with Cray in order to get the stability tests where we want them to be."

According to the report, researchers can currently use only the Titan cluster's CPUs, which leaves most of its performance capacity unavailable. Nichols said the cluster was completing between 92 and 93 percent of the jobs sent to it, just shy of the 95 percent completion rate required to pass the acceptance test.

Cray told The INQUIRER that it had always expected the Titan cluster to pass acceptance testing in the second quarter of 2013, while Nichols said ORNL hopes the cluster will pass the test in March or April.

Nvidia would not comment on whether it is playing any part in the repair work. µ
