Inside Supercomputers: results out in Top 500

The Alpha does well, well gosh what a surprise
Sat Jun 28 2003, 20:10
THE LATEST SuperComputer 2003 Summer show just ended in MittelEuropa, bringing along the fresh half-yearly TOP500 supercomputer list update. Aside from the still unchallenged NEC Earth Simulator, a proprietary and darn expensive parallel vector supercomputer in the Land of the rising sun (but setting Sun?), the elite section of the list is now dominated by tightly coupled large clusters of standard 64-bit and sometimes 32-bit servers.

For instance, #2 is ASCI Q at Los Alamos Labs, the world's biggest publicly known Alpha system, based on 2,048 quad-CPU HP Alpha ES45 1.25 GHz nodes (8,192 CPUs in total) running Tru64 UNIX. Its performance is 20.48 TFLOPs peak and 13.88 TFLOPs maximum in Linpack.

Each CPU has a humongous 16 MB cache running at half the CPU frequency. Why so much cache? Well, there is a total of 33 TB RAM in this cluster, and you need big caches for any kind of efficiency there.

Then #3, the MCR Linux cluster at Lawrence Livermore Lab, with 1,152 dual 2.4 GHz Xeon nodes and 4.6 TB RAM, for a total of 11.2 TFLOPs peak and 7.63 TFLOPs obtained in Linpack. At #6 is its IBM-manufactured sibling at the same site (used for Blue Gene/L related development), with 960 nodes (1,920 Xeon CPUs) and peak and Linpack scores of 9.216 and 6.586 TFLOPs.

Then, at #8, we move back to the 64-bit arena, with the world's fastest Itanium cluster at Pacific Northwest Labs. Composed of 770 dual 1 GHz McKinley HP systems, it reaches 6.16 TFLOPs peak and 4.881 TFLOPs in Linpack. It is joined at #9 and #10 by two more Alpha clusters, one at Pittsburgh Univ, the other at the French Atomic Energy Agency.
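For a quick feel of how much of the theoretical peak each of these clusters actually delivers, the peak and Linpack figures quoted above can be turned into an efficiency ratio. The following is only a small sketch: the numbers are the ones given in this article, all expressed in TFLOPs, and the names `SYSTEMS` and `linpack_efficiency` are made up for illustration.

```python
# Peak and Linpack figures as quoted in the article, all in TFLOPs.
# The dictionary layout and function name are illustrative assumptions.
SYSTEMS = {
    "ASCI Q (#2)": (20.48, 13.88),
    "MCR Linux cluster (#3)": (11.2, 7.63),
    "IBM Xeon cluster (#6)": (9.216, 6.586),
    "PNNL Itanium cluster (#8)": (6.16, 4.881),
}


def linpack_efficiency(peak_tflops, linpack_tflops):
    """Return the fraction of theoretical peak achieved in Linpack."""
    if peak_tflops <= 0:
        raise ValueError("peak must be positive")
    return linpack_tflops / peak_tflops


if __name__ == "__main__":
    for name, (peak, linpack) in SYSTEMS.items():
        ratio = linpack_efficiency(peak, linpack)
        print(f"{name}: {ratio:.1%} of peak")
```

On these figures all four machines land between roughly two thirds and four fifths of their peak.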

Now, what's common among all these monster machines? Well, the interconnect that makes them tick and produce those stunning results out of a stack of more-or-less normal servers. Believe it or not, it comes from Bristol, the home of the Transputer (anyone still remember that little wonder?).

Quadrics, now owned by Finmeccanica from Italy, is, together with EV-7 Alpha and Hypertransport in Opterons, the spiritual successor to the Transputer, and quite a few Transputer guys are in fact there. They were about the first with a SPARC-based shared-memory commodity clustering interconnect in the late nineties, but in the usual British fashion, the excellent technology somehow had neither excellent marketing nor excellent (Ed: you mean any?) government backing like its US counterparts. Finmeccanica was the white knight that came to the rescue, and from there comes the Quadrics of today: the leading large-cluster interconnect and the only one at that scale with distributed shared memory capability across many thousands of nodes.

Well, some would say then that good old Digital Equipment (let it RIP, poor soul, stabbed from the inside?) was kind of a British company then? Ah, it is understandable - after all, it was based in "New England"...

Without going into details, distributed shared memory allows the programmer to address RAM on the other systems in the cluster as an extension of the node's own RAM, forming a linear address space that can total the sum of all memory spaces in the cluster - so, in a 64-bit cluster of 128 systems with 16 GB RAM each, your task could see up to 2 TB of memory as available. Of course, the "remote" memory is quite a bit slower than the local one, so clever programming needs to be there to minimise the penalty.
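The arithmetic behind that 2 TB figure, and the idea of one linear address space spread over many nodes, can be sketched in a few lines. This is a toy model only: it assumes each node contributes one contiguous, equally sized block to the global space, which is a simplification for illustration and not a description of how Quadrics actually lays out memory. The names `total_address_space` and `locate` are hypothetical.

```python
GIB = 2**30  # bytes in one binary gigabyte


def total_address_space(nodes, bytes_per_node):
    """Size in bytes of the combined address space of all nodes."""
    return nodes * bytes_per_node


def locate(global_addr, nodes, bytes_per_node):
    """Map a global address to (node index, offset within that node).

    Assumes a simple block layout: node 0 owns the first
    bytes_per_node addresses, node 1 the next block, and so on.
    """
    if not 0 <= global_addr < total_address_space(nodes, bytes_per_node):
        raise ValueError("address outside the cluster's memory")
    return divmod(global_addr, bytes_per_node)


if __name__ == "__main__":
    nodes, per_node = 128, 16 * GIB
    print(total_address_space(nodes, per_node) // GIB, "GB in total")
    print(locate(16 * GIB + 4096, nodes, per_node))
```

Any address whose node index differs from the caller's own is "remote" in the sense described above, and that is where the latency penalty comes in.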

In some tasks like quantum modeling or combinatorial chemistry, the "shmem" shared memory approach beats MPI message passing by up to several times in actual performance on large datasets - partly because with MPI, you can't directly see the memory on other nodes in the cluster, and have to partition or replicate the tasks and data, then pass messages in between to exchange operands or results.
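The difference in programming style can be shown with a deliberately tiny single-process model. Nothing here is the real shmem or MPI API: `Node`, `remote_get`, `send` and `recv` are hypothetical stand-ins, and the "cluster" is just a Python list, so the sketch illustrates only who has to do what, not performance.

```python
from collections import deque


class Node:
    """Toy cluster node: some local memory plus an inbox for messages."""

    def __init__(self, data):
        self.memory = list(data)
        self.inbox = deque()


def remote_get(cluster, node_id, offset):
    """Shared-memory style: the reader names the remote location itself."""
    return cluster[node_id].memory[offset]


def send(cluster, dest_id, payload):
    """Message-passing style: the owner of the data must push it out."""
    cluster[dest_id].inbox.append(payload)


def recv(node):
    """Message-passing style: the reader must explicitly take delivery."""
    return node.inbox.popleft()


cluster = [Node([1, 2, 3]), Node([10, 20, 30])]

# Shared-memory style: node 0 simply reads element 2 of node 1's memory.
shared_value = remote_get(cluster, 1, 2)

# Message-passing style: node 1 sends the operand, then node 0 receives it.
send(cluster, 0, cluster[1].memory[2])
passed_value = recv(cluster[0])

print(shared_value, passed_value)
```

Both routes fetch the same operand, but the second needs cooperating code on both nodes, which is the partition-and-exchange burden mentioned above.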

Quadrics claims leading performance per rail on both MPI and shmem, and runs on Alpha, Intel (all the Pentia and Itania) and soon AMD server stuff (where is POWER?? - I thought they were in need of a good interconnect). The real limit is now the PCI and even the PCI-X bus, as the clustering fabric becomes faster than the I/O bus, and more direct system connections, like the POWER5 GX+ bus, AMD HyperTransport, and the Alpha IO7 bus, become necessary to lower the latency and maximise bandwidth.

Right now I'm playing with a small Xeon cluster, where next month we'll check some of the performance scalability stuff on Quadrics, Gigabit and possibly Myrinet, the chief competitor of Quadrics. µ
