The Inquirer-Home

Supercomputing moves beyond MPI

Analysis: Our cluster shares happy memories
Tuesday, 29 July 2008, 07:57

BY THE END of this year, the world's first (mostly) non-classified petaflop supercomputing monster should be unveiled - most probably in the United States. And no, we don't mean a petaflop from Cell processors or Tesla GPUs, but "real stuff" on real CPUs, in double precision, of course.

Let's say, for a second, that the chosen nodes are 3.2GHz dual-socket Gainestown Nehalems with the TylersburgDP chipset and those marvellous six channels of DDR3 memory. Put some of those 16 GB MetaRAM modules in there, and you've got 192 GB of DDR3-1333 per node in 12 DIMMs. Not bad at all, even for the eight cores in each system.

Back to the Flops. At 102.4 GFLOPs of Linpack Rpeak per node, it takes something like 10,000 nodes to reach that petaflops mark - peak performance, that is. Even though each node can pack humongous amounts of memory, lots of apps will benefit greatly from sharing the stuff between nodes in an even larger single memory space: either a virtual application space as in Cray-style shmem, or coherent shared memory as in ccNUMA.
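The node count is simple arithmetic. A rough sketch in Python, assuming four cores per socket and four double-precision flops per core per clock (assumptions, chosen because they reproduce the 102.4 GFLOPs per node quoted above); the function names are purely illustrative:

```python
import math

def node_peak_gflops(clock_ghz, sockets, cores_per_socket, flops_per_cycle):
    """Theoretical peak (Rpeak) of one node in GFLOPs."""
    return clock_ghz * sockets * cores_per_socket * flops_per_cycle

def nodes_for_target(target_gflops, per_node_gflops):
    """Smallest whole number of nodes whose combined peak reaches the target."""
    return math.ceil(target_gflops / per_node_gflops)

# Assumed node: 3.2 GHz, 2 sockets, 4 cores per socket, 4 DP flops per cycle.
peak = node_peak_gflops(3.2, 2, 4, 4)

# One petaflops expressed in GFLOPs.
nodes = nodes_for_target(1_000_000, peak)

print(f"per-node peak: {peak:.1f} GFLOPs, nodes for 1 PFLOPs peak: {nodes}")
```

This is peak only; sustained Linpack performance would be lower, so a real machine would need more nodes than this.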

MPI – which helped popularise "jack of all trades, master of none" Infiniband in the HPC space, as all that was needed was low latency and maxed out bandwidth – seems not to be able to effectively address the global shared memory opportunities, even with the MPI-2 extensions.

Simply put, passing messages between nodes with distinct memory spaces, while working well in quite a few applications and problems today, may not compare well with the variety of shared memory space approaches once machines scale to thousands of nodes.
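To make the contrast concrete, here is a toy, single-process Python sketch. It is not MPI or shmem code: `Node`, `send`, `recv`, `put` and `get` are hypothetical names that only mimic the shape of the two models, where message passing needs both sides to take part and shmem-style access lets one side read or write another node's memory directly.

```python
from collections import deque

class Node:
    """One cluster node: private memory plus an inbox for incoming messages."""
    def __init__(self, rank, words):
        self.rank = rank
        self.memory = [0] * words
        self.inbox = deque()

# Message passing: the sender posts a message, and the receiver must
# explicitly accept it and decide where it lands in its own memory.
def send(dst, value):
    dst.inbox.append(value)

def recv(node, offset):
    node.memory[offset] = node.inbox.popleft()

# Shmem-style one-sided access: the initiator names a remote node and an
# offset; the target node's program does not take part in the transfer.
def put(nodes, rank, offset, value):
    nodes[rank].memory[offset] = value

def get(nodes, rank, offset):
    return nodes[rank].memory[offset]

nodes = [Node(rank, words=4) for rank in range(3)]

# Two-sided: node 0 sends 42, node 1 has to receive it.
send(nodes[1], 42)
recv(nodes[1], offset=0)

# One-sided: node 0 writes 99 straight into node 2's memory.
put(nodes, rank=2, offset=0, value=99)

print(get(nodes, 1, 0), get(nodes, 2, 0))
```

On real hardware the one-sided calls would be carried by the interconnect, which is where the latency of the fabric starts to matter.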

Imagine those 10,000 nodes, each with say 192 GB of MetaRAM, providing a petaflops of raw power and nearly two petabytes of RAM - in a single memory space! Even if just virtual application-level space like in shmem. Say, an Oracle or MySQL database having the essential data for every living person on Earth, all in one virtual memory space, with simple reads and writes, local or remote, low latency and high speed either way, finding anything about anyone in no time... a dream for any "Homeland Security" agency I guess.
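The "nearly two petabytes" figure, and the idea of one address space spread over the nodes, can be sketched as follows. This assumes decimal gigabytes and a simple block layout in which each node owns one contiguous slice of the global space; both are assumptions for illustration, and `locate` is a hypothetical helper.

```python
GB = 10**9  # decimal gigabyte (assumption; binary units would give slightly more)

DIMMS_PER_NODE = 12
GB_PER_DIMM = 16
NODES = 10_000

bytes_per_node = DIMMS_PER_NODE * GB_PER_DIMM * GB   # 192 GB per node
total_bytes = bytes_per_node * NODES                 # whole-cluster memory

def locate(global_addr):
    """Map a global byte address to (node index, offset within that node),
    assuming each node owns one contiguous block of the global space."""
    if not 0 <= global_addr < total_bytes:
        raise ValueError("address outside the global space")
    return divmod(global_addr, bytes_per_node)

print(f"total memory: {total_bytes / 10**15:.2f} PB")
print("address 8 bytes into node 1000's slice:",
      locate(1000 * bytes_per_node + 8))
```

Whether an access is "local or remote" then comes down to whether the node index returned matches the node doing the access.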

How can it be so fast across so many nodes? Well, sub-microsecond adapter latency coupled with fat tree federated switch setups, where each hop is around 30 nanoseconds - in the end, the remote access penalty becomes tolerable enough to treat other nodes as an extension of your own memory: ultimately, every node could see the memory of the whole cluster.

So, in the end, you might end up taking just a microsecond to write a word across to a system 1,000 nodes away, or two microseconds to read its feedback back into your own memory, in a kind of VNUMA (Very Non Uniform Memory Access) pattern, huh. Makes a very interesting research topic - optimising the access patterns and minimising latencies across a petabyte or more of shared memory in thousands of tightly-connected nodes.
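A back-of-the-envelope model of those numbers, as a Python sketch. The 30 ns per hop comes from the text; the 800 ns combined adapter latency and the five switch hops (a worst-case path through a three-level fat tree) are assumptions picked only to show how the "one microsecond to write, two to read" estimate could arise.

```python
ADAPTER_NS = 800   # assumed combined adapter latency, "sub-microsecond"
HOP_NS = 30        # per-switch-hop latency quoted in the article

def remote_write_ns(hops):
    """One-way latency: adapter overhead plus every switch hop on the path."""
    return ADAPTER_NS + hops * HOP_NS

def remote_read_ns(hops):
    """A read is a round trip: the request goes out and the data comes back."""
    return 2 * remote_write_ns(hops)

# Assumed worst case in a three-level fat tree: up through three switch
# levels and back down, five switches in total.
hops = 5

print(f"remote write: {remote_write_ns(hops)} ns")
print(f"remote read:  {remote_read_ns(hops)} ns")
```

Note how little the hop count matters under these assumptions: the adapter latency dominates, which is why a node 1,000 nodes away costs about the same as a neighbour.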

Talking about something interesting: unexpectedly to some, it is our fellow Europeans who are strong in this: Quadrics in old medieval Bristol, and Dolphin up in cold Oslo. The big US interconnect vendors were mostly focused on more plain vanilla Gigabit and Infiniband connections. But sharing happy memories enables much tighter, more coupled and fulfilling interconnects with good benchmarking consummation outcomes.

Any thoughts, dear readers? µ


Comments
good

MPI is a very unintuitive standard and programming model. It is about time something better was put in its place.

posted by : Eugen, 29 July 2008
But will it be able to run Crysis?

And what score will it get in 3Dmark06?

Will it be able to run Vista smoothly too?

A real challenge, could it run Crysis and 3Dmark06 on Vista?

posted by : interested_party, 29 July 2008
hmmm...coincidence

Was just reading about X10 yesterday (IBM never could name things).

Has an even better model. Memory can be shared, but the language specifically distinguishes between local and remote (so that horrendous microsecond wait doesn't have to remove the peta from your flops).

posted by : Richard Henderson, 29 July 2008
Supercomputing is solving the real problems now!

MPI and other message-based models are still very good, but fast emulation of shared memory has not evolved at the same pace.

We are starting to see a pattern where memory will come in a lot of different types, from the tiny internal shared memory in a Cell CPU to quite slow but massive flash storage PCIe cards.

This means very complex and fast system designs all over the place.

posted by : V, 29 July 2008
Re:could it run Crysis and 3Dmark06 on Vista?

Yes, but not at the same time

;-)

posted by : Pascal Monett, 29 July 2008
New parallel programming languages already on the horizon

There already are a bunch of PGAS (partitioned global address space) languages (e.g., co-array Fortran, UPC) that could potentially displace MPI, but vendor support for them has been weak. 

More promising, perhaps, is DARPA's sponsorship of 3 new languages - X10 from IBM, Chapel from Cray, and Fortress from Sun. See http://www.hpcwire.com/features/17883329.html . At least 1 of these could become a new "standard". 

Otherwise, I don't get your "cause and effect" relationships here. MPI didn't popularise Infiniband, and virtual shared-memory doesn't enable faster interconnects. If anything, it's the other way round on both counts.

I wouldn't get too carried away by virtualization here. Memory will still be physically distributed, and data will still have to move across interconnects. HPC programmers will ignore those realities at their peril.

posted by : Enda, 29 July 2008
do you really think so?

yah, like no one ever thought of using networks to virtualize shared memory before. but you seem to have dropped a few decimals in your estimates: memory is around 50ns, but networks are around 20x that. yes, modern networks make net-shared-memory (an ooold idea) more tolerable, but it still means pretty amazingly low performance with whole pages flying around willy-nilly. it's also worth remembering that current interconnects go to great lengths to avoid having to frig the MMU all the time, but net-sh-mem is hardly anything _but_ an MMU frigger.

but really, is your article just a teaser for ScaleMP?

posted by : mark hahn, 29 July 2008
Not so fast ...

Actually, MPI is still far better on "true" yet non-uniform shared memory machines, like SGI Altix or multi-socket Opterons. It really really really helps for the application to actually only allocate stuff in the nearby memory, and not somewhere 100ms down the network switch. Even Opterons work way better if MPI is used, since each process then allocates stuff near its own cpu, and not the other ones.

One cannot abstract away where the memory location is while keeping the performance, and so the programmer might as well be doing it him/herself.

posted by : hpc_user, 29 July 2008
Read Jack Dongarra's paper

http://www.netlib.org/utk/people/JackDongarra/PAPERS/adv-comp-darpa-08.pdf

The conclusion I have come to is that once again Intel is behind the curve. As Jack notes in the above paper, Linpack is now 15 years old and obsolete on a functional basis for supercomputing. HPCC is the current benchmark that measures 5 other benchmarks which have equal importance. Using Linpack alone, which has been replaced by Global HPL, is somewhat akin to timing a race horse with a sundial.

That is compounded by what appears at this time to be several ISA incompatibilities between Intel's 64 bit architecture and X10, Fortress, and Chapel. Mike Wolfe gave an excellent paper at SC'06 on the compiler incompatibilities and how Intel specific optimization will cause IEEE754 compliant systems to crash. So software for Intel processors lacks transferability to other machines. That is something that DARPA requires.

In Los Alamos studies, doubling the CPU or memory speed raises the heat dissipation load by a factor of 8. Power consumption of DDR3 is a real issue. Based on this summer's power prices, IBM's $1/watt is too conservative. Annual costs are headed for $1.5/watt based on the latest Western Area Power Administration price projections through March 2009. June hit $135/MWh at the generator. The last number verified by a DOE National Lab shows 12 Mflops/watt for Intel, 18 Mflops/watt for AMD and 100+ for the Blue Gene L/P series. Performance includes peripherals and support services like AC. http://www.cs.berkeley.edu/~samw/research/papers/ipdps08.pdf

To be credible, it is going to have to demonstrate competitive performance in HPCC, not just Linpack; it is going to have to meet the DARPA interoperability standards; and it is going to have to deliver much better power efficiency. That means performance at lower clock speeds. A frequency increase generates a roughly cubic power increase, going by the factor of 8 above.

posted by : Ed Hinders, 29 July 2008
Analysis of... what?

This article is full of random facts, but doesn't seem to actually have a purpose or a conclusion. Did you mean to suggest a technology that would replace MPI?

posted by : Tim, 29 July 2008
Not that simple

First, optimizing message passing is much easier than shared memory. Then there is the dirty memory issue. Oh, and let's see, if a node fails the global memory breaks, and the whole house of cards comes tumbling down. Ever calculate the MTBF for 10,000 of anything?

Second, if you talk to most HPC users about your great idea, you will hear something like the following: "Sigh, been there, done that, got the T-shirt, it don't work".

posted by : deadline, 29 July 2008
Wright Your Multi Thread Software....

Its Intresting that so many well written comments poped up on subject. Heres My lesser:

Arrghhh....Great White on drawing Board. My idea is, How about Multi Thread Software, Something Microsoft Research is Conferencing NOW. This is Much more Than Telecommunications Giant, especially name;Battlefield Mobi goes well with potential demands, In fact its more DMV tracker. yet how multi threaded can DMV Program Go?. 
Searching Out Great White Spermer anit easy, especially in Montana, where they'd thunked I stated that horrible term Spammer. Yet, Take this critter to Next level, Glacial & GO figure: Dunnington. Arrghhhh, with ALL Silver of West Glacier on Battlefield Mobi to Look Part of Pirate(HappyFace Pirate,Of Course), Them Deck Hands Be Amused. Perhaps Milatary Satelite System with Navigation & instant info/cross communications, worldwide.That'll Get Cloud Blower, Out There. 
Getting My Ahab Poon in Wineseller & use left side for Video, right side for Screams & Laughter with my final output one BIG Val. Arrghhhh. give me Super Mobi & give that Machine my dic, BET its worth bitty piece of eight.NEXT Battle:

Petaflop of spermers Vs. Petaflop of Nahalem. I, ahab still conquer, Its Dunnington Thing that Has all female Crew scared.

First Know Multi Threaded Commentos!!!

Via CRAY.Thanks Seymour, I Likes Name & I likes It White.
TS drashek

posted by : Sperm_Whaler, 29 July 2008
Do we get zero-point energy and cold fusion too?

A multi-petabyte, multi-petaflop shared memory system built from a few thousand COTS grade servers sounds great in *theory*. However, mixing a very large shared memory environment with COTS hardware is akin to using a road flare to peer inside a gas can. Shared memory systems do not react well to spontaneous loss of memory space, something that would happen fairly often with 10,000 COTS grade nodes strung up with IB or another hi-perf interconnect.

Unless a very robust fault tolerance schema could be designed in (mirrored node address space, redundant MPIO, etc), one would be better off saving several million quid and just beating one's head against the wall. Either way the result would be the same, so one might as well save the money.

posted by : Jeff Johnson, 30 July 2008
MPI Today, ??? Tomorrow

Bravo, nice conversation starter!

It has become clear that MPI, while effective and useful, presents a barrier to wider adoption of massively parallel systems. The PGAS languages may fix that: it is clear that something must be done. 

On the other hand, today's commodity interconnects aren't very good at moving small chunks of data (latency and overhead are high), don't perform well for random communication patterns, and frequently aren't scalable with respect to cost and reliability. And none of those problems have anything to do with the processors, except that faster processors place additional stress on the communication fabric, which is already overburdened in many clusters.

Make no mistake, I like MPI -- the company I work for built a machine around the concept. But we've noticed that much of what you build for effective message passing is really important for the PGAS type programming models as well. (Conversely, if your fabric isn't very good at MPI, it probably won't deliver very good PGAS performance.)

(For more, take a look at http://www.bigNcomputing.org )


posted by : Matt Reilly, 31 July 2008
Re: New parallel programming languages already on the horizon

The HPCS project began in 2003 ...
and Sun's Fortress was dropped from HPCS in Phase III of the program.

See:
http://www.hpcwire.com/features/Suns_Fortress_Language_Parallelism_by_Default.html

I do not see any REAL new parallel programming languages on the horizon.


posted by : Ami, 01 August 2008
Vista?

Don't be stupid... NOTHING runs Vista smoothly, because Vista is such low quality code.

Vista can't thread properly on 2 cores, let alone hundreds....

ROFL

posted by : 99flake, 08 August 2008
MPI tends to do better

The problem is synchronization of large shared memory multithreaded programs. MPI programmers tend to do better, because MPI queues decentralize contention. It's the same reason communist central planning doesn't work as well as a free market.

posted by : john1p, 10 November 2008