The problem is synchronization of large shared memory multithreaded programs. MPI programmers tend to do better, because MPI queues decentralize contention. It's the same reason communist central planning doesn't work as well as a free market.
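A rough illustration of the contention point above, as a minimal sketch rather than a benchmark. It uses plain Python threads instead of MPI, and the function names `shared_counter` and `message_passing` are made up for this example. The first version makes every thread fight over one lock on every update; the second keeps state private and sends a single message per worker, which is roughly the discipline MPI pushes you toward.

```python
import queue
import threading


def shared_counter(workers: int, increments: int) -> int:
    """Every thread contends on one central lock for every single update."""
    total = 0
    lock = threading.Lock()

    def work() -> None:
        nonlocal total
        for _ in range(increments):
            with lock:  # central point of contention
                total += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing(workers: int, increments: int) -> int:
    """Each worker counts privately and sends one message with its result."""
    inbox: queue.Queue = queue.Queue()

    def work() -> None:
        local = 0  # private state, no lock needed
        for _ in range(increments):
            local += 1
        inbox.put(local)  # a single synchronisation point per worker

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(inbox.get() for _ in range(workers))


if __name__ == "__main__":
    print(shared_counter(4, 1000), message_passing(4, 1000))
```

Both versions compute the same total; the difference is how often the workers have to synchronise. CPython's global interpreter lock means this will not show a realistic speedup, so treat it as a picture of the structure only.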
Don't be stupid... NOTHING runs Vista smoothly, because Vista is such low quality code.

Vista can't thread properly on 2 cores, let alone hundreds....

ROFL
Re: New parallel programming languages already on the horizon
The HPCS project began in 2003 ...
and Sun's Fortress was dropped from HPCS in Phase III of the program.

See:
http://www.hpcwire.com/features/Suns_Fortress_Language_Parallelism_by_Default.html

I do not see any REAL new parallel programming languages on the horizon.

Bravo, nice conversation starter!

It has become clear that MPI, while effective and useful, presents a barrier to wider adoption of massively parallel systems. The PGAS languages may fix that: it is clear that something must be done. 

On the other hand, today's commodity interconnects aren't very good at moving small chunks of data (latency and overhead are high), don't perform well for random communication patterns, and frequently aren't scalable with respect to cost and reliability. And none of those problems have anything to do with the processors, except that faster processors place additional stress on the communication fabric, which is already overburdened in many clusters.

Make no mistake, I like MPI -- the company I work for built a machine around the concept. But we've noticed that much of what you build for effective message passing is really important for the PGAS type programming models as well. (Conversely, if your fabric isn't very good at MPI, it probably won't deliver very good PGAS performance.)

(For more, take a look at http://www.bigNcomputing.org )

A multi-petabyte, multi-petaflop shared memory system built from a few thousand COTS grade servers sounds great in *theory*. However, mixing a very large shared memory environment with COTS hardware is akin to using a road flare to peer inside a gas can. Shared memory systems do not react well to spontaneous loss of memory space, something that would happen fairly often with 10,000 COTS grade nodes strung up with IB or another hi-perf interconnect.

Unless a very robust fault tolerance scheme could be designed in (mirrored node address space, redundant MPIO, etc.), one would be better off saving several million quid and just beating one's head against the wall. Either way the result would be the same, so one might as well save the money.
It's interesting that so many well-written comments popped up on this subject. Here's my lesser one:

Arrghhh....Great White on drawing Board. My idea is, How about Multi Thread Software, Something Microsoft Research is Conferencing NOW. This is Much more Than Telecommunications Giant, especially name;Battlefield Mobi goes well with potential demands, In fact its more DMV tracker. yet how multi threaded can DMV Program Go?. 
Searching Out Great White Spermer ain't easy, especially in Montana, where they'd thunked I stated that horrible term Spammer. Yet, Take this critter to Next level, Glacial & GO figure: Dunnington. Arrghhhh, with ALL Silver of West Glacier on Battlefield Mobi to Look Part of Pirate (HappyFace Pirate, Of Course), Them Deck Hands Be Amused. Perhaps Military Satellite System with Navigation & instant info/cross communications, worldwide. That'll Get Cloud Blower, Out There.
Getting My Ahab Poon in Wineseller & use left side for Video, right side for Screams & Laughter with my final output one BIG Val. Arrghhhh. give me Super Mobi & give that Machine my dic, BET its worth bitty piece of eight.NEXT Battle:

Petaflop of spermers Vs. Petaflop of Nehalem. I, ahab still conquer, It's Dunnington Thing that Has all female Crew scared.

First Know Multi Threaded Commentos!!!

Via CRAY.Thanks Seymour, I Likes Name & I likes It White.
TS drashek
First, optimizing message passing is much easier than optimizing shared memory. Then there is the dirty memory issue. Oh, and let's see: if a node fails, the global memory breaks and the whole house of cards comes tumbling down. Ever calculate the MTBF for 10,000 of anything?

Second, if you talk to most HPC users about your great idea, you will hear something like the following: "Sigh, been there, done that, got the T-shirt, it don't work".
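For anyone who has not done the MTBF calculation, here is a minimal sketch. It assumes independent nodes with constant failure rates (so the rates simply add), and it assumes that any single node failure takes down the whole shared memory image. The 5-year per-node MTBF is a made-up illustrative figure, not a measurement.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """MTBF of the whole machine when any single node failure brings it down.

    Assumes independent nodes with constant (exponential) failure rates,
    so the system failure rate is node_count times the per-node rate.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    # Hypothetical figure: each node fails on average once every 5 years.
    node_mtbf = 5 * 365 * 24  # 43,800 hours
    for nodes in (1, 100, 10_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        print(f"{nodes:>6} nodes -> system MTBF {mtbf:10.2f} h")
```

Under those assumptions, 10,000 nodes with a 5-year MTBF each give a system-wide MTBF of roughly 4.4 hours, which is the house-of-cards problem in one number.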
This article is full of random facts, but doesn't seem to actually have a purpose or a conclusion. Did you mean to suggest a technology that would replace MPI?
http://www.netlib.org/utk/people/JackDongarra/PAPERS/adv-comp-darpa-08.pdf

The conclusion I have come to is that once again Intel is behind the curve. As Jack notes in the above paper, Linpack is now 15 years old and functionally obsolete for supercomputing. HPCC is the current benchmark suite, and it measures 5 other benchmarks of equal importance. Using Linpack alone, which has been replaced by Global HPL, is somewhat akin to timing a race horse with a sundial.

That is compounded by what appear at this time to be several ISA incompatibilities between Intel's 64-bit architecture and X10, Fortress, and Chapel. Mike Wolfe gave an excellent paper at SC'06 on the compiler incompatibilities and how Intel-specific optimization will cause IEEE 754 compliant systems to crash. So software for Intel processors lacks transferability to other machines, which is something that DARPA requires.

In Los Alamos studies, doubling the CPU or memory speed raises the heat dissipation load by a factor of 8. Power consumption of DDR3 is a real issue. Based on this summer's power prices, IBM's $1/watt is too conservative. Annual costs are headed for $1.5/watt based on the latest Western Area Power Administration price projections through March 2009. June hit $135/MWh at the generator. The last number verified by a DOE National Lab shows 12 MFLOPS/watt for Intel, 18 MFLOPS/watt for AMD and 100+ for the Blue Gene/L and /P series. Performance includes peripherals and support services like AC (see http://www.cs.berkeley.edu/~samw/research/papers/ipdps08.pdf).

To be credible, it is going to have to demonstrate competitive performance in HPCC, not just Linpack; it is going to have to meet the DARPA interoperability standards; and it is going to have to deliver much better power efficiency. That means performance at lower clock speeds. A frequency increase generates a disproportionate (roughly cubic, going by the factor of 8 above) power increase.
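The power arithmetic above can be checked with a short sketch. Two assumptions are mine, not the commenter's: that "annual cost per watt" means running 1 W continuously for a year at the quoted generator price, and that the factor of 8 comes from the usual dynamic power model P ≈ C·V²·f with supply voltage scaled in proportion to frequency. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_cost_per_watt(price_per_mwh: float) -> float:
    """Dollars to run a constant 1 W load for one year at a given $/MWh."""
    mwh_per_watt_year = HOURS_PER_YEAR / 1_000_000  # 1 W for a year, in MWh
    return price_per_mwh * mwh_per_watt_year


def relative_dynamic_power(freq_ratio: float) -> float:
    """Dynamic power relative to baseline when the clock scales by freq_ratio.

    Assumes P ~ C * V^2 * f and that voltage scales linearly with frequency,
    which gives cubic growth in power.
    """
    return freq_ratio ** 3


if __name__ == "__main__":
    print(f"$135/MWh -> ${annual_cost_per_watt(135):.2f} per watt-year")
    print(f"2x clock -> {relative_dynamic_power(2):.0f}x dynamic power")
```

At $135/MWh this works out to about $1.18 per watt-year at the generator, before any cooling or distribution overhead, and doubling the clock under the stated model gives the factor of 8 quoted from the Los Alamos studies.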
Actually, MPI is still far better on "true" yet non-uniform shared memory machines, like SGI Altix or multi-socket Opterons. It really really really helps for the application to actually only allocate stuff in the nearby memory, and not somewhere 100ms down the network switch. Even Opterons work way better if MPI is used, because each process then allocates stuff near its own CPU and not near the other ones.

One cannot abstract away where the memory location is while keeping the performance, and so the programmer might as well be doing it him/herself.
yah, like no one ever thought of using networks to virtualize shared memory before. but you seem to have dropped a few decimals in your estimates: memory is around 50ns, but networks are around 20x that. yes, modern networks make net-shared-memory (an ooold idea) more tolerable, but it still means pretty amazingly low performance with whole pages flying around willy-nilly. it's also worth remembering that current interconnects go to great lengths to avoid having to frig the MMU all the time, but net-sh-mem is hardly anything _but_ an MMU frigger.
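To put numbers on the latency gap above, here is a minimal sketch of average access time as a function of how often an access goes remote. It takes the figures from the comment (50 ns local, 20x that for the network) as defaults and ignores page-transfer time and MMU overhead, so real net-shared-memory would do worse. The function name is hypothetical.

```python
def effective_latency_ns(remote_fraction: float,
                         local_ns: float = 50.0,
                         remote_ns: float = 1000.0) -> float:
    """Average access latency when a fraction of accesses cross the network.

    Defaults follow the comment above: ~50 ns local memory, network ~20x that.
    Page-sized transfers and MMU/TLB costs are deliberately left out.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    for fraction in (0.0, 0.01, 0.1, 0.5):
        print(f"{fraction:>5.0%} remote -> {effective_latency_ns(fraction):7.1f} ns")
```

Even with only 10% of accesses going remote, the average comes out at about 145 ns, nearly three times the local figure, which is why locality cannot be abstracted away for free.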

but really, is your article just a teaser for ScaleMP?
New parallel programming languages already on the horizon
There already are a bunch of PGAS (partitioned global address space) languages (e.g., co-array Fortran, UPC) that could potentially displace MPI, but vendor support for them has been weak. 

More promising, perhaps, is DARPA's sponsorship of 3 new languages - X10 from IBM, Chapel from Cray, and Fortress from Sun. See http://www.hpcwire.com/features/17883329.html . At least 1 of these could become a new "standard". 

Otherwise, I don't get your "cause and effect" relationships here. MPI didn't popularise Infiniband, and virtual shared-memory doesn't enable faster interconnects. If anything, it's the other way round on both counts.

I wouldn't get too carried away by virtualization here. Memory will still be physically distributed, and data will still have to move across interconnects. HPC programmers will ignore those realities at their peril.
MPI and other message-based models are still very good, but fast emulation of shared memory has not evolved at the same pace.

We are starting to see a pattern where memory will come in a lot of different types, from the tiny internal shared memory in a Cell CPU to quite slow but massive flash storage on PCIe cards.

This means very complex and fast system designs all over the place.
Was just reading about X10 yesterday (IBM never could name things).

It has an even better model. Memory can be shared, but the language specifically distinguishes between local and remote (so that horrendous microsecond wait doesn't have to remove the peta from your flops).
And what score will it get in 3Dmark06?

Will it be able to run Vista smoothly too?

A real challenge, could it run Crysis and 3Dmark06 on Vista?
Yes, but not at the same time

;-)
MPI is a very unintuitive standard and programming model. It is about time something better was put in its place.