THE NEW GPUS from both Nvidia and ATI - sorry, AMD, or Daamit - claim teraflops-class performance. OK, for the generic US$ 500+ GTX280, you'll need to clock the GPU at 650MHz or above, with a correspondingly increased shader clock of around 1.5GHz, to get that performance.
As for the HD4850, it is advertised as the first graphics card with 1 tflops of single precision theoretical peak performance for less than US$ 200 - its faster cousin, the HD4870, is rated at 1.2 tflops for less than US$ 300.
The "professional high performance computing" versions, the Tesla 10-series for Nvidia and Firestream for ATI, will differ in having larger on-card memory to fit large HPC datasets, possibly slightly higher FP performance, and - of course - no graphics outputs. And a higher price, natch.
Today, let's look at how the two compare on raw hardware capabilities and scaling. We'll be covering more on the increased application usage of GPGPUs despite their limitations, the programming pros and cons, as well as the competition, in subsequent GPGPU-watch stories. Yes, ultimately, it may be more important who will run the new Photoshop filters or GPGPU antivirus first.
So, the landmark 1 Tflop milestone has been passed: how much of it do you really get at the end? A big chunk is determined, of course, by optimised libraries and the programming environment - what about the hardware limitations?
First, GPUs don't have nearly as much cache or other memory on chip as typical CPUs. The sustainable speed depends more on the memory bandwidth and the memory controller efficiency for long bursts, where the initial latency can be hidden - clever pre-fetching techniques can help, too.
Let's even say that either of the GPUs mentioned here could sustain 100 GByte/s read or write speed to its local memory when using those long burst transfers. To sustain, say, half a teraflop while writing just one 32-bit single precision FP number per flop out to memory - not even counting the reads for the operands - would take two terabytes/s of sustained memory bandwidth if the data lived in local memory. Doable? Maybe, with those proposed Rambus next-gen memories, but not with the current stuff. Working on loops within the GPU chip's internal memory would alleviate this.
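The arithmetic behind that two terabytes/s figure can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name and the one-write-per-flop assumption are ours for illustration, and the 100 GByte/s figure is the generous guess from the text, not a measured number.

```python
# Back-of-the-envelope model: every flop writes one 32-bit float to local memory.
# Operand reads are ignored, exactly as in the text, so this is a lower bound.
BYTES_PER_SP_FLOAT = 4  # one 32-bit single precision number


def required_bandwidth(flops_per_s, floats_per_flop=1.0):
    """Bytes per second needed to stream results at the given flop rate."""
    return flops_per_s * floats_per_flop * BYTES_PER_SP_FLOAT


sustained_flops = 0.5e12  # half a teraflop, as in the text
available_bw = 100e9      # assumed 100 GByte/s sustained local memory speed

needed_bw = required_bandwidth(sustained_flops)
shortfall = needed_bw / available_bw
# Flop rate the assumed memory could feed if every result went out to memory.
memory_bound_flops = available_bw / BYTES_PER_SP_FLOAT

print(f"needed: {needed_bw / 1e12:.1f} TB/s")
print(f"shortfall: {shortfall:.0f}x the assumed bandwidth")
print(f"memory-bound rate: {memory_bound_flops / 1e9:.0f} GFLOPs")
```

Under these assumptions the memory would have to be some 20 times faster, or, put the other way round, a kernel that writes every result to card memory would top out at around 25 GFLOPs. That is why keeping the working set in registers and on-chip memory matters so much.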
Secondly, let's talk about double precision FP: after all, while you can use single precision in some tasks, or parametrisation for the bigger jobs, IEEE standard DP FP is still the mainstay of most scientific and engineering codes.
Now, the Nvidia GTX280 (and the equivalent Tesla card) offers DP throughput at 1/8 of the SP peak, i.e. just over 125 GFLOPs for the new Tesla 10 series or the GTX280 OC. The ATI 4800 series offers DP at 1/4 of the SP peak performance, i.e. 300 GFLOPs on the HD4870. In either case, if solely dependent on the card memory throughput, it wouldn't be easy to get anywhere near that peak. But of course, the hundreds of stream processors in these GPUs can hold some data in their local registers and shared memory, to be processed at full speed.
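As a quick check on those DP numbers, here is a minimal Python sketch that applies the quoted DP-to-SP ratios to the quoted SP peaks. The round 1,000 GFLOPs SP figure for the overclocked GTX280 is our assumption (the text only says "just over 125" for DP), and the dictionary and function names are illustrative.

```python
# Peak DP throughput derived from peak SP throughput and the DP/SP ratio
# quoted in the article. These are theoretical peaks, not sustained rates.
def dp_peak_gflops(sp_peak_gflops, dp_ratio):
    return sp_peak_gflops * dp_ratio


# (SP peak in GFLOPs, DP/SP ratio); the 1000.0 is an assumed round number.
cards = {
    "GTX280 OC": (1000.0, 1 / 8),
    "HD4870": (1200.0, 1 / 4),
}

dp_peaks = {name: dp_peak_gflops(sp, ratio) for name, (sp, ratio) in cards.items()}

for name, gflops in dp_peaks.items():
    print(f"{name}: {gflops:.0f} DP GFLOPs peak")

# How far ahead ATI's DP peak is under these ratios.
ati_advantage = dp_peaks["HD4870"] / dp_peaks["GTX280 OC"]
print(f"HD4870 DP advantage: {ati_advantage:.1f}x")
```

Changing the ratios in the `cards` table is all it takes to see how sensitive the comparison is to the exact DP-to-SP figure each vendor's hardware really delivers.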
At the recent Tesla briefing, Nvidia suggested that its four-card slim rackmount box, with 16GB and four single precision tflops, should be able to reach somewhere around 350 GFLOPs Linpack Rmax (the measured maximum) in double precision. For a box costing somewhere around US$ 9,000, that is a great number - as long as your app can get anywhere near it. It is more than quadruple the speed of a dual-CPU Skulltrail overclocked to 4GHz in that same Linpack DP - at about the same cost. The problem? Your app has to be coded in CUDA to use the GPUs at all, so there will be far more software that can make use of 80 GFLOPs on that Skulltrail than 350 GFLOPs on the custom Nvidia box.
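To put those Linpack numbers side by side, here is a rough Python sketch. All inputs come from the figures above, with two assumptions of ours clearly flagged: the Skulltrail is priced at the same US$ 9,000 ("about the same cost"), and the box's DP peak is taken as 1/8 of its four SP tflops.

```python
# Rough comparison of the four-card Tesla box against an overclocked Skulltrail.
tesla_rmax_gflops = 350.0      # Nvidia's suggested Linpack Rmax, DP
skulltrail_rmax_gflops = 80.0  # Skulltrail Linpack DP figure from the text
box_price_usd = 9000.0         # approximate Tesla box price from the text
skulltrail_price_usd = 9000.0  # ASSUMPTION: "about the same cost"

tesla_sp_peak_gflops = 4000.0                     # four SP tflops
tesla_dp_peak_gflops = tesla_sp_peak_gflops / 8   # ASSUMPTION: 1/8 DP ratio

speedup = tesla_rmax_gflops / skulltrail_rmax_gflops
linpack_efficiency = tesla_rmax_gflops / tesla_dp_peak_gflops
tesla_usd_per_gflop = box_price_usd / tesla_rmax_gflops
skulltrail_usd_per_gflop = skulltrail_price_usd / skulltrail_rmax_gflops

print(f"speedup over Skulltrail: {speedup:.2f}x")
print(f"Rmax as share of DP peak: {linpack_efficiency:.0%}")
print(f"Tesla box: ${tesla_usd_per_gflop:.2f} per DP GFLOP")
print(f"Skulltrail: ${skulltrail_usd_per_gflop:.2f} per DP GFLOP")
```

On these figures the Tesla box would be running Linpack at about 70 per cent of its theoretical DP peak, which is a respectable efficiency if Nvidia's estimate holds up in practice.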
The GTX280 chip's wide 512-bit memory bus, considered a burden among gaming GPUs as it complicates both the die and board design, is a huge plus in technical computing use. Simply put, for a given memory technology, when you need to max out the capacity, you'll always have double the possible capacity - and bandwidth - with the green goblin's cards. The new Tesla cards have 4GB of GDDR3 RAM per card - even though it is more conservatively clocked due to the dual-rank mounting and higher loads, it still allows packing that much more data into the fast local memory rather than losing 10x performance when going over PCI-E to the system memory.
The ATI side compensates for the narrower 256-bit bus with faster GDDR5 memory. However, not only does this cut the maximum capacity by half, it also requires higher capacity GDDR5 chips - still very rare - to go to 2GB or more.
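The capacity argument comes down to how many memory chips a bus can host. A minimal Python sketch follows, assuming 32-bit wide memory devices (typical for GDDR3 and GDDR5) and hypothetical chip densities of 1 Gbit and 2 Gbit; the function and its parameter names are ours, and two chips per channel stands in for dual-rank or similar doubled-up mounting.

```python
# Maximum on-card memory for a given bus width and chip density.
# ASSUMPTIONS: each memory device has a 32-bit interface, and densities
# of 1 Gbit / 2 Gbit are illustrative rather than confirmed part choices.
def max_capacity_mb(bus_width_bits, chip_density_mbit,
                    chips_per_channel=1, chip_io_bits=32):
    """Return (number of chips, total capacity in MB)."""
    channels = bus_width_bits // chip_io_bits
    chips = channels * chips_per_channel
    return chips, chips * chip_density_mbit // 8


configs = {
    "512-bit, 1 Gbit chips": max_capacity_mb(512, 1024),
    "512-bit, 1 Gbit chips, 2 per channel": max_capacity_mb(512, 1024, 2),
    "256-bit, 1 Gbit chips": max_capacity_mb(256, 1024),
    "256-bit, 2 Gbit chips": max_capacity_mb(256, 2048),
}

for name, (chips, mb) in configs.items():
    print(f"{name}: {chips} chips, {mb} MB")
```

With the same chips, the 512-bit bus always lands on double the capacity of the 256-bit one, and the narrower bus only gets to 2GB by moving to denser chips or by doubling up the devices on each channel.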
So, from the raw hardware point of view, ATI offers higher peak SP and much higher peak DP flops, but its narrower memory bus could turn it into a little bit of a capacity expansion 'flop' for those computing apps in need of more on-board memory. How do the two, Tesla and Firestream, compare internally at the chip level? What about the software? You'll have to wait to read all about that soon. µ

Man u suck .... I was wanting to know more than the title gave....

Yes all this and more can be gathered from the Cineplay 2.0 articles...
AMD/ATI has the possibility of using "clamshell-mode" for the GDDR5 memories and thereby have 16 mem-chips on a card.
So the 256-bit bus is not as big a drawback as you seem to think.
One could argue that Nvidia could use a mem-controller for GDDR5@512-bit to reach double the memory against the competition, but they would need to put 32 mem-chips on the board for that.
Somehow I would like to see them try. =)
No matter the definition, the first Teraflop SP single card was the HD3870 X2.

The first single GPU to do it would be the RV770/HD4850

As for the memory scaling and size issue the effect of that won't be fully known until the benefits/limits of the R700's memory/xfire properties are fully divulged. The CrossfireX-Sideport and changes to GDDR5 and the PCI-Express bridge could make it a 2 Teraflop card with similar memory benefits.

The biggest problem for AMD is far fewer options for configuration than TESLA. Rackmount options need to improve from DAAMIT.

It's far from over this round.
Nebojsa, not wanting to seem too picky, but the peak DP rates for the GTX280 are actually 1/12 of peak SP (or 78 DP GFLOPS for the GTX280 and 90 DP GFLOPS for the higher clocked S1070 cards), because they count the second SP MUL in the peak SP figures, which cannot be dual-issued with the DP MAD.

Similarly, the peak AMD DP rates are, I believe, actually 1/5 of peak, as it is the fat ALU that does the DP ops, and so the peak figures are 200/240 DP GFLOPS for the 4850/4870 and presumably 240 for the FireStream 9250 when it finally arrives.
I've heard quite the opposite that the fat SPU (or ALU as you've said) doesn't do any DP ops.

Instead, the other SPUs in the US combine to do DP ops. So if you exclude the fat ALUs and combine the other ALUs, you get the same 1/5 DP FLOPS figure.
Much ink and little info. Please, improve your writing skills. It shows that you have had no training, or no interest in learning on your own how to convey information in written media. You wasted my time, and time is the most valuable asset a person can spend.