WE SAW that the brand new June 2008 edition GPUs from both the green and red camps have IEEE-compliant FP in both single and double precision. We could also see that, once you enter the GPGPU realm, some naming conventions change.
For instance, what was known simply as a "shader" when doing 3-D graphics is now a "thread processor" with its own integer, FP and special purpose ALUs fed by a local register file. Eight of these, fed by 16KB of local full-speed shared memory, form a "TPA - Thread Processor Array". The GTX280 and Tesla 10 have 30 of these TPAs, running at 1.3GHz on the GTX280 and a bit faster, 1.5GHz, on the Tesla S1070 rack mount unit, which is what it takes to cross the teraflop peak barrier. The GTX260 has a few fewer: 24 TPAs for a total of 192 "thread processors".
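As a rough sanity check on that teraflop claim, the peak figure can be worked out from the numbers above. This is a back-of-the-envelope sketch, not vendor data: the function and constant names are made up for illustration, and the rate of 3 single precision operations per thread processor per clock is our assumption, not something stated in this article.

```python
# ASSUMPTION (not from the article): each thread processor can issue
# 3 single precision operations per shader clock.
FLOPS_PER_CLOCK = 3
PROCESSORS_PER_TPA = 8  # from the article: eight thread processors per TPA


def peak_sp_gflops(tpa_count, shader_clock_ghz):
    """Peak SP GFLOPS = thread processors x flops per clock x clock in GHz."""
    thread_processors = tpa_count * PROCESSORS_PER_TPA
    return thread_processors * FLOPS_PER_CLOCK * shader_clock_ghz


gtx280_gflops = peak_sp_gflops(30, 1.3)      # 30 TPAs at 1.3GHz
tesla_gflops = peak_sp_gflops(30, 1.5)       # same TPA count at 1.5GHz
gtx260_processors = 24 * PROCESSORS_PER_TPA  # 24 TPAs, clock not needed here

print(f"GTX280 peak: {gtx280_gflops:.0f} GFLOPS")
print(f"Tesla peak at 1.5GHz: {tesla_gflops:.0f} GFLOPS")
print(f"GTX260 thread processors: {gtx260_processors}")
```

On those assumptions the GTX280 lands at roughly 936 GFLOPS and the 1.5GHz part at roughly 1080 GFLOPS, which would explain why the higher clock is needed to get past the teraflop mark.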
On the ATI side, the 800 shaders in the Radeon HD4870 or FireStream 9250 are now called "stream cores". Note that the cheap HD4850 also has the full 800 "stream cores" - Daamit didn't disable shaders on the economy-class card, just used lower-speed bins, an approach we prefer.
Not to forget, each TPA has one DP FP unit as well as a special function unit - read graphics? It's a GPU, after all. But since we get a peak performance penalty of nearly an order of magnitude for using DP FP on Nvidia GPUs, and just a bit less on ATI's, how far can we go using their mainstay single precision FP?
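To put a rough number on that penalty for the Nvidia part, here is a small sketch using the unit counts given in the text (30 TPAs, eight thread processors and one DP unit each, 1.3GHz). The per-clock rates are assumptions of ours and not from the article: 3 operations per clock for a single precision thread processor, 2 per clock (one fused multiply-add) for the DP unit. The helper name is hypothetical.

```python
def peak_gflops(units, flops_per_clock, clock_ghz):
    """Peak GFLOPS for a given number of identical units."""
    return units * flops_per_clock * clock_ghz


TPAS = 30
CLOCK_GHZ = 1.3

# ASSUMPTION: 3 SP flops per clock per thread processor, 8 per TPA.
sp_peak = peak_gflops(TPAS * 8, 3, CLOCK_GHZ)
# ASSUMPTION: 2 DP flops per clock (one fused multiply-add), 1 DP unit per TPA.
dp_peak = peak_gflops(TPAS * 1, 2, CLOCK_GHZ)

penalty = sp_peak / dp_peak
print(f"SP peak {sp_peak:.0f} GFLOPS, DP peak {dp_peak:.0f} GFLOPS, "
      f"ratio {penalty:.1f}x")
```

With these assumptions the ratio comes out at about 12x, in the same ballpark as the order-of-magnitude penalty mentioned above.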
Believe it or not, quite far. Start with many simulation routines, such as reverse time migration for oil and gas exploration seismic apps. Or how about astrophysical stuff like black hole dynamics, one example Nvidia often mentions? And oh yeah, Microsoft Excel Monte Carlo simulations are unlikely to need double precision FP either - just like most other finance-related apps.
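Why Monte Carlo work tends to tolerate single precision can be shown with a toy example: the statistical noise of the sampling is far larger than the rounding error. The sketch below estimates pi rather than anything financial, and emulates single precision on the CPU by round-tripping values through a 32-bit float with Python's struct module. The helper names and the seed are ours, picked only for illustration.

```python
import random
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def estimate_pi(samples, single, seed=2008):
    """Monte Carlo estimate of pi from random points in the unit square."""
    rng = random.Random(seed)
    rnd = f32 if single else (lambda v: v)
    hits = 0
    for _ in range(samples):
        x = rnd(rng.random())
        y = rnd(rng.random())
        # Every intermediate result is rounded when emulating single precision.
        if rnd(rnd(x * x) + rnd(y * y)) <= 1.0:
            hits += 1
    return 4.0 * hits / samples


pi_single = estimate_pi(100_000, single=True)
pi_double = estimate_pi(100_000, single=False)
print(f"single: {pi_single:.5f}  double: {pi_double:.5f}")
```

Both runs see the same random points, so any gap between the two estimates comes from rounding alone, and it should be tiny next to the sampling error of either estimate.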
Another suitable approach in other apps that still prefer DP FP is "parametrisation". Say you are looking for a particular solution, and have to explore a hundred different computation variants before you select the closest one for the final round. Simple: run all hundred cases quickly in single precision, as long as that precision loss doesn't significantly impact the final result. Nvidia was showing examples reaching improvements of two orders of magnitude over the current best CPUs in some cases. Even if you take 100x with a big grain of salt, a time saving of 10x or more may just make it worthwhile.
Then, once the closest run is selected, put it through the double-precision motions on that same GPU or, if need be, on the CPU. Overall, that is still a manifold time saving.
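The screen-then-refine idea can be sketched in a few lines. This is a toy under stated assumptions, not GPU code: the "simulation" is a made-up compound growth loop, the target value and candidate list are hypothetical, and single precision is emulated on the CPU by round-tripping values through a 32-bit float.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def simulate(rate, steps=200, single=False):
    """Toy stand-in for a real simulation: x <- x + rate * x * dt."""
    rnd = f32 if single else (lambda v: v)
    x = 1.0
    dt = rnd(0.01)
    rate = rnd(rate)
    for _ in range(steps):
        x = rnd(x + rnd(rnd(rate * x) * dt))
    return x


TARGET = 2.0  # hypothetical value the final state should be closest to
candidates = [0.01 * i for i in range(1, 101)]  # a hundred variants

# Pass 1: screen every variant in (emulated) single precision.
best = min(candidates, key=lambda r: abs(simulate(r, single=True) - TARGET))

# Pass 2: rerun only the winner in double precision.
final = simulate(best, single=False)
print(f"best rate {best:.2f}, double precision result {final:.6f}")
```

The scheme only pays off when the rounding error of the fast pass is smaller than the gap between neighbouring candidates, which is the "as long as that precision loss doesn't significantly impact the final result" condition above.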
Any image, audio and video processing, ray tracing and antivirus routines would also be perfectly happy with single precision FP, by the way. So, it's not all double precision, obviously.
In summary, as long as the memory access requirements of your app don't impede it, that peak teraflop in single precision may end up being quite useful in reaching some really stratospheric performance, provided the GPGPU is working on large streams or arrays of data without much unpredictable jumping around. That latter case, well, is something that CPUs are still way better at.
Next we look at the "Thread Processor" vs "Stream Core", followed by CUDA vs AMD SDK. µ
I believe NV didn't disable the shaders; rather, they can get parts with one or two defective shaders working and sell them.

                             GTX 280    GTX 260    9800 GX2     9800 GTX   8800 GTS 512  8800 GT
Stream Processors            240        192        256          128        128           112
Texture Address / Filtering  80 / 80    64 / 64    128 / 128    64 / 64    56 / 56       56 / 56
ROPs                         32         28         32           16         16            16
Core Clock                   602MHz     576MHz     600MHz       675MHz     650MHz        600MHz
Shader Clock                 1296MHz    1242MHz    1500MHz      1690MHz    1625MHz       1500MHz
Memory Clock                 1107MHz    999MHz     1000MHz      1100MHz    970MHz        900MHz
Memory Bus Width             512-bit    448-bit    256-bit x 2  256-bit    256-bit       256-bit
Frame Buffer                 1GB        896MB      1GB          512MB      512MB         512MB
Transistor Count             1.4B       1.4B       1.5B         754M       754M          754M
Manufacturing Process        TSMC 65nm  TSMC 65nm  TSMC 65nm    TSMC 65nm  TSMC 65nm     TSMC 65nm
Price Point                  $650       $400       $500         $300       $280          $170-$230

To read the chart: the first number in each row is for the GTX 280, the next for the GTX 260, then the 9800 GX2, and so on, in exactly the same order as the cards in the first row.
There's so much to remember, like a heart, brain, kidney and lung transplant with a free pancreas. Where do you start?
PhysX is a big booster. Some Vantage score (whichever one it is, ops/s = 125,000) up from the 20,000 range is unbelievable, yet apparently true. Just having PhysX is more key than any toolbar hardware/math report indicates. The 4870 beats the 9800 by $100 and posts higher Vantage numbers, so it has the better 16 ROPs, yet there is no PhysX in either. How amateurish. The 280 does a lot more with fewer stream processors than the 9800, while a stronger 4870 with a 970MHz core has been announced as coming. Go figure. Save your brains, you might need them.
Someone is going to have to mix the cake in a lot of varieties to find the true top.

Stole my own number chart. drashek
Finally some technical info here at the Inq, instead of the ordinary biased trash.
Try it ... on anything but the most trivial scenes it breaks horribly. Even double precision is only barely adequate in some cases (think intersecting isosurfaces).

For scientific apps, single precision is at best good enough for a rough guess of the answer. You can see this by glancing over papers in the relevant fields - the only people doing single precision work are the people using GPGPUs. And the only use their code gets is to show some factor of improvement for some particular configuration that doesn't break when using single precision. Anyone doing the "real" work in the field has requirements that can't be satisfied by single precision.

To put this in context, I'm in the group of people using (or trying to use) GPUs for scientific applications, rather than the group doing the real work. Most of my time is spent trying to find parts of the problem that take up a lot of time and don't break under single precision.
We are always faced with a crossroads when it comes to technological innovation: from 16 to 32 to 64 bit, from CLI to GUI. Yes, you can now include the move from SP to DP. At this moment there are few things that will take advantage of double precision, but that's partially due to the lack of relatively cheap DP-capable equipment, particularly DP equipment that isn't a nightmare to code on.

The areas it will likely be used in, research, educational environs, et al, are not like the retail sector. The retail sector is too cheap to move from 32-bit to 64-bit software and drivers right now. Look at the research realm, and ask yourself how many mid to high end research environs still use 32-bit. I know of several friends in particle physics alone that switched before MS had even created XP x64 edition (much less Vista) because of the advantages. I know they will love DP in their calculations (as will those involved in fluid dynamics and numerous other engineering fields).

The present always seems like enough, unless you were looking for more in the first place. The present always seems like enough, until you're dragged into the next generation and realize what you were missing. I doubt single precision will be enough for anyone that could, just could, benefit in a tiny way from double precision.
This may be the most abbreviation-soaked "what does it mean in the real world?" article ever. This is the kind of thing that gives techs a bad name. ;o)
You guys should educate yourselves before speaking as an authority on technical subjects. You simply miss the point in so many ways. But the bottom line is that insiders know that there are many, many applications being written right now where IEEE double precision floats are required. You will see the first applications in scientific and engineering apps; things like FEA of CAD models simply BEG for acceleration and have for years. Then there's image processing, NLP, etc etc etc. I don't have time to enumerate the many apps where this will be big, but the bottom line is that you are confused. Take a breather, get the college education you put off to surf the web, and come back able to talk sense rather than spew baloney, and we will all be the better for it.