CHIP DESIGNER Nvidia has announced the latest version of its Cuda toolkit for developing parallel applications that run on the firm's graphics processing units (GPUs).
New features of the Cuda toolkit 4.0 include automatic performance analysis in the visual profiler, added support for Mac OS X, C++ virtual functions and a new GPU binary disassembler. A release candidate of the Cuda toolkit 4.0 will be available free of charge from 4 March for those registered with the developer program.
It is not entirely clear whether all of them are new, but according to Nvidia the three main features of Cuda 4.0 are support for peer-to-peer communication among GPUs within a single server or workstation, unified virtual addressing across main system memory and GPU memories, and open source C++ parallel algorithms.
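For readers wondering what the first two features look like in practice, here is a minimal sketch, assuming a machine with at least two Cuda-capable GPUs. The function name `copy_between_gpus` is ours, purely for illustration. It uses `cudaMemcpyDefault`, which relies on unified virtual addressing to work out the copy direction from the pointers, and `cudaMemcpyPeer` to move data from one GPU to the other. Whether the peer copy actually travels directly between the cards depends on the hardware; this is not Nvidia's own sample code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: host -> GPU 0 -> GPU 1 -> host round trip.
// Returns true only if two GPUs are present and the data survives intact.
bool copy_between_gpus(int n) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return false;

    // cudaMemcpyDefault is only valid when unified virtual addressing is on,
    // so fall back to explicit directions otherwise.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    const bool uva = prop.unifiedAddressing != 0;
    const cudaMemcpyKind toDevice = uva ? cudaMemcpyDefault : cudaMemcpyHostToDevice;
    const cudaMemcpyKind toHost   = uva ? cudaMemcpyDefault : cudaMemcpyDeviceToHost;

    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    std::vector<int> src(n), dst(n, -1);
    for (int i = 0; i < n; ++i) src[i] = i;

    int *buf0 = nullptr, *buf1 = nullptr;
    bool ok = true;
    cudaSetDevice(1);
    ok = ok && cudaMalloc(&buf1, bytes) == cudaSuccess;
    cudaSetDevice(0);
    ok = ok && cudaMalloc(&buf0, bytes) == cudaSuccess;

    // Ask whether GPU 0 can address GPU 1 directly and, if so, enable it.
    // Without peer access the runtime may stage the copy through host memory.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);

    ok = ok && cudaMemcpy(buf0, src.data(), bytes, toDevice) == cudaSuccess;
    ok = ok && cudaMemcpyPeer(buf1, 1, buf0, 0, bytes) == cudaSuccess;

    cudaSetDevice(1);
    ok = ok && cudaMemcpy(dst.data(), buf1, bytes, toHost) == cudaSuccess;

    // Free each buffer with its owning device current.
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);

    return ok && dst == src;
}

int main() {
    std::printf("GPU-to-GPU round trip %s\n",
                copy_between_gpus(1 << 20) ? "succeeded" : "failed or needs two GPUs");
    return 0;
}
```

On a single-GPU box the function simply reports failure rather than crashing, which makes it safe to try on a laptop before moving to a multi-GPU workstation.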
According to the Green Goblin, the open source C++ algorithms mean that "routines such as parallel sorting are 5X to 100X faster than with Standard Template Library and Threading Building Blocks".
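The announcement does not name the library in the text above, but the open source C++ parallel algorithms shipped with Cuda 4.0 are the Thrust template library, so we assume that is what is meant. A parallel sort then looks roughly like the sketch below. The wrapper name `sort_on_gpu` is our own invention, and the example makes no attempt to reproduce Nvidia's speed-up figures.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Hypothetical wrapper: copy to the GPU, sort there, copy the result back.
std::vector<int> sort_on_gpu(const std::vector<int>& input) {
    // The range constructor allocates device memory and uploads the data.
    thrust::device_vector<int> d(input.begin(), input.end());

    // Sorts in ascending order on the device.
    thrust::sort(d.begin(), d.end());

    // Download the sorted values into an ordinary host vector.
    std::vector<int> out(input.size());
    thrust::copy(d.begin(), d.end(), out.begin());
    return out;
}

int main() {
    std::vector<int> data = {5, 2, 9, 1, 7};
    std::vector<int> sorted = sort_on_gpu(data);
    for (int v : sorted) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}
```

The interface deliberately mirrors the Standard Template Library, so code written around `std::sort` needs little more than a change of container and namespace.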
As well as the open source C++ algorithms, Cuda 4.0 works with OpenMPI to automatically move data to and from GPU memory over Infiniband when an application makes an MPI send or receive call.
It also lets multiple CPU host threads share contexts on a single GPU and, according to Nvidia, lets a single CPU host thread access all GPUs in a system. µ
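As a rough illustration of a single host thread driving every GPU in a system, the sketch below walks over the devices with `cudaSetDevice`, launches a trivial kernel on each and then checks the results. The kernel `fill` and the helper `run_on_all_gpus` are hypothetical names of ours, and the block size of 256 is an arbitrary choice.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Trivial kernel: write the same value into every element.
__global__ void fill(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Hypothetical helper: one host thread launches work on every GPU,
// then returns how many GPUs produced the expected buffer.
int run_on_all_gpus(int n) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    std::vector<int*> bufs(count, nullptr);

    // Kernel launches are asynchronous, so this loop queues work on
    // each GPU without waiting for the previous one to finish.
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        if (cudaMalloc(&bufs[d], bytes) != cudaSuccess) { bufs[d] = nullptr; continue; }
        fill<<<(n + 255) / 256, 256>>>(bufs[d], n, d);
    }

    int good = 0;
    std::vector<int> host(n);
    for (int d = 0; d < count; ++d) {
        if (bufs[d] == nullptr) continue;
        cudaSetDevice(d);
        // A blocking cudaMemcpy waits for the kernel on this device first.
        bool ok = cudaMemcpy(host.data(), bufs[d], bytes,
                             cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (host[i] == d);
        if (ok) ++good;
        cudaFree(bufs[d]);
    }
    return good;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    std::printf("%d of %d GPUs returned correct data\n", run_on_all_gpus(1024), count);
    return 0;
}
```

Before this release the usual pattern was one host thread per GPU, so the loop above is the part that changes for multi-GPU applications.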