SAN JOSE: SOFTWARE TRAINER Acceleware said at Nvidia's GPU Technology Conference (GTC) today that most algorithms that run on GPGPUs are bound by GPU memory size.
Acceleware is partly funded by Nvidia to provide developer training for CUDA, to help sell the language to developers who are used to traditional C and C++ programming. The firm said that most CUDA algorithms are now limited by GPU local memory size rather than GPU computational performance.
Both AMD and Nvidia sell general purpose GPU (GPGPU) accelerator parts that offer significantly faster computational processing than traditional CPUs. However, they have only between 6GB and 8GB of local memory, which constrains the size of the dataset the GPU can process. While developers can push more data from system main memory, the latency cost negates the raw performance benefit of the GPU.
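To put rough numbers on that constraint, the following is a back-of-envelope sketch rather than anything shown at the talk. It estimates how many passes a dataset needs when it is larger than GPU local memory, and how long the host-to-GPU copies alone would take. The 6GB card size comes from the range quoted above; the 24GB dataset and the 8GB/s link bandwidth are hypothetical placeholders, and the function names are invented for illustration.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def chunks_needed(dataset_bytes: int, device_mem_bytes: int) -> int:
    """Number of passes needed to stream a dataset through GPU local memory."""
    # Integer ceiling division, so a partial final chunk still counts as a pass.
    return -(-dataset_bytes // device_mem_bytes)


def transfer_seconds(dataset_bytes: int, link_bytes_per_s: float) -> float:
    """Time spent only on copying the dataset from system memory to the GPU."""
    return dataset_bytes / link_bytes_per_s


if __name__ == "__main__":
    dataset = 24 * GIB      # hypothetical dataset size
    device_mem = 6 * GIB    # lower end of the 6GB to 8GB range quoted above
    link_bw = 8 * GIB       # hypothetical host-to-GPU bandwidth, bytes per second

    print("passes needed:", chunks_needed(dataset, device_mem))
    print("copy time (s):", transfer_seconds(dataset, link_bw))
```

With these placeholder figures the dataset takes four passes and three seconds of pure copying before any computation is counted, which is the overhead developers try to hide or avoid.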
Kelly Goss, training program manager at Acceleware, said that "most algorithms are memory bound rather than GPU bound" and "maximising memory usage is key" to optimising GPGPU performance.
She further said that developers need to understand and take advantage of the memory hierarchy of Nvidia's Kepler GPU and look at ways of reducing the number of memory accesses needed for each computation the GPU performs.
The point Goss was making is that GPU computation is cheap in terms of clock cycles compared with the time it takes to fetch data from local memory, let alone to load GPU memory from system main memory.
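One common way to reason about that trade-off is the roofline model, which is not something attributed to Goss here but a standard estimate: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by the number of operations performed per byte fetched. The sketch below applies it to a simple vector addition. The peak compute and bandwidth figures are hypothetical placeholders, not Kepler specifications, and the function names are made up for the example.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity).

    peak_flops     floating point operations per second the chip can issue
    mem_bandwidth  bytes per second delivered by GPU local memory
    intensity      floating point operations performed per byte moved
    """
    return min(peak_flops, mem_bandwidth * intensity)


def is_memory_bound(peak_flops: float, mem_bandwidth: float, intensity: float) -> bool:
    """True when memory traffic, not arithmetic, limits the kernel."""
    return mem_bandwidth * intensity < peak_flops


if __name__ == "__main__":
    peak = 1e12        # hypothetical: 1 TFLOP/s
    bandwidth = 2e11   # hypothetical: 200 GB/s

    # c[i] = a[i] + b[i] on 32-bit floats: one add per 12 bytes
    # (two 4-byte loads plus one 4-byte store).
    vector_add_intensity = 1 / 12

    print("memory bound:", is_memory_bound(peak, bandwidth, vector_add_intensity))
    print("fraction of peak:",
          attainable_flops(peak, bandwidth, vector_add_intensity) / peak)
```

Under these assumed figures the vector addition reaches under two percent of peak compute. Reusing each fetched value for more arithmetic, for instance by staging data in faster on-chip memory, raises the intensity and moves a kernel towards the compute limit.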
Goss, talking to a room full of developers, proceeded to outline some of the performance characteristics of the memory hierarchy in Nvidia's Kepler GPU architecture, showing the level of detail that CUDA programmers need to pay attention to if they want to extract the full performance potential from Nvidia's GPGPU computing architecture.
Given Goss's observation that algorithms running on Nvidia's GPGPUs are often constrained by local memory size rather than by the GPU itself, the firm might want to look at simplifying the tiers of memory involved and increasing the amount of GPU local memory so that CUDA software developers can process larger datasets. µ