Military Embedded Systems

GPUs and InfiniBand accelerate high-performance computing


April 01, 2011

Graphics Processing Units (GPUs) can significantly accelerate high-performance computing applications, but only if the network interconnect delivers the performance to support them.

The High Performance Computing (HPC) market’s continuing need for improved time-to-solution and for the ability to develop and run higher-fidelity simulations seems unquenchable, requiring ever-faster HPC clusters. This has led many HPC users to implement Graphics Processing Units (GPUs) in their clusters. While GPUs were once used solely for visualization or animation, today they serve as fully programmable, massively parallel processors, allowing computing tasks to be divided and concurrently processed on the GPU’s many processing cores. When multiple GPUs are integrated into an HPC cluster, the performance potential of the HPC cluster is greatly enhanced. This processing environment enables scientists and researchers to tackle some of the world’s most challenging computational problems.

To obtain the best results, HPC clusters with multiple GPUs require a high-performance interconnect such as InfiniBand to handle GPU-to-GPU communications. GPUs place significant demands on the interconnect, which must provide the low latency, high message rate, and bandwidth needed for all resources in the cluster to run at peak performance. This article examines how GPUs accelerate HPC performance and why a high-speed interconnect contributes to that performance.

The rise of GPU computing

A GPU is a coprocessor originally designed to offload graphics calculations from the main CPU in a computer system, speeding up graphics rendering in simulation and gaming applications. In the past few years, however, there has been a shift toward using GPUs for HPC. Today, three of the five fastest supercomputers on the Top500 list use GPUs to help achieve their high performance.

GPUs can deliver 10 to 100 times the performance of traditional x86-based CPUs alone. They also deliver greater performance per watt of power consumed. In fact, a GPU-based system from the National Center for Supercomputing Applications scored third on the Green500 list that ranks HPC systems by MFLOPS per watt, with 933.06 MFLOPS per watt.

As parallel processors, GPUs excel at tackling large amounts of similar data because the problem can be split into hundreds or thousands of pieces and calculated simultaneously. This capability delivers huge performance gains across a broad range of applications in biotechnology, seismic modeling, solids modeling, and other disciplines.

HPC clusters are incorporating 12 cores per node (Intel Westmere processors) and even 24 or 48 cores per node, and GPUs extend this paradigm with their integration of a large number of cores per GPU. To leverage the parallel processing power of GPUs, programmers modify portions of an application to take advantage of the GPU. Running a function on the GPU involves rewriting that function to expose its parallelism, then adding new function calls to indicate whether functions will run on the GPU or the CPU. NVIDIA simplifies the process of leveraging GPUs by enabling programming in many standard languages. One example of this is NVIDIA’s CUDA parallel computing architecture, which enables development in a variety of languages and APIs including C, C++, Fortran, OpenCL, and DirectCompute.
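The idea of rewriting a function to expose its parallelism can be sketched in a few lines. The example below is not CUDA code and does not run on a GPU; it is a CPU-only Python stand-in, and the function names and the multiply-add operation are hypothetical choices made for illustration. It shows a loop being recast as a per-element function whose result depends only on its own index, which is the property that lets a GPU assign one thread to each element.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_add_serial(a, x, y):
    """Original form: a single loop walks the data in order."""
    out = []
    for i in range(len(x)):
        out.append(a * x[i] + y[i])
    return out


def scale_add_element(a, x, y, i):
    """Rewritten form: the work for one index, independent of all others."""
    return a * x[i] + y[i]


def scale_add_parallel(a, x, y, workers=4):
    """Hand each index to a worker; map() returns results in index order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: scale_add_element(a, x, y, i), range(len(x))))


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(scale_add_serial(2.0, x, y))
print(scale_add_parallel(2.0, x, y))
```

Both calls print the same list. The thread pool here only illustrates the decomposition; on a GPU the per-element function would become a kernel and the pool would be replaced by thousands of hardware threads.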

The CUDA architecture contains hundreds of cores capable of running many thousands of parallel threads, while the CUDA programming model lets programmers focus on parallelizing their algorithms and not the mechanics of the language.

Additionally, the latest generation CUDA architecture, code-named “Fermi,” incorporates more than 3 billion transistors.

Why the interconnect matters

HPC applications designed to take advantage of parallel GPU performance require a high-performance interconnect, such as InfiniBand, to maximize that performance. A rule of thumb for sizing the interconnect bandwidth needed to service a processor or GPU is to apply a factor of 0.1 per GFLOPS of processing workload. Today’s NVIDIA Tesla 2050 GPU has a peak performance of approximately 500 GFLOPS, with an achieved performance of approximately 250 GFLOPS. This means that the GPU will require approximately 2.5 GB/s of bandwidth. This is why InfiniBand at QDR (40 Gbps) speeds has become a preferred interconnect for GPU-based clusters when compared with 1 and 10 GbE. Bandwidth, however, is just one factor that determines the performance of the GPU cluster.
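The comparison with 1 and 10 GbE can be checked with a little arithmetic. The sketch below takes the article's 2.5 GB figure as given, reads it as gigabytes per second (an assumption), and converts it to gigabits per second for comparison with nominal link rates. The QDR data rate of 32 Gbps assumes a 4x link with 8b/10b encoding; the Ethernet figures are nominal line rates, and protocol overhead is ignored throughout. Note that 2.5 GB/s for 250 GFLOPS works out to 0.01 GB/s per GFLOPS, so the sketch makes no assumption about the units of the 0.1 factor.

```python
# Figures quoted in the article.
ACHIEVED_GFLOPS = 250.0
STATED_BANDWIDTH_GBYTES_PER_S = 2.5  # assumption: the article's "2.5 GB" is per second

# Nominal link rates in gigabits per second (protocol overhead ignored).
QDR_SIGNALING_GBPS = 40.0
QDR_DATA_GBPS = QDR_SIGNALING_GBPS * 8 / 10  # 8b/10b encoding on a 4x QDR link
LINKS = {
    "1 GbE": 1.0,
    "10 GbE": 10.0,
    "QDR InfiniBand": QDR_DATA_GBPS,
}


def to_gbps(gbytes_per_s):
    """Convert gigabytes per second to gigabits per second."""
    return gbytes_per_s * 8


def links_that_fit(required_gbps):
    """Return the links whose nominal data rate meets the requirement."""
    return [name for name, rate in LINKS.items() if rate >= required_gbps]


required = to_gbps(STATED_BANDWIDTH_GBYTES_PER_S)
implied_factor = STATED_BANDWIDTH_GBYTES_PER_S / ACHIEVED_GFLOPS

print(f"Required: {required} Gbps")
print(f"Implied GB/s per GFLOPS: {implied_factor}")
print(f"Links that fit: {links_that_fit(required)}")
```

Under these assumptions a single GPU needs about 20 Gbps, which exceeds what 1 or 10 GbE can carry but fits within the QDR data rate.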

The following factors will also have an impact on GPU cluster performance:

·       Scalable non-coalesced message rate performance

·       Extremely low latency for collective operations

·       Consistently low latency, even at scale



These factors are heavily influenced by the architecture of the InfiniBand interconnect. Specifically, using an InfiniBand implementation that is designed from the ground up for HPC will further maximize the overall performance of the GPU cluster. Some test results illustrate this requirement.

The following tests show the difference between an InfiniBand implementation designed for HPC and one that was not. The tests were performed on NVIDIA Tesla 2050s interconnected with QLogic TrueScale QDR InfiniBand at QLogic’s NETtrack Developer Center. The Tesla 2050 results for the industry’s other leading InfiniBand implementation are from the published results on the NVIDIA AMBER 11 benchmark site.

The first test indicates the performance of the GPU cluster without special code to optimize communications between the GPUs (Figure 1). One of the key challenges in deploying clusters of multi-GPU nodes is maximizing application performance. Without special code like a version of GPUDirect, GPU-to-GPU communications would require the host CPU to make multiple memory copies to avoid a memory pinning conflict between the GPU and InfiniBand. Each additional CPU memory copy reduces the performance potential of the GPUs by 30 to 40 percent.
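The extra host copies described above can be pictured with a toy model. The sketch below does not use the CUDA or InfiniBand verbs APIs; the function names are hypothetical and plain Python byte buffers stand in for GPU memory and pinned host memory. It assumes the simplest case, in which the GPU driver and the InfiniBand driver each pin their own host buffer, so the CPU must make one extra copy between them, and compares that with a path in which both drivers share a single pinned buffer.

```python
def send_with_staging(gpu_buffer):
    """Separate pinned buffers: the CPU copies between them before the send."""
    gpu_pinned = bytearray(gpu_buffer)  # GPU memory -> host buffer pinned by the GPU driver
    ib_pinned = bytearray(gpu_pinned)   # extra CPU copy into the buffer pinned by the IB driver
    extra_host_copies = 1
    return bytes(ib_pinned), extra_host_copies


def send_with_shared_buffer(gpu_buffer):
    """Shared pinned buffer: the adapter reads the data where the GPU left it."""
    shared_pinned = bytearray(gpu_buffer)  # GPU memory -> host buffer pinned by both drivers
    extra_host_copies = 0
    return bytes(shared_pinned), extra_host_copies


payload = b"halo-exchange-data"  # hypothetical message contents
staged, staged_copies = send_with_staging(payload)
shared, shared_copies = send_with_shared_buffer(payload)
print(staged == shared, staged_copies, shared_copies)
```

Both paths deliver the same bytes; the difference is the CPU work spent moving them, which is the overhead the first test measures.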


Figure 1: GPU cluster performance without GPU code

The next test indicates the difference that the type of InfiniBand can make to the performance of the GPU cluster (Figure 2). As the graph indicates, the more the GPU cluster is scaled, the greater the performance difference between the two versions of InfiniBand.


Figure 2: Performance of different InfiniBand types

The final test indicates the potential GPU cluster performance at significant scale (Figure 3). There are three major models in the AMBER test suite: DHFR, which simulates 23,000 atoms; FactorIX, which simulates 90,000 atoms; and Cellulose, which simulates 400,000 atoms. The DHFR test is the most strenuous on the interconnect at the scale that was tested, because a simulation this small requires proportionately less GPU processing but significantly greater interconnect workload. This ratio is a good indication of how a larger model would perform on a larger GPU cluster. In this case, there is a 6 percent difference in the number of nanoseconds per day that can be simulated with the GPU-optimized InfiniBand interconnect.


Figure 3: GPU cluster performance at scale

GPUs mingled with InfiniBand for HPC

GPUs are making a significant impact on performance in HPC clusters, adding massively parallel processing capabilities to x86-based servers. When paired with a high-speed interconnect like InfiniBand, GPUs accelerate a broad range of applications. The performance of the interconnect is key, however, and has a significant impact on the performance of GPU-based clusters. An InfiniBand interconnect designed and architected for the HPC marketplace will maximize the performance of the GPU-based cluster.

Joe Yaworski is director of QLogic’s Global Alliance and Solution Marketing. Within his Global Alliance responsibilities, he manages QLogic’s strategic partnerships and alliances in the High Performance Computing market space. His Solution Marketing role is to help channel and alliance partners create solution marketing combined with QLogic’s HPC technologies. Joe also directs the QLogic NETtrack Developer Center, which tests and certifies partner applications and completes performance benchmarking.








