If power is the problem, can GPGPU be the solution?

December 16, 2011

Peter Thompson

Abaco Systems

The requirement for maximum processing performance has driven the growing use of GPGPU technology in a range of mil/aero applications. However, future developments in the GPU space look as if they will also address the requirement for optimum performance/watt.

General-Purpose computing on Graphics Processing Units (GPGPU) has been around for several years in the commercial domain, and has been deployed in the rugged mil/aero space since 2009. Engineers are comfortable with recasting compute problems to fit the programming paradigm, and fully rugged, conduction-cooled platforms are mature and in the field.

GPUs have already found a home with those pushing the envelope on performance and Size, Weight, and Power (SWaP). They have been deployed effectively in radar processing systems, shrinking the processing footprint for modes like Synthetic Aperture Radar (SAR) imaging and Ground Moving Target Indicator (GMTI) applications, as well as in Signals Intelligence (SIGINT) systems. Beyond this, they are naturally very effective in high-bandwidth imaging applications, such as wide-area surveillance, where hundreds of cameras are mosaicked into one and processed. Hyperspectral imaging also is a prime GPGPU application, because of the greatly increased bandwidth from sensing hundreds or thousands of bands and the dense operations for classifying hyperspectral data.

The next wave of GPGPU devices is about to hit the marketplace, and these latest devices show a new focus on power consumption. Just as with “traditional” processors, the emphasis has increasingly shifted toward performance per watt in order to enable new classes of application. Whether it is measured as GFLOPS per watt or picoJoules per instruction, and whether the technology is being applied to battery-powered night vision goggles or to DARPA-hard Exascale Grand Problems, the amount of power consumed and the amount of heat to be rejected are increasingly important. A shift in emphasis from GPU vendors reflects this and will enable new applications at both ends of the spectrum of system size.

GPU performance/watt is the new focus

When Jen-Hsun Huang gave the keynote speech at NVIDIA’s GPU Technology Conference in 2010, one thing of particular interest was that the GPU road map he showed depicted not performance, but rather, for the first time, performance per watt (Figure 1).

Figure 1: NVIDIA’s CUDA road map shows increasing performance per watt

This is a significant acknowledgement from a market segment that until now has been notoriously insensitive to power. One of the driving forces at play is the widespread adoption of GPGPU in Petascale High Performance Computing. As an example, the fastest supercomputer in the world today uses more than 7,000 GPUs to yield more than 2 PetaFLOPS (one PetaFLOPS being 10^15 floating-point operations per second), but at the expense of around 4 MegaWatts of power consumption.

NVIDIA is a member of one of four consortia awarded contracts by DARPA to study architectures that can reach Exascale: a thousandfold increase in performance, to 10^18 floating-point operations per second. One of the challenges is how to achieve this without an accompanying increase in power consumption of three orders of magnitude. It is becoming increasingly apparent that the mil/aero embedded computing segment will benefit from developments in this arena.

When GE Intelligent Platforms released a range of products based on NVIDIA’s GT240 GPU, such as the IPN250 (Figure 2), this represented a significant sweet spot on the performance per watt curve for rugged deployment. Even the next generation of GPUs, the Fermi family, struggled to match the GT240 for efficiency. The versions that boasted more performance typically came at a significant power penalty. For example, the GT240 peaks at 385 GFLOPS with a maximum Thermal Design Power (TDP) of 45 W, which equates to 8.6 GFLOPS per watt. The Fermi-class GF104 peaks at 590 GFLOPS with a maximum TDP of 75 W, or 7.9 GFLOPS per watt. Note that here, only Single Precision (SP) floating-point operation is considered, as the GT240 does not support double precision, and embedded computing tends to be dominated by SP operation.
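The efficiency comparison above is simple division, and a short sketch makes it easy to repeat for other devices. The peak GFLOPS and TDP figures are the ones quoted in the text; the function and variable names are illustrative only.

```python
def gflops_per_watt(peak_gflops: float, tdp_watts: float) -> float:
    """Peak single-precision throughput divided by maximum TDP."""
    return peak_gflops / tdp_watts


# (peak SP GFLOPS, maximum TDP in watts), as quoted in the article
devices = {
    "GT240": (385.0, 45.0),
    "GF104": (590.0, 75.0),
}

efficiency = {
    name: gflops_per_watt(gflops, watts)
    for name, (gflops, watts) in devices.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.1f} GFLOPS per watt")
```

Both inputs are peak values, so the result is an upper bound on efficiency rather than a measurement of what a real workload would achieve.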

Figure 2: GE’s IPN250 combines a single GT240 96-core CUDA GPU with an Intel Core 2 Duo host processor operating at 2.26 GHz and 8 GB of DDR3 SDRAM to deliver up to 390 GFLOPS of performance per card slot.

There are still reasons to consider Fermi devices beyond the GFLOPS per watt metric. Substantial architectural changes include:

Improved double precision support and performance for applications that require it
Error correcting code memory to improve reliability
Increased shared memory size
True cache hierarchy to ease programming complexity
Faster context switching
Improved scheduling
Increased autonomy from CPU

Road map shows fivefold improvement

NVIDIA has announced plans to ship “Kepler” products starting early in 2012. Few details on Kepler have been made public other than the predicted increase in Double Precision (DP) efficiency and how that will come from a combination of architectural changes and a smaller process geometry.

Given that Kepler offers something on the order of a fivefold increase in DP GFLOPS per watt, how will we get to ExaFLOP performance without requiring a dedicated power plant? Public presentations from senior staff at NVIDIA give some insights.

Firstly, they posit that architectures with explicit management of on-chip memory can be more efficient than their counterparts that employ hardware-driven cache hierarchies. The argument goes that distance = power when it comes to fetching operands, and that the cost of the fetch exceeds the cost of the actual mathematical operation. Even on a 28 nm process device, the energy taken to fetch an operand can rise from tens of picoJoules, when it sits close to the gates that will perform the operation, to hundreds of picoJoules when it must come from the other edge of the die. Having to reach off-chip to DRAM can cost tens of nanoJoules (in addition to a huge increase in access latency). This gives credence to the adage that the FLOPS are almost free; you pay to move the data. At Exascale, even small differences in the energy it takes to fetch an operand can have a huge impact on the system power consumption.
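To see why small per-operand differences matter at Exascale, multiply the energy per fetch by the number of fetches per second. The sketch below does that under two loudly stated assumptions: one operand fetch per floating-point operation (a simplification), and single illustrative energy values picked from within the order-of-magnitude ranges given above. Neither is a measured figure.

```python
EXA_OPS_PER_SECOND = 1e18   # 10^18 operations per second
JOULES_PER_PICOJOULE = 1e-12


def fetch_power_megawatts(energy_per_fetch_pj: float,
                          fetches_per_second: float = EXA_OPS_PER_SECOND) -> float:
    """Power spent purely on operand fetches, in megawatts.

    Assumes every fetch costs the same energy and that the machine
    sustains `fetches_per_second` continuously.
    """
    watts = energy_per_fetch_pj * JOULES_PER_PICOJOULE * fetches_per_second
    return watts / 1e6


# Illustrative values only, chosen from the ranges quoted in the text.
scenarios_pj = {
    "operand near the gates (assumed 10 pJ)": 10.0,
    "operand across the die (assumed 100 pJ)": 100.0,
    "operand in off-chip DRAM (assumed 10 nJ)": 10_000.0,
}

for label, energy_pj in scenarios_pj.items():
    print(f"{label}: {fetch_power_megawatts(energy_pj):,.0f} MW")
```

At 10^18 fetches per second, each picoJoule per fetch corresponds to roughly one MegaWatt, which is why keeping operands close to the arithmetic units dominates the power budget.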

Secondly, NVIDIA is increasing the autonomy of the GPU and independence from the x86 CPU by integrating ARM cores into the GPU itself. This will add significant flexibility to the GPU, with new options for I/O, fused address spaces, self-scheduling, and lower latency overall. To see the effects of this, we need to look no farther than NVIDIA’s Tegra devices, because that’s where this technology is being developed. The current Tegra 3 SoC (codenamed “Kal-El”) utilizes a quad-core ARM Cortex-A9 and a fifth ultra-low-power “Companion” core. Together, these ARM cores provide a dynamic range of serial compute performance, while more broadly supporting the high-performance GPU cores. While Exascale computing is a dot on the horizon, and even then will be limited to huge supercomputer installations, it can be expected that the necessary advances in power efficiency will scale down to benefit embedded systems as well.

It is currently estimated that a GPU-based ExaFLOP machine will require 5,120 nodes and consume 15 MegaWatts. This equates to around 66 DP GFLOPS per watt. Contrast that with today’s 2 DP GFLOPS per watt, and this starts to indicate that NVIDIA does indeed have a path to Exascale that matches the Fermi/Kepler/Maxwell curve on the road map.
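The arithmetic behind that estimate can be checked in a few lines. The node count, power budget, and present-day efficiency are the figures quoted above; the per-node numbers are derived here by simple division and are not stated in the source.

```python
EXAFLOP = 1e18            # floating-point operations per second
SYSTEM_POWER_W = 15e6     # 15 MegaWatts, as quoted
NODES = 5120              # as quoted
TODAY_DP_GFLOPS_PER_W = 2.0

# Required system-level efficiency in DP GFLOPS per watt
target_gflops_per_w = EXAFLOP / SYSTEM_POWER_W / 1e9

# How far today's devices are from that target
improvement_needed = target_gflops_per_w / TODAY_DP_GFLOPS_PER_W

# Derived per-node budget (not from the source, just division)
watts_per_node = SYSTEM_POWER_W / NODES
tflops_per_node = EXAFLOP / NODES / 1e12

print(f"Target efficiency: {target_gflops_per_w:.1f} DP GFLOPS per watt")
print(f"Improvement over today: {improvement_needed:.1f}x")
print(f"Per node: {watts_per_node:.0f} W, {tflops_per_node:.1f} DP TFLOPS")
```

The result is a roughly 33-fold efficiency gap, so a fivefold gain from one generation has to be compounded across several generations to close it.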

New techniques help with heat dissipation

Even if the promised performance per watt improvement from Fermi to Kepler holds true, it is possible that, with the process shrink from 40 nm to 28 nm that is widely expected, thermal density may increase. As a die shrinks, the surface area from which waste heat is to be extracted reduces as the square of the reduction in feature size. This makes the extraction of the heat increasingly difficult.
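A rough sketch of that scaling follows, assuming ideal geometric shrink: the same design ported from 40 nm to 28 nm, with die area scaling as the square of the feature size. Real devices rarely shrink this cleanly, and the 0.7 power ratio in the second case is a made-up illustrative value, not a Kepler figure.

```python
def area_ratio(old_nm: float, new_nm: float) -> float:
    """Die area after the shrink relative to before, under ideal scaling."""
    return (new_nm / old_nm) ** 2


def power_density_factor(old_nm: float, new_nm: float,
                         power_ratio: float = 1.0) -> float:
    """Change in watts per unit area.

    power_ratio is new power divided by old power for the same design;
    1.0 means the shrink saved no power at all.
    """
    return power_ratio / area_ratio(old_nm, new_nm)


shrink = area_ratio(40.0, 28.0)
same_power = power_density_factor(40.0, 28.0)
reduced_power = power_density_factor(40.0, 28.0, power_ratio=0.7)  # hypothetical

print(f"Area after shrink: {shrink:.2f} of original")
print(f"Power density at equal power: {same_power:.2f}x")
print(f"Power density at 0.7x power (assumed): {reduced_power:.2f}x")
```

Under these assumptions the area roughly halves, so power would also have to fall by about half just to keep thermal density where it was.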

However, there are several innovations that are starting to make their way from the laboratory to the field that will help to alleviate this issue. Thermal ground planes take the technology from heat pipes that are commonly applied to commercial electronics, and, by using innovative designs and constructions, allow it to be applied in environments with high shock and vibration levels – even negative-g forces. By applying such devices to conduction-cooled boards, the thermal path from die to heat frame can be much improved.

The next link in the path that heat must traverse before ultimately being dispersed to the environment is the wedgelock. This interfaces the heat frame to the chassis and forms the conducting interface by expanding when the screw mechanism inside is torqued. By reworking the construction of the wedgelock with new materials, the efficiency of the heat transfer can be increased.

Improving the heat path means that devices with higher power dissipation than could previously be handled may now be used. Just as important, maintaining a lower maximum die temperature improves the long-term reliability of the device.

New applications enabled

Early adopters of GPGPU technology in the mil/aero space focused on what it could do in terms of raw performance. As its potential becomes better understood, however, designers are looking to broaden the range of applications in which it can be deployed. This, in turn, is putting the spotlight on the performance per watt characteristics of GPUs. The good news is that manufacturers like NVIDIA are acknowledging this, and developing future generations of products that deliver not only optimum performance, but also significantly improved performance per watt.

Peter Thompson is Director of Applications, Embedded Systems, at GE Intelligent Platforms. With an honors degree in Electrical and Electronic Engineering from the UK’s University of Birmingham, Peter has worked for more than 30 years in embedded computing, joining Radstone – subsequently acquired by GE – in 2005. He can be contacted at peter.thompson@ge.com.

GE Intelligent Platforms www.ge-ip.com