Intel Architecture enables digital signal processing

February 17, 2011

Historically, system designers have tended not to consider Intel's offering for DSP applications. With the advent of the latest architectural innovations, that seems set to change.

Military and aerospace applications have an insatiable need for processing power, but often have hard power limits. Many viable technologies exist, including DSP chips, FPGAs, and others, but in recent years developers have become accustomed to the ease of programming that comes with general-purpose processors, typically supplemented by integrated vector processing engines. Now, with a number of architectural features introduced in the latest Intel Core i7 processor products, a convergence of ubiquitous availability, ease of programming, and raw GFLOPS performance makes a compelling case for considering Intel architecture chips for the most challenging signal- and image-processing requirements that mil/aero has to offer.

Considering the road map

Traditionally, the x86 family was not particularly adept at processing streams of Floating Point (FP) data. In fact, the original Intel 8086 and its early successors had no inherent FP capability. That was left to an add-on dedicated Intel 8087 FP chip or to slow software emulation. By the time the Intel Pentium processor era arrived, things had changed somewhat with the introduction of the MultiMedia eXtension (MMX technology) instructions and execution unit. This was primarily targeted at encoding and decoding various audio and video streams, but with a little ingenuity could be used for other signal-processing tasks, albeit limited to integer operations. Over time, MMX technology begat Intel Streaming SIMD Extensions (Intel SSE), which extended the instruction set to be ever more DSP-friendly, including floating-point support.

Along the way, a non-x86 device – the Intel i860 – emerged. This was designed as a graphics device, but an embedded industry that was starved for performance discovered that it was actually pretty darn good at DSP. Unfortunately, it was a niche product for Intel when compared with the mass markets that the x86 line served, and eventually, after a couple of generations, it fell by the wayside.

Fast-forward to late 2010/early 2011, and Intel is starting to ship the second generation of the Intel Core i7 processor line, which implements a microarchitecture codenamed “Sandy Bridge.” Once again, the SIMD floating-point unit has been updated – to Intel Advanced Vector Extensions (Intel AVX) (see Sidebar 1). New architectural features and migration tools make these devices worthy of consideration for DSP applications.

Sidebar 1: Intel Advanced Vector Extensions (Intel AVX) update the SIMD floating-point unit.

New architectural features boost DSP performance

The most apparent benefit of Intel AVX is a doubled vector-pipeline width. The execution unit and associated register file are both now 256 bits wide, increased from the 128 bits of previous generations. This alone can account for nearly a doubling of floating-point vector performance. The execution unit uses Single Instruction Multiple Data (SIMD) operation to increase throughput beyond the effective wall reached by simply increasing clock rates and reducing die geometries. Clock rates above 2-3 GHz see diminishing returns for the power consumed, and as geometries migrate to 32 nm and smaller, inefficiencies resulting from leakage become more significant. A 256-bit vector pipeline can execute eight single-precision (32-bit) FP operations concurrently (the same instruction, but on eight different sets of data points), compared with four on SSE implementations. In some cases, two instructions can be executed per cycle, such as when doing a multiply and an add at the same time, thus allowing 16 operations concurrently (versus eight with SSE). In reality, each operation takes multiple cycles, but with pipelining, once a startup penalty has been paid, results are available on every cycle. This SIMD operation maps naturally to many DSP algorithms, as they tend to feature the required data parallelism.

There will be members of the new Intel Core i7 processor family with two and four cores featuring the Sandy Bridge architecture. Each core has its own Intel AVX unit. This means that a quad-core version can potentially execute 64 single-precision FP operations every clock cycle (eight lanes, times two instructions per cycle, times four cores). Contrast this with the 16 operations per cycle on first-generation, dual-core i7 devices.
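The per-cycle figures above follow from simple arithmetic, which can be sketched in C. This is only an illustration: the function names are hypothetical, the two-instructions-per-cycle factor is the best case described in the text (a multiply and an add issued together), and the 2.0 GHz clock mentioned in the note below is an assumed value, not a quoted part specification.

```c
#include <assert.h>

/* Hypothetical helper: peak single-precision FP operations per clock cycle.
 * vector_bits     - SIMD register width (128 for SSE, 256 for Intel AVX)
 * instr_per_cycle - vector instructions issued per cycle (2 in the best case
 *                   of a multiply and an add executing together)
 * cores           - number of cores, each with its own vector unit */
unsigned peak_sp_ops_per_cycle(unsigned vector_bits,
                               unsigned instr_per_cycle,
                               unsigned cores)
{
    assert(vector_bits % 32u == 0u);      /* whole number of 32-bit lanes */
    unsigned lanes = vector_bits / 32u;   /* 8 lanes for 256 bits, 4 for 128 */
    return lanes * instr_per_cycle * cores;
}

/* Hypothetical helper: theoretical peak GFLOPS at a given clock rate in GHz.
 * Real code is normally limited by memory bandwidth and pipeline stalls. */
double peak_gflops(unsigned ops_per_cycle, double clock_ghz)
{
    return (double)ops_per_cycle * clock_ghz;
}
```

Calling `peak_sp_ops_per_cycle(256, 2, 4)` gives the 64 operations per cycle cited for a quad-core Sandy Bridge part, and `peak_sp_ops_per_cycle(128, 2, 2)` gives the 16 cited for a dual-core SSE part. At an assumed 2.0 GHz, the former would correspond to a theoretical peak of 128 GFLOPS.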

New instructions such as broadcasts and masked loads and stores enable better utilization of the available FLOPS. Changes to the memory unit allow for two read requests and one write of 16 bytes each in one cycle. This is a key feature in keeping the execution unit fed with data and avoiding the pipeline stalls that can severely impact performance. These features serve to close the efficiency gap that existed between AltiVec and SSE in some DSP algorithms that required data reorganization for efficiency. Figure 1 shows some of these features.

Figure 1: The microarchitecture of Intel’s Sandy Bridge lends itself well to DSP applications.

Beyond raw performance, key elements in deploying these processors in mil/aero signal- and image-processing applications include:

The availability of BGA devices, which allow the components to be soldered down, rather than using socketed devices that can fail under high levels of shock and vibration
Intel’s seven-year life cycle for parts on the embedded platform road map

Harnessing DSP application performance

So what does all this mean for performance? In an attempt to answer that question, several DSP applications have been run on previous-generation i7-class devices codenamed “Arrandale” and on Sandy Bridge. When comparing performance using a single core, with both classes operating at the same clock rate, an increase in performance of approximately 2x has been demonstrated with Sandy Bridge. This reflects both the theoretical speedup that would be expected from doubling the SIMD unit width and the balanced memory access needed to achieve that increase.

All this performance is great, but how can it be exploited for application domains such as radar, SIGINT, ELINT, and so on? At the lowest level and highest complexity is programming the Intel AVX unit using primitives that can be called from C or other high-level languages. While no more complex than the assembly code programming that many programmers cut their teeth on in the past, getting good performance at this level is not a trivial task. Many factors must be understood and accounted for when coding to avoid pipeline stalls and resource contention.
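To give a flavor of this lowest level, the following is a minimal sketch, assuming that the "primitives" in question are the compiler intrinsics declared in `<immintrin.h>`. The function `scale_add` is a hypothetical example rather than code from any benchmark mentioned here. The intrinsic path is compiled only when the compiler defines `__AVX__` (for example with `-mavx` on GCC or Clang); otherwise the plain loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>   /* Intel AVX intrinsics */
#endif

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n). */
void scale_add(float a, const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
#ifdef __AVX__
    /* Replicate the scalar into all eight 32-bit lanes of a 256-bit register. */
    const __m256 va = _mm256_set1_ps(a);
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* unaligned load of 8 floats */
        __m256 vy = _mm256_loadu_ps(y + i);
        /* Separate multiply and add instructions, eight lanes each. */
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
        _mm256_storeu_ps(y + i, vy);
    }
#endif
    /* Scalar loop: handles the tail, or the whole array without AVX. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Even this small kernel leaves tuning questions open, such as data alignment, loop unrolling to hide instruction latency, and cache blocking, which is where most of the effort at this level tends to go.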

Compilers offer some help. Several already have Intel AVX support, coupled with varying degrees of automatic vectorization. Source code is analyzed, and where possible, for-loops are mapped to Intel AVX SIMD operations. This can be a help with dusty-deck code, but in reality there are many impediments that can preclude effectiveness, as the compiler must always err on the side of caution. Where ambiguity exists over loop iterators, for instance, the compiler cannot make assumptions that it cannot verify, so it will always generate lower-performance, guaranteed-correct code.
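One well-known example of such an impediment, chosen here for illustration and not named in the text above, is pointer aliasing. In the first function below the compiler cannot prove that `out` does not overlap `a` or `b`, so it may add runtime overlap checks or fall back to scalar code. The C99 `restrict` qualifier in the second function is a promise from the programmer that the arrays do not overlap. Both function names are hypothetical, and whether a loop is actually vectorized depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Without restrict: out might alias a or b as far as the compiler knows,
 * so it must generate code that is correct even if the arrays overlap. */
void vmul(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}

/* With restrict: the caller guarantees the three arrays do not overlap,
 * which removes one obstacle to automatic vectorization of the loop.
 * Passing overlapping arrays here is undefined behavior. */
void vmul_restrict(const float *restrict a, const float *restrict b,
                   float *restrict out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}
```

Both functions compute the same result for non-overlapping inputs; the difference is only in what the compiler is allowed to assume. Many compilers can also report which loops were vectorized, which helps when checking whether such a hint made a difference.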

Math libraries offer a good alternative. Intel Integrated Performance Primitives (Intel IPP) and Intel Math Kernel Library (Intel MKL) are highly tuned for Intel AVX. Algorithm coverage is broad, and performance is hard to beat. However, in some eyes, they suffer from being seen as proprietary Application Programming Interfaces (APIs). Also, support for operating systems tends to stick with the mainstream of Windows and Linux. Support for real-time operating systems common in the embedded world is currently lacking, for the most part.

As an alternative, many programs turn to more open standard libraries, such as the Vector Signal and Image Processing Library (VSIPL) sponsored by DARPA as a cross-platform, cross-vendor API, and VSIPL++, its C++ sibling. These libraries can help isolate applications from underlying architectures. For example, many applications written for the AltiVec SIMD unit on Freescale devices can be migrated to Intel AVX on Sandy Bridge with a recompile. Under the covers, VSIPL may be implemented with handcrafted Intel AVX code, or may take advantage of Intel IPP/Intel MKL, or may be a mixture of both. In any case, the end result is highly tuned performance at the application level without coders having to understand the low-level nuances of the chip. They can instead focus on their domain of knowledge.

Figure 2: GE’s SBC622 is a 6U OpenVPX single board computer featuring the Core i7 processor.

i7 today and tomorrow

Since the introduction of SSE, Intel processors have been used for signal- and image-processing applications. With the introduction of AVX, this adoption seems set to accelerate. The benefits that this architecture brings to raw performance and to GFLOPS/Watt metrics are immediately apparent. Additionally, given Intel’s track record of die-shrinking new architectures (tick-tock development), these gains can be expected to continue. GE Intelligent Platforms, a member of the Intel Embedded Alliance, has many Intel Core i7 products shipping today (Figure 2) and will be introducing new products as soon as the new Sandy Bridge architecture processors are released.

Peter Thompson is Director of Applications, GE Intelligent Platforms. He can be contacted at peter.thompson@ge.com.

GE Intelligent Platforms 978-437-1477 www.ge-ip.com

Peter Carlston is Platform Architect, Embedded Computing Division, at Intel Corporation.

Intel Corporation www.intel.com