Enhancing application performance on multicore systems

February 17, 2011

John Blevins

Lynx

In a simple view of the world, an application running on a multicore system should run at least as fast as the same application on a single-core system with the same CPU power. Unfortunately, in practice, that is not always the case. However, by implementing application parallelism and using a symmetric multiprocessing (SMP) OS and an embedded hypervisor, developers can meet the challenge and realize dramatic performance improvements.

Let’s face it. The Multicore Era is upon us. How did we get here? For years, processor manufacturers kept pace with Moore’s Law, which predicts a doubling of transistor counts every couple of years, and turned those transistors into performance by increasing clock rates and instruction-level parallelism. We saw clock rates go from 1 GHz in the year 2000 to 2 GHz in 2001 and finally to 3 GHz in 2002, where we seem to have hit a clock-speed brick wall.

The combination of increasing power demands and rising chip temperatures seems to have put the brakes on the clock speed race. One of the surest signs was Apple’s switch from the PowerPC to the Intel Architecture. Promises of 3 GHz PowerPC G5 processors in Apple laptops never materialized due to power and heat problems. Processor manufacturers quickly realized that to keep doubling performance, they needed a new trick. That new trick was to add multiple cores to a chip. Now Mac computers with 12 cores at 2.93 GHz exist, and coincidentally, dual- and quad-core single board computers and systems are very prevalent in the military embedded realm today.

However, if the software running in the system is not optimized for multicore, performance can degrade when migrating from a single-core system. The higher transistor density of a multicore CPU does not generally translate into higher speed for applications that are not parallelized. SMP OS and embedded hypervisor technologies can also help achieve optimum performance in a single-core to multicore migration. But first we will examine the issues of synchronization, concurrency, and scheduling.

Synchronization and concurrency overhead kill performance

A real-time embedded SMP operating system must maintain deterministic scheduling and respond rapidly to interrupts and high-priority tasks. This job becomes substantially more difficult and time consuming on a multicore CPU. For example, when multiple cores are active, more than one of them may access the same data simultaneously, adding a new level of concurrency issues for the OS to deal with and requiring additional mechanisms for concurrency control. As a result, locking and synchronization mechanisms are more complex and take more CPU time than in single-core systems.

Generally, on a single-core system, a critical section of code can be protected by simply disabling preemption, doing the necessary work, and re-enabling preemption. Enabling and disabling preemption can be as simple as storing a value in a variable. In a multicore system, each core handles its own interrupts and runs its own thread, so disabling preemption on one core is not sufficient: code on the other cores could still execute the critical section. Instead, some form of locking needs to be introduced. In LynxOS, a deterministic hard real-time OS from LynuxWorks, this is done with Kernel Spinlocks and requires the use of special hardware mechanisms such as locked bus cycles. These mechanisms are considerably more complex than the simple memory accesses a single-core system uses, and this complexity adds to overall performance degradation.
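The busy-wait idea behind a spinlock can be sketched in a few lines. This is only an illustration of the pattern, not the LynxOS Kernel Spinlock implementation: the class name SpinLock and the helper run_workers are hypothetical, and a non-blocking try-lock from Python's threading module stands in for the atomic test-and-set that real hardware provides through locked bus cycles.

```python
import threading
import time


class SpinLock:
    """Busy-wait lock; a non-blocking try-lock stands in for atomic test-and-set."""

    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # Keep retrying until the try-lock succeeds, rather than blocking.
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # yield so the current holder can make progress

    def release(self):
        self._flag.release()


def run_workers(num_threads=4, increments=1000):
    """Increment a shared counter from several threads under the spinlock."""
    lock = SpinLock()
    state = {"counter": 0}

    def worker():
        for _ in range(increments):
            lock.acquire()
            try:
                state["counter"] += 1  # the critical section
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```

Every acquire here costs at least one attempt on shared state, whereas the single-core approach only stores a value in a variable. That difference is the extra overhead described above.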

In addition, when code on one core is in a critical section, code on other cores is blocked waiting for the code to finish the critical section. If the locks are coarse-grained, it is possible that several cores could be idle because they are unable to schedule any useful work.

The synchronization and concurrency overhead incurred on multicore systems is most visible at the operating system software level, but is also apparent in multi-threaded applications that rely heavily on constructs like condition variables, semaphores, and message queues. Operating systems compiled to support multiple cores are typically about 10 percent slower on a single-core system than the same operating system compiled to support a single core.

Scheduling threads across multiple cores

Scheduling algorithms play a key part in harnessing the power of multiple cores and can cause performance issues if not implemented carefully. Typical scheduling algorithms maintain a per-CPU queue of threads that are ready to run and allocate CPU time based on this queue. In a real-time system, however, it is critical to preserve real-time determinism, so the scheduling approach is different: scheduling happens on a global basis, where the highest-priority thread runs on the first available CPU. Because a thread may then resume on a core whose cache no longer holds its data, this approach can lead to more cache misses, which can be addressed with design optimizations in real-time thread scheduling.
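As a rough illustration of global priority scheduling, the sketch below keeps one system-wide ready queue and hands the highest-priority threads to whichever cores are free. The function name dispatch, the thread names, and the convention that a larger number means higher priority are assumptions made for this example. It does not model preemption or any particular RTOS scheduler.

```python
import heapq


def dispatch(ready_threads, idle_cores):
    """Assign the highest-priority ready threads to idle cores.

    ready_threads: list of (priority, name); larger number = higher priority
                   (an assumption of this sketch).
    idle_cores:    list of core ids, in the order they became available.
    Returns a dict {core_id: thread_name}.
    """
    # heapq is a min-heap, so negate the priority to pop the highest first.
    # The index keeps the order stable for threads of equal priority.
    heap = [(-prio, i, name) for i, (prio, name) in enumerate(ready_threads)]
    heapq.heapify(heap)

    assignment = {}
    for core in idle_cores:
        if not heap:
            break  # more idle cores than ready threads
        _, _, name = heapq.heappop(heap)
        assignment[core] = name
    return assignment
```

With ready threads [(10, "logger"), (90, "radar"), (50, "nav")] and idle cores [1, 0], the "radar" thread goes to core 1 and "nav" to core 0, regardless of where either thread ran last. That is how the cache-miss penalty arises.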

One such design optimization, known as processor affinity, allows applications to request an “affinity” to a processor core. In this case, the operating system schedules the applications on the preferred processor core, as long as it does not affect overall system scheduling. A more rigid form of processor affinity is processor binding, where the task is always scheduled on the same processor core. However, in an RTOS this approach may lead to priority inversions, for example when a high-priority task waits for its bound core while lower-priority tasks run on other cores. Operating system design should accommodate considerations such as processor affinity without degrading real-time determinism and responsiveness. In the context of a real-time operating system, other key factors such as priority scheduling and interrupt latency should be preserved in multicore architectures.
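On platforms that expose it, binding can be requested from user code. The sketch below uses Python's os.sched_setaffinity and os.sched_getaffinity, which are available on Linux and some other Unix systems. It is not a LynxOS API, and the helper name run_pinned is hypothetical. Note that this is the rigid binding form: the kernel will not move the work to another core even if the chosen core is busy.

```python
import os


def run_pinned(core, func, *args):
    """Run func(*args) with the calling process bound to one core, then restore.

    Returns (result, affinity_seen), where affinity_seen is the CPU set in
    effect while func ran, or None if binding is unavailable on this platform.
    """
    if not hasattr(os, "sched_setaffinity"):
        return func(*args), None

    try:
        original = os.sched_getaffinity(0)  # pid 0 means the caller
        if core not in original:
            core = min(original)            # fall back to a core we may use
        os.sched_setaffinity(0, {core})
    except OSError:
        return func(*args), None            # binding refused; run unbound

    try:
        seen = os.sched_getaffinity(0)
        return func(*args), seen
    finally:
        os.sched_setaffinity(0, original)   # undo the binding
```

For example, run_pinned(0, sum, [1, 2, 3]) returns the sum together with the single-core CPU set that was in effect, or None where the call is not supported.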

An SMP-enabled real-time operating system must schedule tasks dynamically and transparently between processors to balance workloads efficiently across the available cores, while preserving the key elements of real-time latency and determinism. If the operating system “bounces” the application from core to core, the application will take additional Translation Lookaside Buffer (TLB) and cache misses, reducing performance. On the other hand, if the application is “pinned” to a core, there may be enough additional demand placed on that core to slow down the application, compared to running it on a single core.

Taking full advantage of multicore processors

To maximize multicore performance, application parallelism, SMP-enabled OSs, and embedded hypervisor technologies should be explored.

Application parallelism maximizes CPU utilization

All applications should be carefully examined for opportunities to parallelize the tasks. In parallel computing, an application is broken down into threads that execute independently on separate cores (Figure 1). Application parallelism is dependent on the ratio of computation to communication overhead. The computation is the amount of time the CPU spends executing application code. The communication overhead is the amount of time that the OS spends in communicating between cores. In a typical multicore architecture, the communication overhead indicates how often messages are sent between different cores. The more threads an application has, the higher the chances that they are scheduled on different cores, which in turn increases the communication overhead.

Figure 1: In parallel computing, an application is broken down into threads that execute independently on separate cores.

Each type of system has different characteristics, but when optimizing application parallelism to maximize performance, there are broadly two types of application parallelism that can be used:

1. Coarse-grained parallelism is characterized by large, single-threaded tasks and low communication overhead. In this case, the ratio of computation to communication overhead is high: the communication overhead is small compared with the computation time, which yields better multicore performance.

2. Fine-grained parallelism is characterized by small, multithreaded tasks and high communication overhead. In this case, the ratio of computation to communication overhead is low: the communication overhead is large compared with the computation time, which yields lower multicore performance.
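The effect of the computation-to-communication ratio can be estimated with a deliberately simple model. The sketch assumes that the computation divides evenly across cores and that communication overhead is a fixed cost that does not overlap with computation. Both are simplifying assumptions of this example, and the function name estimated_speedup is hypothetical.

```python
def estimated_speedup(compute_time, comm_time, cores):
    """Estimate speedup over one core for a given communication overhead.

    compute_time: total CPU time spent in application code on one core
    comm_time:    time spent communicating between cores when parallelized
    cores:        number of cores the work is spread across
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if cores == 1:
        return 1.0  # no inter-core communication on a single core
    parallel_time = compute_time / cores + comm_time
    return compute_time / parallel_time


# Coarse-grained: high computation-to-communication ratio (100:5).
coarse = estimated_speedup(100.0, 5.0, 4)

# Fine-grained: low ratio (100:100); communication dominates.
fine = estimated_speedup(100.0, 100.0, 4)
```

Under these assumptions, the coarse-grained case runs about 3.3 times faster on four cores, while the fine-grained case comes out below 1.0, meaning it would run slower than on a single core.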

Applications that are CPU-bound can exploit the full power of multicore architectures since they are coarse-grained, while memory-bound or I/O-bound applications (fine-grained) may need to be optimized to avoid the bottlenecks that arise due to the communication overhead in symmetric multiprocessing architectures.

POSIX-based OSs provide a rich environment of threading functionality to make it easy for developers to implement parallelism in their applications. Developers must consider the design trade-offs of using multithreading versus non-multithreading to harness the power of multiple processor cores. In some instances, applications may perform better on a single-core system.
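A coarse-grained decomposition can be sketched as follows: the work is split into one large chunk per worker, each chunk is computed independently, and the only communication is the final combination of partial results. Python's concurrent.futures is used here as a stand-in for POSIX threads, and the function names are hypothetical. In CPython, threads do not run CPU-bound code in parallel, so this shows the structure of the decomposition rather than a measured speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, workers):
    """Split range(n) into `workers` contiguous, nearly equal (start, stop) pairs."""
    base, extra = divmod(n, workers)
    bounds, start = [], 0
    for i in range(workers):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def sum_of_squares(start, stop):
    """The independent unit of work: no shared state, no locking."""
    return sum(i * i for i in range(start, stop))


def parallel_sum_of_squares(n, workers=4):
    """Compute sum(i*i for i in range(n)) as one large chunk per worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda b: sum_of_squares(*b), chunk_bounds(n, workers))
        return sum(partials)  # the only point where results are combined
```

Because each worker touches only its own chunk, the design needs no condition variables, semaphores, or message queues, which keeps the synchronization overhead low.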

Multicore optimization with SMP OS and hypervisor technology

Another approach to multicore optimization centers around choosing an appropriate OS. An SMP-enabled OS can help add concurrency to an application by balancing the threads running on multiple CPUs and maintaining a deterministic hard real-time performance level.

But what if you could get even more control over how the OS runs on the multicore CPU? A new trend emerging in multicore environments is the use of a small hypervisor operating system, which abstracts the capabilities of hardware and allows multiple heterogeneous operating system instances to run on a single hardware platform. A Type 1 hypervisor, such as LynxSecure from LynuxWorks (Figure 2), runs directly on the hardware and has complete control of the platform, providing superior utilization of processor resources. In the SMP-enabled hypervisor, a single copy of the hypervisor can allow a single guest operating system to utilize multiple cores. The same hypervisor can enable asymmetric multiprocessing (AMP) by dedicating a core to each guest operating system. This can be extended to allow AMP and SMP on the same platform through judicious allocation of guest operating systems to single or multiple cores, thereby increasing processor utilization significantly.
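The allocation of guests to cores can be thought of as a partition table. The sketch below is a generic illustration and does not reflect LynxSecure's actual configuration format: the guest names and the function validate_partition are hypothetical, and the rule that no core is shared between guests is an assumption of this example.

```python
def validate_partition(guest_cores, total_cores):
    """Check that guests own disjoint sets of valid cores.

    guest_cores: dict {guest_name: set of core ids}
    total_cores: number of cores on the platform
    Returns a dict {guest_name: "SMP" or "single-core"}.
    """
    owner = {}
    modes = {}
    for guest, cores in guest_cores.items():
        if not cores:
            raise ValueError(f"{guest} has no cores assigned")
        for core in cores:
            if not 0 <= core < total_cores:
                raise ValueError(f"{guest} uses nonexistent core {core}")
            if core in owner:
                raise ValueError(f"core {core} shared by {owner[core]} and {guest}")
            owner[core] = guest
        # A guest spanning several cores runs SMP internally.
        modes[guest] = "SMP" if len(cores) > 1 else "single-core"
    return modes


# Mixed layout on a quad-core part: one dedicated core plus one SMP guest.
layout = validate_partition({"rtos": {0}, "linux": {1, 2, 3}}, total_cores=4)
```

Here the hypothetical "rtos" guest gets a dedicated core in AMP fashion, while the "linux" guest runs SMP across the remaining three.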

Figure 2: A Type 1 hypervisor runs directly on the hardware and has complete control of the platform.

John Blevins is the Director of Product Marketing and Tools Development at LynuxWorks, with more than 25 years of software experience in the embedded industry. Contact him at jb@lnxw.com.

LynuxWorks 408-979-3900 www.lynuxworks.com