Solving the processor challenges for safety-critical software
October 12, 2011
Multicore, hyperthreading, Dynamic Frequency Scaling (DFS), and DMA are modern processor features that optimize average-case execution times. These optimizations create challenges for designers of safety-critical software, who must focus on worst-case behavior. However, these challenges can be successfully mitigated.
Modern processor features such as multiple cores, hyperthreading, and high-speed DMA are designed to optimize average-case execution times. However, these optimizations often come at the expense of worst-case execution times and make systems more difficult to bound. This situation presents significant challenges to developers of safety-critical software, who must design for worst-case behavior. The following discussion examines why the development process focuses on worst-case behavior, describes some of the key processor-related challenges facing developers of safety-critical software, and outlines ways of addressing them.
Why focus on worst-case behavior?
In a safety-critical software environment, one must ensure three key things:
First, each periodic thread (or task) must always execute at its defined rate (for example, 100 Hz). This is important because each thread must perform at a given rate or else the system can become unstable and, hence, unsafe.
Second, each periodic thread must be allocated a fixed time budget that it cannot exceed (for example, 200 microseconds at 100 Hz). This is important because it allows the underlying RTOS to enforce time partitioning.
Third, each periodic thread’s fixed time budget must be adequate to cover the thread’s worst-case behavior. This is important because many safety-critical threads must execute to completion in every period. If they do not, the system can become unstable and, as a result, unsafe.
Note that this set of requirements stands in stark contrast to noncritical software systems, where one wants overall performance at the highest level, but can tolerate occasional “glitches” where performance is slower than average.
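The three requirements above can be expressed as simple checks on a thread table. The following is a minimal sketch in Python, not an RTOS interface: the PeriodicThread type, its field names, and the example numbers are assumptions made for illustration. The utilization figure is only a necessary sanity check, not a complete schedulability analysis.

```python
from dataclasses import dataclass


@dataclass
class PeriodicThread:
    """Hypothetical description of one periodic thread (illustration only)."""
    name: str
    rate_hz: float     # required execution rate
    budget_us: float   # fixed time budget per period, in microseconds
    wcet_us: float     # measured or analyzed worst-case execution time

    @property
    def period_us(self) -> float:
        return 1_000_000.0 / self.rate_hz


def check_thread(thread: PeriodicThread) -> list:
    """Return the list of requirements this thread violates (empty if none)."""
    problems = []
    if thread.budget_us > thread.period_us:
        problems.append("budget exceeds period")
    if thread.wcet_us > thread.budget_us:
        problems.append("budget does not cover worst case")
    return problems


def total_utilization(threads) -> float:
    """Fraction of one core reserved by all budgets; must not exceed 1.0."""
    return sum(t.budget_us / t.period_us for t in threads)


if __name__ == "__main__":
    threads = [
        PeriodicThread("control", rate_hz=100, budget_us=200, wcet_us=180),
        PeriodicThread("display", rate_hz=50, budget_us=1000, wcet_us=1200),
    ]
    for t in threads:
        print(t.name, check_thread(t))
    print("utilization:", total_utilization(threads))
```

In this example the "display" thread is flagged because its worst case (1,200 microseconds) exceeds its budget (1,000 microseconds), even though its average case might fit comfortably.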
Multicore and cache/memory contention
CPU throughput has roughly doubled every 18 months since 1985, consistent with Moore’s Law. However, that trend began slowing in about 2005 because of three key factors. The first and most significant is that memory speed has not kept up with CPU performance, increasing only about 10 percent per year during this same timeframe. Larger caches help alleviate this problem, but memory subsystems remain significant performance bottlenecks.
Theoretically, greater parallelism should increase peak performance by enabling the CPU to process multiple instructions concurrently. However, techniques like pipelining, branch prediction, and speculative execution have begun to “hit a wall,” making it increasingly difficult to exploit that parallelism.
Thermal factors have also slowed the advance of CPU throughput. As operating frequencies increase, power consumption and heat generation increase proportionally. Dissipating this heat presents difficult challenges in many environments, particularly for passively cooled embedded systems.
Recently, multicore processors have evolved to meet many of these challenges. To boost memory throughput, for example, each CPU core is equipped with its own L1 cache. Tighter physical packaging also boosts performance by shortening signal runs between cores, which makes data transfers proportionally faster and more reliable. Meanwhile, multiple cores enable processors to execute more instructions per clock cycle. This enables each core to run at a lower frequency, thereby consuming less power and generating less heat.
Despite these advances, multicore processors still present challenges for developers of safety-critical software: primarily, increased contention for shared resources such as L2 cache and the memory subsystem. Figure 1 shows a simple dual-core processor, each core with its own CPU and L1 cache, both cores sharing an L2 cache and a RAM subsystem.
Figure 1: A simple dual-core processor, each core with its own CPU and L1 cache, both cores sharing an L2 cache and a RAM subsystem
The values listed on the left side represent the “cost” that each CPU incurs when accessing a given resource. For example, say it costs one cycle for the CPU to access its local L1 cache. If that access misses and the CPU has to go to the L2 cache, it costs 10 cycles. If the L2 cache misses and the CPU has to go to RAM, the cost is 100 cycles. If the cache is “dirty” and “write-backs” are needed, performance is even worse. Note that these numbers aren’t intended to be exact, and will vary from processor to processor, but the relative orders of magnitude are typical. The important point is that the further out the CPU has to reach to access data, the more time the data transfer takes.
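To see why these costs matter for worst-case design, compare the average cost of a memory access with its worst-case cost. The sketch below uses the illustrative 1/10/100-cycle figures from the text; the hit rates are assumed values chosen only for the example, and write-back penalties are ignored.

```python
# Illustrative access costs in CPU cycles, taken from the discussion above.
L1_COST = 1
L2_COST = 10
RAM_COST = 100


def average_cost(l1_hit_rate: float, l2_hit_rate: float) -> float:
    """Expected cycles per access.

    l2_hit_rate is the hit rate among accesses that already missed L1.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return (l1_hit_rate * L1_COST
            + l1_miss * l2_hit_rate * L2_COST
            + l1_miss * l2_miss * RAM_COST)


def worst_case_cost() -> int:
    """Every access misses both caches and goes to RAM."""
    return RAM_COST


if __name__ == "__main__":
    avg = average_cost(l1_hit_rate=0.95, l2_hit_rate=0.80)  # assumed hit rates
    print(f"average: {avg:.2f} cycles, worst case: {worst_case_cost()} cycles")
```

With the assumed hit rates, the average works out to about 2.35 cycles per access, while the worst case remains 100 cycles. A budget based on the average would be off by more than an order of magnitude if the caches stopped helping.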
Contention arises when multithreaded processes on a CPU simultaneously compete for that core’s L1 cache, and when multiple cores simultaneously compete for the shared L2 cache and memory subsystem. Even with a single-core processor, the CPU can easily overwhelm the memory subsystem. In a multicore system, where multiple cores must contend for shared memory resources, the memory access bottleneck is much worse.
Slack scheduling and cache partitioning
One way that developers can mitigate memory contention and harness the power of multiple cores while still meeting worst-case execution requirements is to utilize a real-time operating system that is optimized for safety-critical applications. DDC-I’s Deos, for example, provides cache partitioning and slack scheduling facilities that alleviate memory access bottlenecks, enhance determinism, and increase CPU utilization for safety-critical applications spanning one or more cores.
Cache partitioning reduces memory contention and worst-case execution time by enabling designers to dedicate a portion of the cache to each core. With this physical partitioning, the total amount of cache available to each core is reduced. However, overall contention is reduced, as multiple cores no longer share the same resource.
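The trade-off can be illustrated with a toy software model of a shared cache. This is only a sketch under stated assumptions: it models an 8-line cache with least-recently-used replacement, a core 0 that loops over a 4-line working set, and a core 1 that streams through memory without reuse. It does not represent how Deos or any particular processor implements partitioning.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache with least-recently-used replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, key) -> bool:
        """Return True on a hit; on a miss, load the line and evict the LRU one."""
        if key in self.lines:
            self.lines.move_to_end(key)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[key] = True
        return False


def core0_misses(cache0, cache1, core1_active: bool, rounds: int = 50) -> int:
    """Count core 0 misses while it loops over a 4-line working set."""
    misses = 0
    stream = 0
    for _round in range(rounds):
        for line in range(4):
            if not cache0.access(("core0", line)):
                misses += 1
            if core1_active:
                # Core 1 touches two new lines per core 0 access, never reusing them.
                for _burst in range(2):
                    cache1.access(("core1", stream))
                    stream += 1
    return misses


def shared(core1_active: bool) -> int:
    cache = LRUCache(8)  # both cores use the same 8 lines
    return core0_misses(cache, cache, core1_active)


def partitioned(core1_active: bool) -> int:
    # Each core gets a private 4-line partition.
    return core0_misses(LRUCache(4), LRUCache(4), core1_active)


if __name__ == "__main__":
    for active in (False, True):
        print(f"core 1 active={active}: "
              f"shared={shared(active)} misses, "
              f"partitioned={partitioned(active)} misses")
```

In this model, core 0 sees 4 misses with the shared cache when core 1 is idle but 200 misses (every access) when core 1 is streaming. With partitioning, core 0 sees 4 misses in both cases: it has less cache, but its miss count no longer depends on what the other core is doing, which is what makes the worst case easier to bound.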
Slack scheduling, meanwhile, takes advantage of the fact that the average thread execution time is typically much shorter than the worst-case execution time. For those threads where the actual execution time is less than worst-case budgeted time, the RTOS reclaims the unused time and reallocates it to other threads, thereby boosting overall system performance.
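The basic accounting behind slack reclamation can be sketched as follows. This is a simplified illustration, not the Deos algorithm: the function name, the tuple layout, and the first-come-first-served grant policy are assumptions, and a real scheduler must also account for when within the period the slack becomes available.

```python
def run_period(budgeted_threads, slack_requests):
    """Simulate slack accounting for one period.

    budgeted_threads: list of (name, budget_us, actual_us)
    slack_requests:   list of (name, wanted_us) for threads wanting extra time
    Returns (remaining_slack_us, grants) where grants maps name -> granted_us.
    """
    slack_pool = 0
    for name, budget_us, actual_us in budgeted_threads:
        if actual_us > budget_us:
            # An RTOS that enforces time partitioning would not allow this.
            raise ValueError(f"{name} exceeded its budget")
        slack_pool += budget_us - actual_us  # reclaim the unused time

    grants = {}
    for name, wanted_us in slack_requests:
        granted = min(wanted_us, slack_pool)
        grants[name] = granted
        slack_pool -= granted
    return slack_pool, grants


if __name__ == "__main__":
    remaining, grants = run_period(
        budgeted_threads=[("nav", 200, 120), ("display", 500, 350)],
        slack_requests=[("logger", 100), ("map_prefetch", 300)],
    )
    print(grants, "remaining slack:", remaining)
```

Here the two budgeted threads leave 230 microseconds unused, so "logger" receives its full 100 microseconds and "map_prefetch" receives the remaining 130. The budgets themselves are untouched, so worst-case guarantees for the budgeted threads still hold.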
Hyperthreading (HT)
Hyperthreading (HT) allows increased parallelization of computations by duplicating the parts of a processor that store application state without duplicating the processor’s main processing engine (CPU). In this way, an HT processor appears as two logical processors to the RTOS. HT technology can also be used in a multicore setting, where each physical core presents two logical cores.
The advantage of HT processors is increased parallelization of application code, and improved reaction and response times. Some HT processors, for example, have shown performance improvements of up to 30 percent as compared to non-HT processors. Unfortunately, realizing this performance is difficult with safety-critical software, as HT increases contention for the cache and memory subsystem, and makes the system more difficult to bound. As such, HT must be disabled in many safety-critical applications.
Dynamic Frequency Scaling (DFS)
DFS (also known as CPU throttling) allows the frequency of a processor’s clock to be adjusted in real time, either to conserve power or reduce the amount of heat generated by the chip. Though primarily used in battery-powered mobile devices, DFS can also be used in passively cooled avionics systems that must meet stringent heat profiles using only ambient air. DFS is generally used in conjunction with Dynamic Voltage Scaling (DVS), as the frequency is proportional to operating voltage, and power consumption increases as the square of voltage.
DFS and DVS can save power and reduce heat, but in a safety-critical environment, they are problematic because they also reduce the number of instructions a processor can issue in a given amount of time (including slowing down memory bus access). Consequently, performance might be reduced in an unpredictable fashion that is difficult to bound. DFS and DVS can be disabled if power consumption is not a gating factor. Alternatively, designers who want to utilize DFS and DVS can do so by measuring worst-case performance while running the processor at the lower frequency/voltage, and then budgeting accordingly.
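As a rough planning aid, the effect of the lowest permitted frequency on a thread's budget can be estimated before measuring. The sketch below is a first-order approximation under stated assumptions: it treats execution time as cycles divided by frequency (memory latency does not scale this cleanly), and it uses the common approximation that dynamic power scales with frequency times voltage squared. The function names and numbers are illustrative, and measurement at the lower frequency, as described above, remains the basis for the actual budget.

```python
def execution_time_us(worst_case_cycles: int, freq_mhz: float) -> float:
    """Time to execute a cycle count at a given clock (MHz = cycles per microsecond)."""
    return worst_case_cycles / freq_mhz


def budget_for_dfs(worst_case_cycles: int, allowed_freqs_mhz) -> float:
    """Budget against the slowest frequency the system may select."""
    return execution_time_us(worst_case_cycles, min(allowed_freqs_mhz))


def relative_dynamic_power(freq_ratio: float, voltage_ratio: float) -> float:
    """Dynamic power relative to nominal, assuming P is proportional to f * V^2."""
    return freq_ratio * voltage_ratio ** 2


if __name__ == "__main__":
    cycles = 120_000  # assumed worst-case cycle count for one thread
    print("at 1000 MHz:", execution_time_us(cycles, 1000), "us")
    print("budget with DFS:", budget_for_dfs(cycles, [1000, 800, 600]), "us")
    print("relative power at 60% f, 80% V:", relative_dynamic_power(0.6, 0.8))
```

In this example, a thread that needs 120 microseconds at 1,000 MHz needs a 200-microsecond budget if the clock may drop to 600 MHz.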
Direct Memory Access (DMA)
DMA boosts performance by allowing devices to move large amounts of data (including map displays and terrain databases) to and from system memory without involving the CPU, thereby freeing the CPU to do other work. For safety-critical software, the main disadvantage of DMA is that it operates outside the control of the CPU and the Memory Management Unit (MMU). Thus, a flaw in the DMA controller can break space partitioning. One way to mitigate this problem is to use an RTOS with special DMA controller software that meets the highest level of design assurance.
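One kind of check that trusted DMA controller software might perform is validating each requested transfer against the memory regions owned by the requesting partition before programming the hardware. The following is a hypothetical sketch of that idea only: the Region type, the function name, and the addresses are invented for illustration and do not describe any specific RTOS or DMA controller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous memory region owned by a partition (hypothetical)."""
    base: int
    size: int

    def contains(self, addr: int, length: int) -> bool:
        """True if [addr, addr + length) lies entirely inside this region."""
        return (length > 0
                and self.base <= addr
                and addr + length <= self.base + self.size)


def dma_transfer_allowed(partition_regions, src: int, dst: int, length: int) -> bool:
    """Allow a transfer only if source and destination stay inside the partition."""
    def inside(addr: int) -> bool:
        return any(r.contains(addr, length) for r in partition_regions)

    return inside(src) and inside(dst)


if __name__ == "__main__":
    regions = [Region(base=0x1000, size=0x1000)]  # partition owns 0x1000..0x1FFF
    print(dma_transfer_allowed(regions, src=0x1000, dst=0x1800, length=0x400))
    print(dma_transfer_allowed(regions, src=0x1000, dst=0x1E00, length=0x400))
```

The first transfer stays inside the partition's region and is allowed; the second would write past the end of the region and is rejected, preserving space partitioning even though the MMU never sees the DMA traffic.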
With the help of an RTOS like Deos, designers of safety-critical systems can reap the performance benefits of advanced processors with multiple cores, high-speed DMA, and DFS without compromising worst-case execution time. Not all advanced processor features, however, are well suited to safety-critical applications. Some, such as hyperthreading, are ideal for boosting average performance but lack the determinism required for safety-critical applications and must be disabled.
Tim King is the Technical Marketing Manager at DDC-I. He has more than 20 years of experience developing, certifying, and marketing commercial avionics software and RTOSs. Tim is a graduate of the University of Iowa and Arizona State University, where he earned master’s degrees in Computer Science and Business Administration, respectively. He can be contacted at [email protected].
DDC-I 602-275-7172 www.ddci.com