Military Embedded Systems

Cores and threads: Hybrid processors for today’s multitasking world

February 13, 2024

Aaron Frank

Curtiss-Wright

The incredible growth of processing parallelism has produced a corresponding explosion of performance and capabilities, but not all cores and threads are created equal. For mainstream computer users – the vast majority of Windows users, for example – the details of how cores and threads are used don't matter: After editing a document, we hit the <SAVE> icon and the magic happens under the hood. But for designers of critical real-time processing systems, what happens under the hood matters. With a detailed understanding of the latest hybrid core processor enhancements, military embedded systems designers – whether designing for land, sea, or air use – can build more deterministic and responsive processing systems while maintaining better control over power consumption, resulting in SWaP [size, weight, and power] savings and longer-duration missions.

Today, finding a processor with just a single processing core is difficult. In 2000, IBM introduced the concept of a dual-core processor with their POWER4 processor. AMD followed in 2005 with the Opteron 800 series and Athlon 64 X2 processors, each with two processing cores. Intel achieved commercial success with its dual-core Core 2 Duo processor in 2006.

Today, almost two decades later, it is not uncommon to see data centers running tens of thousands of processors, each with 64 or more cores. In addition to multiple processing cores, many architectures also support hyper-threading, which enables a processing core to execute two independent instruction threads simultaneously, mimicking a dual-core processor. Thus, a 64-core dual-threading processor can execute 128 independent threads simultaneously. Taken to the extreme, today's high-end graphics processing units (GPUs) can execute thousands of operations simultaneously, which is fundamental for highly parallel 3D visualization and complex AI [artificial intelligence] processing tasks. (Figure 1.)

[Figure 1 ǀ Processing cores are shown in a CPU versus a GPU.]

Processor evolution and architecture

Figure 2 illustrates a simplified view of a generic single-core processor. Important to this discussion is the data flow to and from the processing core. With few exceptions, a processor is paired with external main memory, where instructions and data are stored. Accessing even today's fastest DRAM memory subsystems is slow compared to the speed at which the core operates. To ensure the processing core does not sit idle waiting on memory, most processors incorporate cache memory: a region of extremely fast local memory that operates at core speed and mirrors regions (sometimes referred to as pages) of the external DRAM. If instructions and data are preloaded into the local cache, the processing core can run at full speed without waiting. If the needed instructions or data are not in the cache, however, the core stalls while the rest of the CPU fetches the required data from external DRAM into the cache. This cache miss results in a loss of performance.
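To make the cost of cache misses concrete, the short C sketch below (illustrative only; the array size and timings are arbitrary assumptions) sums the same 64 MB array twice: once walking memory sequentially, so data streams through the cache, and once striding across it column by column, which forces far more cache misses and typically runs several times slower on the same core.

/*
 * Illustrative sketch: same work, two memory-access patterns.
 * Row-by-row traversal walks memory sequentially and benefits from cache;
 * column-by-column traversal strides across memory and misses far more often.
 * Results vary with the processor and its cache sizes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* 4096 x 4096 ints, roughly 64 MB, larger than any cache */

static long sum_rows(int (*m)[N])      /* cache-friendly: sequential access */
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

static long sum_cols(int (*m)[N])      /* cache-hostile: strided access */
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}

int main(void)
{
    int (*m)[N] = malloc(sizeof(int[N][N]));
    if (!m) return 1;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = 1;

    clock_t t0 = clock();
    long a = sum_rows(m);
    clock_t t1 = clock();
    long b = sum_cols(m);
    clock_t t2 = clock();

    printf("row-major sum=%ld  %.3f s\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("col-major sum=%ld  %.3f s\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(m);
    return 0;
}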

[Figure 2 ǀ A simplified view shows a generic single-core processor.]

Because cache memory is expensive in terms of silicon space, most processors have multiple levels of cache. The cache closest to the processor, called the L1 cache, is the smallest and fastest; caches grow progressively larger and slower the further they sit from the processing core (L2 cache, L3 cache, etc.). In a multicore processor, each core typically has its own L1 cache, and multiple cores often share L2 or L3 cache regions. Figure 3 illustrates two generic quad-core processors. One has only two cache levels, with an L1 cache for each core and an L2 cache shared among the four cores. The second example has an L1 cache for each core, an L2 cache shared by each pair of cores, and an L3 cache shared across all four cores.

[Figure 3 ǀ A view of multicore processors shows multiple and shared cache levels.]

It is important to note that in a multicore processor, the architecture has areas where multiple cores share common resources. It may be a shared cache memory region, a shared main interface bus, or a shared memory controller. The implication is that two cores cannot operate fully independently – there will be some interaction due to contention for shared resources. Thus, a dual-core architecture cannot provide a full doubling of performance compared to a single-core architecture, and a quad-core processor will not provide four times the performance. Real-world conditions reduce the gain to something less.
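One simple way to put numbers on this effect (a back-of-the-envelope model, not a measurement) is Amdahl's law: if some fraction of the work must serialize on a shared resource, the speedup available from N cores is capped well below N. The small C sketch below assumes, purely for illustration, that 10% of the work serializes on a shared cache, bus, or memory controller.

/* Back-of-the-envelope sketch using Amdahl's law: if a fraction "serial"
 * of the work serializes on a shared resource, the best-case speedup
 * from n cores is 1 / (serial + (1 - serial) / n).
 * The 10% serial fraction is an assumption chosen only for illustration.
 */
#include <stdio.h>

static double amdahl(double serial, int n)
{
    return 1.0 / (serial + (1.0 - serial) / n);
}

int main(void)
{
    double serial = 0.10;   /* assumed serialized fraction of the workload */
    for (int n = 1; n <= 8; n *= 2)
        printf("%d cores -> %.2fx speedup\n", n, amdahl(serial, n));
    /* prints roughly 1.00x, 1.82x, 3.08x, and 4.71x for 1, 2, 4, and 8 cores */
    return 0;
}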

Multicore versus hyper-threading

In 2002, Intel introduced the concept of hyper-threading. In a true multicore processor, the entire core is duplicated and each core has its own L1 cache, as shown on the left of Figure 4. A hyper-threading core, by contrast, is a single core that appears to the operating system (OS) as two logical cores, as shown on the right of Figure 4. Hyper-threading is accomplished with an internal superscalar architecture in which multiple instruction streams operate on independent instructions and data in parallel.

[Figure 4 ǀ Shown: multicore versus hyper-threading core.]

A hyper-threading core has more shared resources than two independent cores, so its overall performance in real-world applications is correspondingly lower than that of two separate processing cores.

Whereas Intel x86 and NXP Power Architecture provide hyper-threading cores in many of their processors, the Arm architecture does not offer hyper-threading: An Arm core is simply a single-threaded core. A 16-core Arm processor such as the NXP LX2160A provides 16 fully independent cores and can execute 16 independent threads. In contrast, an eight-core Intel processor such as the Tiger Lake Xeon W-11865MRE provides eight hyper-threading cores and presents 16 logical processing cores to the OS.
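As a quick illustration of how those logical cores surface to software, the minimal Linux sketch below (the calls are standard glibc; only the interpretation of the counts is processor-specific) asks the OS how many logical processors it sees. On the eight-core hyper-threading Xeon it reports 16, and on the 16-core LX2160A it also reports 16, even though the underlying hardware is quite different.

/* Minimal Linux example: report the logical processor count the OS sees.
 * On an 8-core hyper-threading part this prints 16 (2 threads per core);
 * on a 16-core Arm part without hyper-threading it also prints 16,
 * but those are 16 fully independent physical cores.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);   /* logical CPUs online */
    long total  = sysconf(_SC_NPROCESSORS_CONF);   /* logical CPUs configured */
    printf("logical processors online: %ld (of %ld configured)\n", online, total);
    return 0;
}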

big.LITTLE architecture

In 2011, ARM Holdings introduced what it called the big.LITTLE architecture. Realizing that real-world multitasking systems have a wide range of processing and performance needs, the architecture pairs "big" cores optimized for high performance with "LITTLE" cores that trade some performance for higher efficiency. Systems built on big.LITTLE processors direct background and noncritical functionality to the LITTLE cores and foreground, user-oriented functionality to the big cores. The goal of a big.LITTLE processor is ultimately to save power, a critical resource in battery-operated equipment such as laptops and cell phones, and to ensure responsiveness to users. For example, the Apple iPhone 13 uses the Apple-designed A15 Bionic Arm processor with two big cores and four LITTLE cores. When not actively in use, the phone utilizes only the LITTLE cores, putting the big cores to sleep to reduce power consumption.

Adoption of a hybrid processor core architecture

While not the first to make use of a hybrid core architecture, Intel has introduced its equivalent to the Arm big.LITTLE architecture, offering hybrid core processors with what they call “performance” cores (aka big or P-cores) and “efficient” cores (aka LITTLE or E-cores).

The Intel 12th-gen "Alder Lake" and 13th-gen "Raptor Lake" processor families include embedded processor SKUs with up to 16 cores, consisting of eight performance P-cores and eight efficient E-cores. With the P-cores supporting hyper-threading, such a processor presents to the OS as 24 logical cores (16 threads from the eight hyper-threading P-cores plus eight single-threaded E-cores).

Using hybrid processing cores

Operating systems are now becoming aware of different application processing needs. Foreground processes, such as those interacting with users (applications in focus, visual displays, user interaction via mouse and keyboard, etc.), can be assigned to big/performance cores to provide the best user experience, while background activities (low-priority tasks, utility functions, system management, etc.) can be assigned to LITTLE/efficient cores where high performance is not required.
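Schedulers handle this placement automatically, but an application can also steer its own threads. The minimal Windows sketch below confines a background worker thread to a set of efficiency cores; the affinity mask assumes logical processors 12 through 19 are the E-cores (true of the i7-12700H discussed later in this article, but the numbering varies by SKU and should be confirmed for the specific processor).

/* Minimal Windows sketch: confine a background worker to assumed E-cores.
 * ASSUMPTION: logical processors 12-19 are the E-cores on this part;
 * always confirm the layout for the specific processor SKU in use.
 */
#include <windows.h>
#include <stdio.h>

static DWORD WINAPI background_work(LPVOID arg)
{
    (void)arg;
    /* low-priority housekeeping work would run here */
    return 0;
}

int main(void)
{
    DWORD_PTR ecore_mask = (DWORD_PTR)0xFF << 12;   /* bits 12-19 set */

    HANDLE worker = CreateThread(NULL, 0, background_work, NULL,
                                 CREATE_SUSPENDED, NULL);
    if (!worker) return 1;

    SetThreadAffinityMask(worker, ecore_mask);   /* restrict to E-cores */
    ResumeThread(worker);                        /* start it on those cores */
    WaitForSingleObject(worker, INFINITE);
    CloseHandle(worker);
    return 0;
}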

Core usage in multitasking operating systems

  • Intel Thread Director

To make the best use of P-cores and E-cores, Intel provides a technology called Thread Director. It enables the OS scheduler to assign tasks to P-cores and E-cores based on each task's characteristic need for performance versus efficiency.

  • Microsoft Windows 11

Under Windows 11, the Thread Director works closely with the Windows task scheduler, which has been enhanced to be aware of hybrid processor architectures. With this enhancement, the Windows 11 task scheduler considers P-cores, E-cores, and hyper-threads on P-cores when scheduling tasks. In addition, the Windows 11 task scheduler and the Intel Thread Director also monitor other processor parameters, such as clock speeds, power consumption, and thermal conditions.

Under Windows 11, workloads are monitored and classified as follows:

  • Class 0: Most applications
  • Class 1: Workloads using AVX/AVX2 instructions
  • Class 2: Workloads using AVX-VNNI instructions
  • Class 3: Bottleneck is not in the compute, e.g., I/O or busy loops that don’t scale

Anything in Class 3 is recommended for E-cores. Anything in Class 1 or 2 is recommended for P-cores, with Class 2 having higher priority. Everything else fits in Class 0, with frequency adjustments to optimize for IPC [instructions per cycle] and efficiency if placed on the P-cores. Even with all these conventions, the OS may still choose or be directed to assign any thread or class of workload to any core. Windows 11 also considers the computer’s selected power plan, where a high-performance power plan will perform differently than a balanced or battery-saver plan.

  • Microsoft Windows 10

While the Thread Director also works with Microsoft Windows 10, the Windows 10 task scheduler is not designed to work optimally with it. Under Windows 10, the scheduler assigns P-cores to the application in focus, meaning the currently highlighted application window. If an application is taken out of focus, either by minimizing it or highlighting a different application window, the Thread Director reassigns the application to E-cores. Some users have reported mixed results with these processors under Windows 10, the main concern being that applications that are inactive or out of focus underperform when directed to E-cores.

Figure 5 shows the core usage of an Intel Alder Lake i7-12700H processor with 20 logical processor threads (six hyper-threaded P-cores plus eight E-cores). The first 12 graphs (from top left) show the workload of the P-core threads, and the last eight graphs show the E-core threads. The left side of the figure shows that all 20 threads are in use when the application is in focus, driving the processor to an overall utilization of 77%. The right side shows that when the application is minimized or taken out of focus, it is removed from the P-cores and executes only on the lower-performance E-cores, driving them to maximum usage. The overall processor utilization drops to 53%, which reflects that the application is likely underperforming on the E-cores while the P-cores sit mostly idle.

Also of interest: These screen captures provide a process count and a thread count, which give insight into the total number of application processes and threads the OS is concurrently managing. In these examples, there are 229 and 223 processes running, and 3,345 and 3,349 application threads. While many of these processes and threads may be sleeping or idle, most will wake up periodically to perform a task or status check.

[Figure 5 ǀ An Alder Lake processor under Windows 10 showing an application in focus and out of focus.]

  • Linux

As of early 2023, apparently to address some reported performance bugs with the 12th-gen Alder Lake processors, Intel has added some, but not all, aspects of Thread Director interaction to Linux kernel 5.18. Officially, however, Intel has stated publicly only that Windows 11 is their priority and that they will upstream a variety of features to the Linux kernel over time. More recently, it has been reported that Linux kernel 6.2 adds further support for Intel's 13th-gen Raptor Lake hybrid processors, including enhancements for the Thread Director.

Linux users have always had the ability to manually assign processes to logical processor cores using the taskset command. With specific knowledge of which logical processors are P-cores and which are E-cores, it is not difficult to manually assign process affinity to specific cores, as sketched below. This gives an embedded software developer considerable flexibility when using hybrid core processors.
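For example, on the i7-12700H described above, where logical processors 0 through 11 are the hyper-threaded P-core threads, taskset -c 0-11 ./myapp (myapp being a hypothetical program name) would confine a program to the P-cores. The same can be done from within a program; the sketch below is a minimal Linux example using sched_setaffinity(), with the P-core numbering carried over from that processor as an assumption.

/* Minimal Linux sketch: pin the calling process to assumed P-core threads.
 * ASSUMPTION: logical CPUs 0-11 are the hyper-threaded P-cores (as on the
 * i7-12700H discussed above); check lscpu or /sys/devices/system/cpu first.
 * Shell equivalent: taskset -c 0-11 ./myapp
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t pcores;
    CPU_ZERO(&pcores);
    for (int cpu = 0; cpu <= 11; cpu++)
        CPU_SET(cpu, &pcores);                  /* assumed P-core threads */

    if (sched_setaffinity(0, sizeof(pcores), &pcores) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* performance-critical work now runs only on the P-cores */
    puts("process restricted to the assumed P-core set");
    return 0;
}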

Hybrid cores going forward

Approximately 70% of mobile phones today operate with processors using the Arm big.LITTLE architecture. Intel, one of the largest processor suppliers, has adopted a hybrid (P-core/E-core) architecture with their last two generations of consumer processors and appears focused on extending this architecture to future generations. The mainstream commercial processing world has embraced the benefits of the hybrid core architecture.

While the aerospace and defense industry has yet to widely adopt the hybrid core processor, it is hard to ignore the potential benefits of this new technology, which promises an increase in processing efficiency. Size, weight, and power (SWaP) remain a primary consideration for all new developments, and any opportunity to increase processor efficiency will directly benefit a defense solution’s SWaP footprint.

Aaron Frank is senior product manager for Curtiss-Wright Defense Solutions and has been with the company since 2010. As a senior product manager within the C5ISR group, he is responsible for a wide range of COTS products utilizing advanced processing, video graphics/GPU and network-switching technologies in many industry-standard module formats (VME, VPX, etc.). His focus includes product development and marketing strategies, technology roadmaps, and being a subject-matter expert to the sales team and with customers. Aaron has a bachelor of science/electrical engineering degree from the University of Waterloo (Ontario).

Curtiss-Wright Defense Solutions      https://www.curtisswrightds.com/
