Developing high-performance embedded network security applications: A heterogeneous multicore processing approach
May 31, 2010
Today's network and computing infrastructure is of critical interest to our national security. The vital communication systems supporting our multinational forces are valuable resources that share a couple of important characteristics: They require massive amounts of bandwidth to support our insatiable appetite for IP-based services, and they require mechanisms offering visibility into all protocol and application layers of their data to ensure network security.
Unfortunately, the network appliances commonly hosting security applications have failed to keep pace with improvements in network performance. However, a new heterogeneous multicore processing architecture can scale to support tomorrow's throughput needs while providing the ability to see deeply into network traffic. (U.S. Army photo by Staff Sgt. Alex Licea)
The amount of network traffic in today’s wired and wireless infrastructure continues to rise at dramatic rates to keep up with voice/video/data services and real-time military applications. In both classified and unclassified military networks, line rates of 10 Gbps are commonplace and are expected to quickly grow to 100 Gbps in the next several years. These throughputs are largely the result of new IP-based network-centric warfare applications like real-time battlefield monitoring, video surveillance, and fully networked forces. The scalability of the network as a whole is a major risk to the effectiveness of any network-centric warfare program.
As network throughputs explode, we also need to be able to intelligently monitor our networks for exploits and to protect confidential data sources from breaches. An increasing threat to our homeland security is the growing number of high-profile cyber attacks on major military installations, our communications systems, government agencies, and financial markets and the resultant leakage of classified information and personal data. Even Google has been attacked recently in what could be an instance of state-sponsored corporate espionage. Compounding the problem, there are countless other attacks never publicized, but rather hidden by a web of obscurity for obvious confidentiality and national security reasons. The security applications responsible for protecting these critical resources need to keep pace with these increasing network throughputs with even greater network intelligence. Thus, communications equipment must provide complete visibility into network traffic at extremely high bandwidths by using content inspection to ascertain the nature of traffic, not just its destination.
Military networks already deploy an array of security applications to protect their classified and unclassified resources. These applications include virus scanning, firewalls, Intrusion Detection and Prevention Systems (IDS/IPS), Distributed Denial of Service (DDoS) mitigation, Data Loss Prevention (DLP), test and measurement, and network forensics solutions. These applications work almost entirely by providing Deep Packet Inspection (DPI) and flow analysis, looking for known patterns in network flows and blocking or recording them. With the need for application awareness, security processing, and DPI, the amount of processing power required for these computationally intense applications grows rapidly as line rates increase. However, these needs for increased visibility, throughput, and network processing power can be met by a heterogeneous multicore processing architecture.
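The pattern matching at the heart of DPI can be sketched in a few lines. The signatures and code below are purely illustrative assumptions; production IDS/IPS engines compile thousands of signatures into automata and run them with hardware assistance, not a linear scan.

```python
# Minimal illustration of signature-based deep packet inspection (DPI):
# scan reassembled payload bytes for known patterns. The signatures here
# are hypothetical examples, not real IDS rules.
SIGNATURES = {
    b"/etc/passwd": "path-traversal attempt",
    b"\x90\x90\x90\x90": "NOP sled (possible shellcode)",
}

def inspect(payload: bytes):
    """Return a list of (offset, description) for every signature hit."""
    hits = []
    for pattern, description in SIGNATURES.items():
        start = 0
        while (idx := payload.find(pattern, start)) != -1:
            hits.append((idx, description))
            start = idx + 1
    return hits

hits = inspect(b"GET /../../etc/passwd HTTP/1.1")
# each hit identifies where in the flow the pattern matched
```

A real engine would also reassemble TCP streams before matching, since patterns can straddle packet boundaries.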
Heterogeneous multicore processing paradigm solves paradox
Network and security applications can generally be viewed in several distinct network architectures. Compute-intensive applications like intrusion prevention systems are deployed as active elements sitting directly on the network wire (inline), processing every bit of data that traverses the application in real time. These active security appliances need to operate at network line rates with very low latency. Computation that adds just microseconds of delay to network traffic can ruin the effectiveness of real-time end-user applications like sensitive military telemetry systems or Voice over IP (VoIP). In 1 Gbps networks, developers typically treat 250 microseconds as the high-water mark for the delay any inline network element can add before end-user application performance degrades.
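A back-of-envelope calculation shows how tight the per-packet budget becomes at line rate. The figures below assume minimum-size Ethernet frames (64 bytes plus 20 bytes of preamble and interframe gap on the wire), the standard worst case for packet-processing throughput:

```python
# Per-packet time budget at line rate, assuming minimum-size Ethernet
# frames: 64 B frame + 20 B preamble/interframe gap = 84 B on the wire.
def packets_per_second(line_rate_bps: float, frame_bytes: int = 64,
                       overhead_bytes: int = 20) -> float:
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire

pps_10g = packets_per_second(10e9)   # ~14.88 million packets per second
budget_ns = 1e9 / pps_10g            # ~67 ns of processing time per packet
```

At roughly 67 ns per packet, even a single cache miss consumes a large fraction of the budget, which is why inline appliances cannot afford general-purpose software paths for every packet.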
Alternately, passive computing elements like network and computer forensics systems, intrusion detection systems, honeypots, and vulnerability scanners are not in the direct network path, but rather are deployed off a span port, network tap, or mirrored switch interface. These systems are responsible for collecting and analyzing terabytes of data from distributed sensors as would be typical in a battlefield node scenario. These passive monitoring devices can offer a thorough understanding of a network’s topology and which services are available, and scan to assess which vulnerabilities might be exposed on the network.
Network appliances deployed in either a passive or active network architecture share a common trait in that they must guarantee 100 percent traffic capture across all packet sizes to be effective. Missing any portion of data in a communications stream poses a large threat, making the overriding security application ineffective.
Meeting these performance challenges warrants a new approach to the development of the high-performance systems required by the intelligent network. Such systems need to be capable of analyzing traffic at all layers of the OSI model, from the data link layer (Layer 2) all the way into the application space (Layer 7) while performing this intelligent processing on all traffic at sustained throughputs of 10 Gbps and higher. Achievement of these goals requires specialized and varied processing elements, each custom designed for a specific type of workload computation.
A heterogeneous multicore architecture sets a new performance benchmark for embedded application development through separate and discrete processing elements for packet classification, stateful flow management, and application and control plane processing, each with increasingly fine granularity. This architecture tightly couples off-the-shelf Ethernet switch processors and network flow processor cores with general-purpose multicore x86 systems over a high-speed 40 Gbps, virtualized PCIe datapath. This architecture can be scaled from very low-end systems up to appliances offering hundreds of gigabits per second of packet analysis, stateful flow monitoring, DPI, and application throughput, all with a common software architecture. Accelerated designs based on a heterogeneous multicore architecture can enable equipment providers to deliver high-performance, flexible systems that are up to four times more efficient than systems based on general-purpose x86 processors alone with standard Network Interface Cards (NICs), as shown in Figure 1.
Figure 1: Designs based on a heterogeneous multicore architecture enable high-performance, flexible systems up to four times more efficient than traditional x86 systems and standard NICs.
Specialized packet, flow, and application workload processing
As shown in Figure 2, a distributed network acceleration architecture uses a multi-chip system to achieve maximum performance and application effectiveness. The three distinct processing stages function as shown.
Figure 2: A three-layered heterogeneous processing paradigm uses varied specialized processors to achieve maximum performance while keeping overall system costs low.
Ethernet packet processing
Off-the-shelf Ethernet switch processors are commonly used at this first stage, offering up to hundreds of Gbps of configurable packet processing spanning the data link, IP, and TCP/UDP packet layers. Traffic is classified on ingress and optionally filtered, cut through to another network interface, or load-balanced across the Network Flow Processors (NFPs) that sit logically behind the Ethernet switch processors.
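The load-balancing step must be flow-affine: every packet of a flow must reach the same NFP so that flow state stays in one place. A common way to achieve this is to hash the 5-tuple, sketched below with illustrative field names and an assumed NFP count:

```python
# Sketch of flow-affine load balancing: hash the 5-tuple so that every
# packet of a flow lands on the same network flow processor (NFP).
# The field encoding and NFP count are illustrative assumptions.
import hashlib

def nfp_for_flow(src_ip, dst_ip, src_port, dst_port, proto, num_nfps=4):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_nfps

a = nfp_for_flow("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
b = nfp_for_flow("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
# a == b: both packets of the same flow are steered to the same NFP
```

Hardware classifiers implement the same idea with fixed-function hash units rather than a cryptographic hash; the invariant that matters is deterministic flow-to-processor mapping.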
NFPs to accelerate higher-layer flow processing
NFPs containing a powerful array of microengine RISC processors are specialized, multicore devices optimized to offload burdensome workloads from general-purpose multicore CPUs. NFPs can handle lower-layer packet processing and accelerate higher-layer flow and application level processing. This accelerated architecture utilizes the network-optimized NFP cores for switching and routing, packet classification, stateful flow analysis, DPI, and dynamic flow-based load balancing. Other network processing functions such as TCP termination and SSL offload can also be performed on the NFPs and offloaded from the general-purpose CPUs. Traffic can be cleanly structured for transmission from the NFP to the general-purpose cores for application processing, thereby increasing host performance. Additionally, network flow processors provide hardware acceleration engines for PKI and bulk cryptography to assure line-rate throughput.
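Stateful flow analysis amounts to maintaining a per-flow record keyed by the 5-tuple and updating it on every packet. The sketch below shows the idea in miniature; the field names and structure are illustrative assumptions, not the NFP's actual data model:

```python
# Minimal stateful flow table sketch: per-flow packet and byte counters
# keyed by the 5-tuple, as a flow-analysis stage might maintain before
# deciding to forward, filter, or escalate a flow. Fields are illustrative.
from collections import defaultdict

class FlowTable:
    def __init__(self):
        self.flows = defaultdict(lambda: {"packets": 0, "bytes": 0})

    def update(self, five_tuple, payload_len):
        """Record one packet against its flow and return the flow state."""
        entry = self.flows[five_tuple]
        entry["packets"] += 1
        entry["bytes"] += payload_len
        return entry

table = FlowTable()
ft = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
table.update(ft, 512)
state = table.update(ft, 256)   # {'packets': 2, 'bytes': 768}
```

At 10 Gbps and above, the challenge is doing this for millions of concurrent flows with bounded memory and constant-time lookup, which is where the NFP's specialized memory subsystems earn their keep.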
PCIe communications path to x86 cores
General-purpose multicore x86 CPU(s) in a system are optimized for application and control plane processing. From the network flow processors, packets are passed to the x86 cores across a high-performance, virtualization-aware PCIe communications path. An efficient zero-copy technique allows packets to be transferred directly into user-space application memory, bypassing the operating system kernel and further accelerating application performance. Flows can be pinned directly to specific applications or load-balanced across parallel application instances to scale application performance.
As shown in Figure 3, through a cooperative set of software APIs between the x86 CPUs and NFP cores, the treatment of flows can also be updated in real time, offering the ability to change the treatment of a flow after x86 analysis. This type of flexibility is essential in situations where only a specific portion of a flow is of interest for inspection. After inspection is complete, all subsequent packets belonging to the flow can be filtered or cut through at the NFP layer, which conserves valuable PCIe bandwidth and reduces x86 CPU cycles. Through this heterogeneous architecture, the general-purpose multicore processors can focus on the compute workloads they are best suited for, such as behavioral heuristics, Perl Compatible Regular Expression (PCRE) processing, content inspection and analysis, and other similar workloads.
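The inspect-then-offload pattern described above can be sketched as follows. The byte threshold, class names, and "program the NFP" step are hypothetical assumptions standing in for the real driver API:

```python
# Sketch of inspect-then-offload: the x86 application examines only the
# first part of each flow, then asks the NFP layer to cut the flow through
# (bypassing the host) once inspection is complete. The threshold and API
# are hypothetical.
INSPECT_BYTES = 4096  # inspect only the first 4 KB of each flow

class FlowOffloader:
    def __init__(self):
        self.inspected = {}   # flow_id -> bytes seen so far
        self.offloaded = set()

    def on_packet(self, flow_id, payload_len):
        if flow_id in self.offloaded:
            return "cut-through"         # handled entirely at the NFP layer
        seen = self.inspected.get(flow_id, 0) + payload_len
        self.inspected[flow_id] = seen
        if seen >= INSPECT_BYTES:
            self.offloaded.add(flow_id)  # program the NFP to bypass the host
        return "to-host"                 # deliver to x86 for inspection

off = FlowOffloader()
off.on_packet("flow-1", 3000)   # "to-host"
off.on_packet("flow-1", 2000)   # "to-host" (crosses threshold, offload armed)
off.on_packet("flow-1", 1500)   # "cut-through"
```

The payoff is that long-lived bulk flows, once judged benign, stop consuming PCIe bandwidth and x86 cycles entirely.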
Figure 3: Via a set of software APIs between the x86 CPUs and NFP cores, the treatment of flows can be updated in real-time. This flexibility is essential when a specific portion of a flow is of interest for inspection.
Intelligent Networks at 40 Gbps
The mission-critical nature of our military networks driven by information-centric warfare and homeland security creates an opposing set of forces. Networks need to continue to scale to meet exponentially growing bandwidth demands, and enterprises need the ability to effectively monitor these networks at all packet and content layers with stateful network intelligence. To meet these needs, a distributed, multi-chip, heterogeneous multicore architecture is required, providing specialized workload processing to effectively scale applications to 40 Gbps and beyond.
Daniel Proch is director of product management at Netronome responsible for their line of network flow engine acceleration cards and flow management software. He has 14 years of experience in networking and telecommunications spanning product management, CTO’s office, strategic planning, engineering, and technical support. Previously, Daniel was with FORE Systems and remained with the organization through acquisitions by Marconi and Ericsson. Daniel has a BS in Mechanical Engineering from Carnegie Mellon and an MS in Information Science and Telecommunications from the University of Pittsburgh. He can be contacted at [email protected].
Netronome 724-778-3290 www.netronome.com