RoCE protocol brings 40 Gbps Ethernet to VxWorks
March 16, 2015
Network performance can be an issue for VxWorks, especially in demanding applications like signal/image/radar processing. However, designers can get full-speed 40 Gbps Ethernet out of VxWorks.
For many defense and aerospace system integrators, Wind River's VxWorks retains its stature as the operating environment of choice thanks to the real-time operating system's (RTOS's) trusted and proven deterministic performance. One area of frustration for system designers, though, is VxWorks' network performance when used with faster variants of Ethernet. VxWorks works well with Gigabit Ethernet (GbE) links, since the VxWorks network stack can sustain full line-rate performance using protocols such as TCP and UDP. As the network pipe widens to 10 GbE or 40 GbE, however, throughput becomes limited to around 4-5 Gbps even with today's fastest processors, due to inherent limitations of the network stack architecture. At these speeds, VxWorks network performance often falls far below that of competing Linux systems.
For many demanding embedded applications, such as signal/image/radar processing, VxWorks' limited 40 GbE throughput can reduce its appeal, while Linux may not offer the real-time responsiveness of VxWorks. For designers who prefer VxWorks, and for customers with extensive legacy investment in pre-existing VxWorks applications – sometimes measured in millions of lines of code – there is strong demand for faster Ethernet performance under VxWorks.
The good news is that now there's a way to achieve full-speed 40 Gbps Ethernet over VxWorks on today's highest performance 4th generation Intel Core i7 processor-based OpenVPX digital signal processor (DSP) engines and single-board computers (SBCs). The solution is an alternative network stack that makes use of the high performance, low latency, and low CPU overhead features of the RDMA over Converged Ethernet (RoCE) protocol. Until now, RoCE (pronounced "Rocky"), developed by the InfiniBand Trade Association (IBTA) in mid-2010 to ensure efficient, low-latency data transfers between servers in commercial data centers, has been available only for the Linux community.
The RoCE-based stack's breakthrough performance – almost 10 times that of the standard VxWorks Ethernet interface – results from porting an ibverbs interface to the board's standard VxWorks board support package (BSP). Use of the ibverbs API enables the system integrator to program the board's Mellanox ConnectX-3 40 Gigabit Ethernet network device to deliver previously unobtainable levels of bandwidth with VxWorks. The key is remote direct memory access (RDMA), which supports direct data movement between the application memories of multiple CPUs without requiring CPU involvement in the transfer. Previously, RDMA could only be used with InfiniBand fabrics, but RoCE brings RDMA to Ethernet networks. Using RoCE, Ethernet handling overhead is reduced to near-zero (~1 percent). According to the IBTA, RoCE "provides the best of both worlds: the familiarity and ubiquity of Ethernet combined with the features and efficiencies of InfiniBand."
With a RoCE-enabled BSP, boards can achieve near full-line-rate 40 Gbps data throughput over each Ethernet port (recent measurements: ~38.7 Gbps over a single 40 GbE Ethernet port). This solution also eliminates the need for additional components, such as a TCP offload engine (TOE) to boost Ethernet performance over VxWorks.
While a TCP/IP network can be used to send large amounts of data, the need to inspect each Ethernet packet can detrimentally affect system latency. In critical defense applications, latency can be a matter of life and death. RDMA delivers the lowest possible latency for bulk data transfers. Compared to TCP, RDMA is ideal for moving large bulk data from one processing node to another where maximum bandwidth, minimal latency, and minimal CPU overhead are desired. RoCE uses RDMA to transfer large amounts of data directly from one node's application memory to the application memory of another, thereby bypassing the network stack and its processing speed limitations. This technique eliminates the need for the CPU to "process" data messages transferring between nodes. Such bulk data transfers are typical of high-performance embedded computing (HPEC) systems, where massive amounts of data are processed and moved between nodes for applications such as radar, graphics, and image processing.
TCP and UDP remain the better choice when two nodes need to exchange interprocess communications that require CPU interpretation, such as messaging, coordination, and signaling. Fortunately, RoCE data is packaged and transmitted as standard Ethernet packets, so both types of traffic, RoCE and TCP/UDP, can coexist across the same Ethernet pipe.
Examples of OpenVPX products available today that support a RoCE-enhanced VxWorks BSP include Curtiss-Wright's Fabric40-based CHAMP-AV9 digital signal processing engine and the VPX6-1958, a Haswell-based single-board computer (Figure 1). Both products feature 10 Gbaud Fabric40 performance, developed by the company's recently announced "Bicycle Shop" technology incubator. The RoCE-enabled software driver is provided at no additional cost in the boards' standard VxWorks BSP.
Figure 1: Curtiss-Wright's 10 Gbaud Fabric40-based CHAMP-AV9 multiprocessor 4th-gen Core i7 ("Haswell") DSP engine and VPX6-1958 Haswell-based SBC support a RoCE-enhanced VxWorks BSP.
Marc Couture, Senior Product Manager, Intel, PowerPC, and GPGPU-based DSPs, ISR Solutions Group
Aaron Frank, Product Marketing Manager, COTS Switching/Routing, Intel SBC
Curtiss-Wright Defense Solutions www.cwcdefense.com