Military Embedded Systems

Covering fault scenarios in mission-critical military Ethernet applications

June 21, 2024

Richard Tse

Microchip Technology

Time-sensitive networking (TSN) can be used in military systems to detect and contain the propagation of faults and to enhance the availability and integrity of time in an Ethernet network. When combined with hardening strategies, TSN enables Ethernet to be used for mission-critical applications in harsh environments.

Ethernet is rapidly being deployed as the primary networking technology for both military and commercial systems. To take part in mission-critical applications, Ethernet must be resilient to faults and, because it is a networking technology, this resilience must cover device-level faults as well as interdevice faults. While the probability of a device-level fault can be reduced by using hardened devices, mitigating the consequences of device-level and interdevice faults requires higher-level functions. Time-sensitive networking (TSN) provides such functions for Ethernet.

Ethernet is ubiquitous because it is simple, inexpensive, flexible, and constantly improving. However, its lack of intrinsic resilience to faults has limited its use in mission-critical applications. In harsh-environment scenarios, hardening strategies are used for mission-critical applications to reduce the probability of faults in a device. These device-level strategies do not, however, enhance a network’s ability to deal with the consequences of a fault when one inevitably occurs. Higher-level mechanisms are needed for this.

The Time-Sensitive Networking (TSN) Task Group of the Institute of Electrical and Electronics Engineers (IEEE) 802.1 Working Group has a charter to enable determinism in Ethernet networks. This determinism covers packet latency, latency variation, and loss, and it also improves resilience to faults. The IEEE P802.1DP project is defining a profile of TSN for aerospace onboard Ethernet communications, enabling Ethernet to service mission-critical applications. This profile might also be appropriate for the use of Ethernet in military applications like the U.S. Army’s Ground Combat Systems Common Infrastructure Architecture (GCIA), which integrates multiple systems and eliminates redundancies in service vehicles by using open architecture.

Traffic management and policing

Mission-critical networks are highly managed: The bandwidth used by each Ethernet stream (a unidirectional flow of data from a talker to a listener) is planned and constrained so that the combination of all streams does not cause congestion in the network. A single-event upset (SEU) could result in a “babbling” transmitter, which sends more traffic on one of a device’s streams than was planned and congests the network. This congestion could cause increased latency, latency variation, and packet loss on other streams. To prevent this, TSN offers per-stream transmission management at each transmitting port and per-stream policers at each receiving port.

One form of TSN transmission management is the credit-based shaper algorithm, known as CBS (which is specified in 8.6.8.2 of IEEE 802.1Q-2022). The CBS function uses credits to create a stream whose bandwidth usage is steadily paced and constrained. Credits accumulate at a configured rate, X, when no frame from that stream is being transmitted. While a frame from that stream is being transmitted, credits change at a rate of X – Y, where Y is the rate of the Ethernet link; because X is less than Y, this rate is negative and credits are consumed. Transmission of a frame from a stream is not initiated unless the stream has a non-negative credit value.
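The pacing effect of the credit rules can be seen in a minimal Python sketch. It is an illustration under simplifying assumptions, not the algorithm as specified in IEEE 802.1Q: the stream always has frames waiting, no other traffic competes for the port, and all frames are the same size. The function name `cbs_start_times` and its parameters are hypothetical. Rates are in bits per nanosecond, which is numerically equal to Gbps.

```python
def cbs_start_times(frame_bits, idle_rate, link_rate, n_frames):
    """Start times (ns) of n_frames queued frames under a simplified CBS.

    idle_rate is X and link_rate is Y from the text, in bits/ns (= Gbps).
    Assumes the stream is always backlogged and nothing else uses the port.
    """
    send_rate = idle_rate - link_rate      # X - Y, negative because X < Y
    credit = 0.0                           # credit in bits
    now = 0.0                              # time in ns
    starts = []
    for _ in range(n_frames):
        if credit < 0:
            # Not eligible: wait while credit climbs back to zero at rate X.
            now += -credit / idle_rate
            credit = 0.0
        starts.append(now)
        tx_time = frame_bits / link_rate   # time on the wire
        credit += send_rate * tx_time      # credit drains during transmission
        now += tx_time
    return starts


# 12,000-bit frames, X = 5 Gbps reserved on a Y = 10 Gbps link.
print(cbs_start_times(12_000, 5, 10, 3))   # [0.0, 2400.0, 4800.0]
```

Each frame occupies the link for 1,200 ns and is followed by a 1,200 ns recovery gap, so the stream averages 12,000 bits every 2,400 ns, which is the configured 5 Gbps.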

The policer that corresponds to CBS is the per-stream filtering and policing (PSFP) flow metering function (which is specified by 8.6.5.5 of IEEE 802.1Q-2022). It discards frames from a stream when the stream exceeds its configured average or peak bandwidth rate.
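A rough way to picture flow metering is a token bucket, sketched below in Python. This single-bucket model is a simplification chosen for illustration and should not be read as the metering algorithm defined in the standard; the class name `FlowMeter`, its parameters, and the traffic values are all hypothetical. Time is in nanoseconds and rates are in bits per nanosecond (numerically equal to Gbps) so that the arithmetic stays in integers.

```python
class FlowMeter:
    """Simplified single-bucket policer (illustrative, not the 802.1Q meter)."""

    def __init__(self, rate_bits_per_ns, burst_bits):
        self.rate = rate_bits_per_ns   # long-term rate the stream may use
        self.burst = burst_bits        # bucket depth: tolerated burst
        self.tokens = burst_bits       # bucket starts full
        self.last_ns = 0

    def accept(self, arrival_ns, frame_bits):
        """Return True to forward the frame, False to discard it."""
        elapsed = arrival_ns - self.last_ns
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_ns = arrival_ns
        if frame_bits <= self.tokens:
            self.tokens -= frame_bits
            return True
        return False


def forwarded_count(meter, interval_ns, frame_bits, n_frames):
    """Offer n_frames evenly spaced frames and count how many are forwarded."""
    return sum(meter.accept(i * interval_ns, frame_bits) for i in range(n_frames))


# A babbling talker sends 12,000-bit frames every 2,000 ns (6 Gbps) into a
# meter configured for 5 Gbps; a healthy talker sends every 2,400 ns (5 Gbps).
babbling = forwarded_count(FlowMeter(5, 24_000), 2_000, 12_000, 600)
healthy = forwarded_count(FlowMeter(5, 24_000), 2_400, 12_000, 600)
print(babbling, healthy)
```

In this sketch the healthy talker loses no frames, while the babbling talker has enough frames discarded that the forwarded traffic cannot exceed the configured rate plus the burst allowance.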

Another form of TSN transmission management is commonly known as the time-aware shaper, or TAS (which is specified in 8.6.8.4 of IEEE 802.1Q-2022). TAS creates time-division multiplexed transmission windows and places frames from a stream (or from a set of streams) into their designated windows. Streams in one window are isolated from streams in other windows.
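The window placement can be sketched in Python as follows. The sketch assumes a single repeating cycle with one window per stream, which is only one possible schedule shape; the function name `next_tx_start` and all timing values are hypothetical. It computes the earliest time at which a frame can start so that it finishes before its window closes.

```python
def next_tx_start(now_ns, frame_ns, cycle_ns, open_ns, close_ns):
    """Earliest start time >= now_ns at which the frame fits in its window.

    The window spans [open_ns, close_ns) within every cycle of cycle_ns.
    frame_ns is the time the frame occupies the link.
    """
    if frame_ns > close_ns - open_ns:
        raise ValueError("frame is longer than its window")
    cycle_start = (now_ns // cycle_ns) * cycle_ns
    offset = now_ns - cycle_start
    if offset < open_ns:
        # Window has not opened yet in this cycle: wait for it.
        return cycle_start + open_ns
    if offset + frame_ns <= close_ns:
        # Window is open and the whole frame still fits: send now.
        return now_ns
    # Too late in this cycle: hold the frame for the next window.
    return cycle_start + cycle_ns + open_ns


# Cycle of 1,000 ns with the stream's window at 200-500 ns; frame lasts 100 ns.
print(next_tx_start(0, 100, 1_000, 200, 500))     # 200
print(next_tx_start(250, 100, 1_000, 200, 500))   # 250
print(next_tx_start(450, 100, 1_000, 200, 500))   # 1200
```

The third call shows the isolation property: a frame that would spill past the end of its window is held until the next cycle instead of intruding on the window that follows.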

The policer that corresponds to TAS is the PSFP stream gating function (which is specified by 8.6.5.4 of IEEE 802.1Q-2022). It discards any frame that is not entirely contained within its designated window.
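On the receiving side, the containment rule described above reduces to a simple check, sketched here in Python. The cyclic window layout, the function name `gate_passes`, and the arrival times are assumptions made for illustration; the check follows the description in this article and is not taken from the text of the standard.

```python
def gate_passes(start_ns, end_ns, cycle_ns, open_ns, close_ns):
    """True if a frame received over [start_ns, end_ns] lies entirely
    inside the window [open_ns, close_ns) of the cycle it starts in."""
    cycle_start = (start_ns // cycle_ns) * cycle_ns
    return (cycle_start + open_ns <= start_ns
            and end_ns <= cycle_start + close_ns)


# Window at 200-500 ns of a 1,000 ns cycle; each frame lasts 100 ns.
# A fault shifts the talker's schedule late by 350 ns from the third frame on.
planned_starts = [200, 1_200, 2_200, 3_200]
actual_starts = [200, 1_200, 2_550, 3_550]

for start in actual_starts:
    verdict = "forward" if gate_passes(start, start + 100, 1_000, 200, 500) else "discard"
    print(start, verdict)
```

The two misaligned frames are discarded at the first receiver, so they never reach the transmission windows that belong to other streams further downstream.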

Figure 1 shows sets of CBS, TAS, and PSFP functions associated with some Ethernet devices. The ordering of the CBS, TAS, and PSFP functions shows that a fault in a transmitter that causes excess transmission bandwidth and/or misaligned transmissions on a stream is immediately policed at the adjacent downstream receiver, which stops the fault from affecting other streams.

As seen in Figure 1, if the CBS function for stream 1 erroneously transmitted at 6 Gbps and was not policed, the 10 Gbps egress port of Ethernet Switch 1 would be overwhelmed with 11 Gbps of traffic. The resulting congestion on this egress port could affect streams 2 and 3. Fortunately, the PSFP flow metering function in Ethernet Switch 1 polices stream 1 and limits it to 5 Gbps.

[Figure 1 ǀ CBS, TAS, and PSFP functions in Ethernet devices.]

Conversely, a good transmitter combined with a faulty policer would not have a deleterious effect on other streams because the transmitted traffic is already behaving properly.

Lastly, the CBS and TAS functions are in a different device from their corresponding PSFP flow metering and stream gating functions. It is improbable that both the transmission and policing functions would suffer faults on the same Ethernet stream at the same time.

Time availability and integrity

Many modern applications are time-sensitive, so the sharing of a common time-of-day (ToD) is important. For mission-critical applications, the availability and the integrity of this ToD are vital.

ToD can be distributed over Ethernet with sub-microsecond accuracy using the mechanisms from IEEE 802.1AS, Timing and Synchronization for Time-Sensitive Applications. These mechanisms are commonly referred to as PTP [Precision Time Protocol].

While PTP has been used successfully in commercial applications, its lack of inherent abilities to immediately overcome a fault and to check the integrity of the received ToD has limited its use in mission-critical applications. These issues are addressed by work that was started for IEEE 802.1DP and is, ultimately, expected to be finished in an amendment of IEEE 802.1AS. The concepts of this work are discussed here.

A typical connection from a PTP Grandmaster Clock (GM) to a clock target is shown in Figure 2. The availability of the ToD could be temporarily lost if one of the PTP Relay Instances or one of the links fails and causes the connecting path to the clock target to change. The integrity of the ToD at the clock target cannot inherently be trusted because it could be undetectably corrupted by any of the PTP Instances in the connecting path or even by the GM itself.

[Figure 2 ǀ Typical connection from a GM to a clock target.]

To increase availability and enable integrity, multiple (two or three) GMs connect to the clock target through independent paths and through a fault-tolerant timing module (FTTM, not yet standardized). An example with three GMs distributing ToD independently through multiple PTP Relay Instances, PTP End Instances, and an FTTM to the clock target is shown in Figure 3.

[Figure 3 ǀ Independent connections from three GMs to an FTTM and clock target.]

This solution increases availability because multiple instances of ToD are available through the FTTM to the clock target. If one is lost due to a fault, another is still available.

Additionally, this solution enhances integrity because the multiple ToDs can be compared. An error in one can be detected by its deviation from the other(s). The GMs are synchronized to each other using a time-agreement generation function that is resilient to Byzantine faults, thus keeping them independent of each other. [Note: A Byzantine fault is a condition of a computer system, particularly a distributed computing system, in which components may fail and there is imperfect information on whether a component has failed.]

Use of independent network paths eliminates potential common mode failures in the ToDs delivered to the FTTM. The FTTM compares ToDs, determines which ToDs can be trusted, and selects a trusted ToD for the clock target.
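The compare-and-select idea can be illustrated with a small Python sketch. Because the FTTM is not yet standardized, the voting rule below (trust the readings that lie within a tolerance of the median, and require at least two that agree) is purely an assumption made for illustration, as are the function name `select_trusted_tod`, the GM labels, and the tolerance value. It is not the FTTM algorithm.

```python
from statistics import median


def select_trusted_tod(tods_ns, tolerance_ns):
    """Pick a ToD from readings taken at the same instant.

    tods_ns maps a GM label to its ToD reading in nanoseconds.
    Returns (selected_tod, trusted_labels); selected_tod is None when
    fewer than two readings agree, i.e. no reading can be trusted.
    """
    if len(tods_ns) < 2:
        return None, []
    mid = median(tods_ns.values())
    trusted = sorted(label for label, tod in tods_ns.items()
                     if abs(tod - mid) <= tolerance_ns)
    if len(trusted) < 2:
        return None, trusted
    return median(tods_ns[label] for label in trusted), trusted


# Three GMs: GM3 has been corrupted and is 500 microseconds off.
readings = {"GM1": 1_000_000_000, "GM2": 1_000_000_040, "GM3": 1_000_500_000}
print(select_trusted_tod(readings, tolerance_ns=100))

# Two GMs that disagree: the error is detected, but neither can be trusted.
print(select_trusted_tod({"GM1": 0, "GM2": 1_000}, tolerance_ns=100))
```

With three sources, the sketch can both detect the faulty ToD and keep delivering time from the two that agree. With only two sources, a disagreement can be detected but the faulty one cannot be identified.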

Richard Tse is an Associate Fellow-Architecture at Microchip Technology with extensive experience in semiconductor devices and network architectures. He was the chair/editor of IEEE 1914.3, Radio over Ethernet Encapsulations and Mappings, and contributes to IEEE 802.1, IEEE 802.3, and IEEE 1588 standards, with multiple contributions on the fault-tolerant timing module for TSN. He holds an M.Sc. in electrical engineering from the University of Alberta in Canada.

Microchip Technology • https://www.microchip.com/
