Should we put up with software that doesn't work?


September 02, 2011

Robert Dewar

AdaCore

We are used to software that dismally fails. What is surprising is that we accept this as reasonable. It is time to stand up and say we are not going to put up with this anymore. There is no excuse for junk software.

To me, it is surprising how tolerant people are in general of the idea that complex software cannot be expected to work reliably. If we open the news stories for any particular week, we are almost certain to find one story about some computer error that resulted in an undesirable outcome. One such story that caught my eye recently had the headline “450 ‘high-risk’ prisoners released in California after computer glitch.”

One thing that strikes me about this headline is the use of the word “glitch.” One online dictionary defines “glitch” as: “A minor malfunction, mishap, or technical problem; a snag.” Oh really? Minor? What is happening here is a very typical acceptance of the idea that computer programs cannot be expected to work reliably. Anyone using a Windows-based computer is probably used to the software crashing regularly or otherwise malfunctioning, and they come to expect this level of unreliability.

This dismal level of expectations is not just something that affects non-experts. At an NYU computer science meeting, one of the faculty members teaching programming casually mentioned that all large computer programs contained serious errors. On another occasion, I listened to a presentation from an eminent law professor from Yale Law School arguing that product liability standards needed to be modified for software because all software contains errors; so it is unreasonable to hold manufacturers to “normal” standards of product reliability. Indeed, in other areas we do expect much greater reliability. If a toy manufacturer made a mistake that resulted in dangerously high levels of arsenic in some toy, I very much doubt that the manufacturer would try to pass this off as a “glitch.”

Let’s take another story from recent headlines. A nefarious group of characters (perhaps associated with the Chinese government if reports are to be believed) has allegedly launched an attack on computers belonging to high government officials in the White House and elsewhere. On closer inspection, these attacks appear to be principally instances of “spear phishing,” that is, phishing messages specially tailored to the recipient. So the recipient might, for instance, get a message supposedly from the boss asking the recipient to look over the attached minutes from a meeting. Recipients clicking on this attachment may find their computers compromised with dangerous malware. Now most of us react to such news by wondering how these high government officials can be so “stupid” as to click on these messages. But is that fair? Shouldn’t we perhaps be taking a closer look at the defective design and implementation of the underlying operating systems that allow such simple-minded attacks to succeed?

When safety matters

Clearly what accounts for this low level of expectation is the experience people have day to day with computers. But actually if they really knew all the interactions they have with computers, they might either be scared out of their wits, or perhaps be impressed that not all computer programs are unreliable.

In particular, whenever anyone climbs aboard a modern plane, they are trusting their lives to millions of lines of complex avionics code. Yet we have a remarkable safety record here: No life has been lost on a commercial airliner because of a bug in the implementation of avionics code. How is this achieved? By the use of rigorous techniques embodied in the DO-178B standard. We have not achieved perfect reliability (we have had some close calls), but certainly software is not the weak link in the chain when it comes to airline safety. Now I should mention in the interests of full disclosure that AdaCore is in the business of supplying tools for building software of this kind, so we have a vested interest in reliable software. I don’t apologize for that. On the contrary, I would like to see us and other companies like us succeed in convincing people to use DO-178B or similar software reliability techniques in other areas.

What about other areas? Two examples that come to mind immediately are modern automobiles and medical instruments. In the case of cars, there are two surprises. The first is that typical modern cars often have more lines of code aboard than modern commercial aircraft (Figure 1). The second is that nothing like the rigorous DO-178B standard is applied to automobile software, which is largely regarded as proprietary and not subject to the same kind of outside scrutiny. Have software bugs caused car accidents? We really don’t know since manufacturers maintain a high level of secrecy. Should we worry about the future? Well, I certainly do.

Figure 1: Millions of lines of code onboard: Do you trust your car?


Similarly for medical instrumentation, we have very complex software at work, which is also not controlled by rigorous safety standards. Have people died because of defects in such software? Yes, a number of deaths have resulted from radiation instruments delivering excessive radiation because of programming errors. It would seem that we need to tighten up the controls in this area considerably. But surprisingly, the reaction to these deaths has been somewhat muted. I think this is due to the general malaise of thinking that it is to be expected that software will have errors, so we can’t get too upset about it.

Room for improvement

Going back to the avionics example, which we held up as an example showing that high reliability can be achieved, it is important to repeat that we are not perfect. At least two issues remain. First of all, we still do find bugs occasionally. So far these have not been fatal, although, as mentioned, we have had close calls – including the case of a Malaysian flight where the engines had to be turned off and restarted mid-flight because of a software defect not caught by the DO-178B certification process. No one was injured, but the plane lost 15,000 feet in altitude before the restart could be accomplished, and for sure there were a lot of scared crew and passengers. So we need to do even better.

We do have ways to further improve the process. In particular, the use of formal mathematical methods is getting much more practical. As an example, the iFACTS system in England, which provides new land-based air traffic control, uses software that has been formally proved to be free of runtime errors (the kind of thing that leads to the pervasive buffer overrun problems that plague C and C++ programs). This software is written in an Ada-based language called SPARK, which is particularly conducive to mathematical reasoning, and Altran-Praxis has produced a suite of tools that allows this kind of approach. Other companies are pursuing similar mathematical formal approaches. So we definitely have paths to further improvement. It’s also important to note that these techniques apply perfectly well to commercial off-the-shelf (COTS) software; there is no need to think that reliable software requires expensive customized approaches.

Secondly, we need to note that figuring out how to write highly reliable software does not guard against the situation in which someone makes an error in the specification of the software, and the resulting program faithfully does the wrong thing. An example from the avionics industry is a September 1993 accident in Poland. During the landing, the aircraft’s spoilers did not deploy in time to prevent a crash in which several lives were lost. In this incident, the software was faithfully implementing a requirement that the spoilers could not be deployed unless the airplane was on the ground, as indicated by its wheels turning or having weight on the landing gear. Unfortunately, the runway was wet and the wheels were skidding and not turning, and wind shear caused a light landing. In retrospect, the specification was incorrect, and the pilot should have been able to override the wrong decision of the software.
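To make the point concrete, here is a minimal sketch of the requirement as described above, written in Python purely for illustration. It is not the actual avionics logic: the names GroundSensors, on_ground, spoilers_allowed, and spoilers_allowed_with_override are hypothetical, and the real system certainly involved more conditions than two boolean flags. What the sketch shows is that code can implement a specification faithfully and still do the wrong thing when the specification overlooks a case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundSensors:
    """Hypothetical sensor snapshot; the field names are illustrative only."""
    wheels_turning: bool
    weight_on_gear: bool


def on_ground(s: GroundSensors) -> bool:
    # The requirement as described in the text: the airplane counts as
    # "on the ground" if its wheels are turning or there is weight on
    # the landing gear.
    return s.wheels_turning or s.weight_on_gear


def spoilers_allowed(s: GroundSensors) -> bool:
    # Spoilers may be deployed only when the airplane is judged to be
    # on the ground. This implements the stated requirement exactly.
    return on_ground(s)


def spoilers_allowed_with_override(s: GroundSensors, pilot_override: bool) -> bool:
    # Assumed revision of the requirement (not from any real system):
    # the pilot can overrule the sensor-based decision.
    return pilot_override or on_ground(s)


# Wet runway with skidding wheels and a light touchdown:
# neither indication is present, although the airplane has landed.
wet_runway_landing = GroundSensors(wheels_turning=False, weight_on_gear=False)

print(spoilers_allowed(wet_runway_landing))                      # False
print(spoilers_allowed_with_override(wet_runway_landing, True))  # True
```

The first function has no programming bug at all, which is exactly the problem: no amount of testing against the specification would have flagged it.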

It’s impossible to guard against such errors entirely, but the process of carefully formalizing the specifications as is needed for the certification process helps to find such errors; and for sure, eliminating normal programming bugs would go a long way to improving things. That in particular would surely have kept those 450 dangerous criminals locked up where they belong.

Don’t get even, get mad

An important first step is for everyone to decide that it is not acceptable to put up with errors in programs. We need to be much more indignant when these avoidable errors occur. It’s time to echo the sentiments expressed by the character Howard Beale in the movie “Network” in his infamous rant and all say, “I’m mad as hell and I’m not going to take this anymore!”

Dr. Robert Dewar is Cofounder, President, and CEO of AdaCore and has had a distinguished career as Professor of Computer Science at the Courant Institute of New York University. He has been involved with the Ada programming language since its inception and led the NYU team that developed the first validated Ada compiler. He has coauthored compilers for SPITBOL (SNOBOL), Realia COBOL for the PC (now marketed by Computer Associates), and Alsys Ada and is a principal architect of AdaCore’s GNAT Ada technology. He has also written several real-time operating systems. He may be contacted at [email protected].

AdaCore 212-620-7300 www.adacore.com