It’s been nearly a decade since Netscape co-founder and venture capitalist Marc Andreessen said: “Software is eating the world.”
It’s true. These days it is hard to imagine any part of modern life, from ATMs to oil rigs to vacuum cleaners, that isn’t enabled or controlled by code. But with errors, crashes and outright disasters, it can sometimes seem like it’s the software that’s getting eaten.
If it seems like software failures are getting more commonplace, there is a very good reason. Increasing complexity goes unseen by end users, but even seemingly simple applications like Google’s Chrome run to millions of lines of code, and large-scale software in the hands of millions of people exposes the oversights of its developers.
And there have been some catastrophic fails...
Boeing rushed to take-off
What: Boeing’s 737 Max aircraft was supposed to be the next generation of a well-tested workhorse that would help the plane maker compete with Airbus’ A320neo. But two crashes that killed 346 people cast doubt on its design, as well as on the features purported to make it easier for pilots to fly.
Why: Under “undue pressure” to compete with Airbus, Boeing cut costs, including by cutting 2,000 hours from regression testing (the practice of re-running existing tests after a change to confirm that nothing that previously worked has broken). It also let the flight software rely on only a single Angle Of Attack (AOA) sensor, despite warnings, to skirt new regulatory and training requirements, according to a House transportation committee report on the failings. When this AOA sensor sent faulty data to the aircraft software, the plane’s nose was forced downward, and pilots who were unaware of the system’s existence were unable to mitigate the malfunction.
Toyota ‘botched’ software processes
What: Off-duty highway patrolman Mark Saylor and three family members were among the first known victims of fatal high-speed car crashes caused by a jammed accelerator pedal. Eventually, nearly 100 lives were lost. The manufacturer, Toyota, paid a $1.2 billion fine and recalled millions of cars, including Lexus and Pontiac vehicles.
Why: An initial NHTSA investigation found no electrical defect, and instead blamed a combination of the drivers and incorrectly installed floor mats. However, a longer investigation followed and, after 18 months, embedded systems expert Michael Barr revealed the litany of problems he’d uncovered. Toyota’s software had failed to perform run-time stack monitoring, “botched” its worst-case stack depth analysis, relied on “spaghetti” code, left critical variables unprotected from corruption, and more besides. Barr remarked: “There is a process in place for hardware but not software.”
Democrats’ damaged deployment
What: The Iowa Democratic Party decided to tally 2020 Presidential caucus results via a mobile app. After all, moving away from the traditional way of reporting, which was by phone call, seemed like a great idea. But when the count came, it was found that the app, reportedly built for $60,000, had not transmitted results properly. Party workers who then attempted to call in their results were confronted with jammed phone lines, then jammed back-up phone lines, and some eventually delivered their counts manually on pieces of paper.
Why: The party confessed to “coding issues.” Some of those were not found initially because the app was deployed to users through TestFlight and TestFairy, suites designed to give apps to beta testers, rather than through the official iOS and Android app stores. But app-store review by Apple and Google would only have found the most critical of those bugs. The app’s real failure? Bad project management that did not subject the code to proper testing. No wonder the app was deployed to users with a hotline in case it “stalls/freezes/locks up.”
McDonald’s glitchy give-aways
What: Not all software issues kill (people or democracy). In 2019, two Australians made headlines for a trick they discovered with the pricing at the fast food giant’s ordering kiosks that netted them a free burger. It wasn’t a hack, as it was widely called. They just exploited the system to their own benefit. But a quick search of the internet reveals this isn’t the only time McDonald’s has unwittingly given food away. It seems users of the app are often able to add items priced at $0.00 to their carts.
Why: There is no such thing as “system failure,” only human failure. The Australians simply did the math - if the reductions applied to an item add up to more than its price, you get free stuff. In the case of the app, it seems the freebies happen when someone either doesn’t program a price for an item or neglects to disable ordering for an item that isn’t on the store menu.
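To make that concrete, here is a minimal sketch in Python of the kind of guard that would catch all three slips: an item with no price, an item that is not on the store's menu, and reductions that exceed the price. Everything in it is hypothetical, including the function name `checkout_price`, its parameters, and the rule that no item may ever reach $0.00. It is not how McDonald's systems work, only an illustration of checking the arithmetic before trusting it.

```python
from decimal import Decimal


def checkout_price(list_price: Decimal, reductions: Decimal, on_menu: bool) -> Decimal:
    """Return the price to charge, or raise if the order looks like a pricing mistake.

    Hypothetical rule for this sketch: nothing is ever meant to be free.
    """
    if not on_menu:
        # Someone forgot to disable ordering for an item this store does not sell.
        raise ValueError("item is not on this store's menu")
    if list_price <= 0:
        # Someone forgot to program a price for the item.
        raise ValueError("item has no price configured")
    if reductions < 0:
        raise ValueError("reductions cannot be negative")
    if reductions >= list_price:
        # The reductions add up to at least the price: free (or negative) food.
        raise ValueError("reductions would make the item free")
    return list_price - reductions


# A normal order goes through; the result is list price minus reductions.
price = checkout_price(Decimal("5.00"), Decimal("1.50"), on_menu=True)

# A suspicious order is stopped before it reaches the till.
try:
    checkout_price(Decimal("5.00"), Decimal("6.00"), on_menu=True)
except ValueError as err:
    rejected_reason = str(err)
```

A real store may well want some items to be free, so the final rule is a design choice rather than a universal one. The point is that the decision gets made on purpose, in one place, instead of by accident.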
Compensating for failure
As these examples show, errors happen for a lot of different reasons. Sometimes there’s too little time and not enough resources. Other times, it may be from lack of will. Or excess optimism. And it may sound funny, but, honestly, tech teams would be served well by adding more pessimism into their thinking.
Yahoo puts it a little differently - members of its security team were christened “Paranoids” - after all, worrying about breaches is the best way to prevent them. But, after a company I worked for was acquired by Yahoo, I saw first-hand how coders well beyond the security team practiced “paranoia” programming, thinking outside the box about usage scenarios that some people might consider unimaginable but which may be the source of the next major bug.
When you’re paranoid, there is strength in numbers. Pair programming, an agile development technique in which an observer programmer reviews each line of code entered by a primary developer, boosts code quality because “programming out loud” - discussing the merits of each method - leads to a clear articulation of sometimes-complex tasks, reducing the likelihood of bad practice.
And, just as developers can double up, systems integrity can often be enhanced by installing duplicate or triplicate components. In the case of the 737 Max, Boeing allowed its plane’s MCAS software, designed to push down the plane’s nose in certain conditions, to rely on a single Angle Of Attack sensor for automatic activation, despite warnings that MCAS was vulnerable to single-AOA failures. A more rigorous approach, for Boeing and many other kinds of tech teams, would be to allow software to see as many flags and fallbacks as possible.
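The idea of cross-checking redundant inputs can be sketched in a few lines of Python. This is a generic illustration of median voting with a disagreement check, not a description of MCAS or of any real avionics code. The function name `voted_angle` and the 5-degree tolerance are assumptions made up for the example.

```python
from statistics import median
from typing import Optional, Sequence

# Hypothetical tolerance for this sketch; not taken from any real aircraft.
MAX_DISAGREEMENT_DEG = 5.0


def voted_angle(
    readings: Sequence[float], tolerance: float = MAX_DISAGREEMENT_DEG
) -> Optional[float]:
    """Return a trusted reading, or None if the sensors cannot be trusted.

    None means "do not act automatically; alert a human instead".
    """
    if len(readings) < 2:
        # A single source cannot be cross-checked, so refuse to act on it.
        return None
    mid = median(readings)
    # Keep only the sensors that sit close to the median value.
    agreeing = [r for r in readings if abs(r - mid) <= tolerance]
    # Require a strict majority of sensors to agree before trusting the result.
    if len(agreeing) * 2 <= len(readings):
        return None
    return sum(agreeing) / len(agreeing)


# Three sensors, one of them wildly wrong: the faulty reading is outvoted.
trusted = voted_angle([4.0, 4.5, 74.0])

# Two sensors that disagree: no majority, so the software declines to act.
untrusted = voted_angle([4.0, 74.0])
```

With three sensors a single bad reading is simply outvoted. With two, the software cannot tell which one is wrong, but it can at least notice the disagreement and decline to act on its own.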
Developers are only human. But they also sit at the helm of some of society’s greatest and deadliest systems.
As software becomes ever more complex and ever more vital, it is important that programmers accept their fallibility and seek to compensate - before the bugs run riot.
Paul Belevich, founder and CEO, QA Supermarket