Today's networking systems are inherently complex: supporting different protocols and different hardware brands means the software that drives them must be sophisticated and constantly kept up to date.
However, the combination of different protocols and systems leads to millions of permutations that cannot all be tested. Slightly different hardware, running in all kinds of environments, produces subtle variations that can easily trip up the software.
When critical intermittent software bugs surface in a system this complex, it can take weeks of debugging using old-school methods: logging the issue and calling out a software engineer to fix the problem.
For context, look to industries that have succeeded in capturing failures immediately and learning from them: there is no better example than the ‘black box’ flight recorders used in aircraft. Although slightly morbid, given that they record the communications and flight data of aircraft often in peril, the information collected is essential for pinpointing exactly where a critical error occurred. Given how precisely that information can locate an error, why can’t the same ‘black box’ philosophy be applied to software programs susceptible to extremely rare, yet commercially crippling, crashes and mistakes?
The problem with bugs…
Software engineers unanimously agree that there is no such thing as bug-free software. This is true of the simplest programs, such as a basic calculator application, through to the most complex multi-threaded databases that power cloud services across the globe. Complex programs with hundreds of thousands, if not millions, of lines of logic, their execution all closely entwined, mean that many pieces of shipped software are likely to contain bugs.
The trend is not new. For years, managers of software development teams have made trade-offs between the pressure to ship features and code quality: should they spend extra time trying to fix that really annoying bug that only appears once in every 300 runs, or should they stick to the software delivery schedule? Most often, the failure is tossed onto a pile of undiagnosed test results that becomes a backlog, and at some point those failures rear their heads again.
The bug woes of Amazon and Salesforce
Pinpointing bugs is itself a challenge, and a seemingly impossible one when they are virtually irreproducible. During the testing phase, bugs may only subtly affect a program (if they appear at all) and barely show any effect on the outputs. But when they manifest in production, the consequences can be severe for businesses: Salesforce’s CEO had to apologise directly to US users when a file integrity issue made the database inaccessible for days, and Amazon’s database could not handle a minor disruption, causing outages throughout the Amazon network.
In the most extreme examples, this lack of confidence in a debugging environment can mean plummeting stock prices, loss of market share and evaporating customer trust and loyalty.
Applying ‘black box’ theory with record and replay technology
The examples above, of companies failing to debug, may seem to paint a very bleak picture for the integrity of software systems.
However, applying the ‘black box’ philosophy from aviation, there are ready-made solutions. The answer is for quality assurance and software development managers to couple rigorous testing with rigorous debugging. The revolution in testing has already happened, as thousands of automatic tests can be run simultaneously to probe code from many angles, but the debugging revolution is only beginning. If you are a manager who doesn’t want to be in the firing line for the potential loss of a major customer, consider your debugging strategy and what you can do to be more confident that you are not releasing a disaster.
The ability to capture and replay program execution is one solution to the problem of irreproducible test failures, and the only viable way to ensure released software is as clean as it can be. The premise is simple: take an exact recording of a program’s execution, so that you capture an exact replica of a failing run. A recording is a 100 per cent reliable, reproducible test case that offers total visibility into all the factors that led up to (and caused) the crash. This means you no longer need to fear that a sporadically failing, irreproducible test might mean the loss of a $10 million account, because you know the failure can be captured and fixed before it makes it into production.

This is where replaying the recording comes in. Rather than stepping line by line through code to try to identify the exact piece that failed, a better method of interrogating the program is available: one that maximises efficiency and allows developers to debug quickly.
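The record-and-replay premise can be sketched in miniature. The snippet below is a hypothetical illustration only (real record-and-replay tools work at the machine-instruction level, not in application code): it logs every nondeterministic input a function consumes during a live run, so the identical run, including a failing one, can be replayed deterministically later.

```python
import random

class Recorder:
    """Logs every nondeterministic input consumed during a live run."""
    def __init__(self):
        self.log = []

    def next_input(self):
        # Stand-in for any source of nondeterminism: timing, I/O, scheduling...
        value = random.randint(0, 99)
        self.log.append(value)
        return value

class Replayer:
    """Feeds back the recorded inputs, reproducing the run exactly."""
    def __init__(self, log):
        self._inputs = iter(log)

    def next_input(self):
        return next(self._inputs)

def computation(source, steps=10):
    # A toy program whose behaviour depends on its nondeterministic inputs.
    total = 0
    for _ in range(steps):
        total += source.next_input()
    return total

# Live run: record the nondeterministic inputs as they happen.
recorder = Recorder()
live_result = computation(recorder)

# Later, anywhere: replay the identical run from the recording artefact.
replayed_result = computation(Replayer(recorder.log))
assert replayed_result == live_result  # the run is 100 per cent reproducible
```

The design point is that once all nondeterminism is captured, the "one in 300 runs" failure becomes an ordinary, deterministic test case.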
Static and dynamic analysis tools can detect certain classes of problems - for instance, they can help developers find implementation bugs - but they cannot detect all of them. They are of no use for more serious bugs in runtime behaviour, for which only traditional debug methods remain, such as core dumps and log files. Recording and replaying program execution is the obvious sequel to the testing revolution, and it should become the new standard for debugging if managers truly want to prevent the next disaster.
Better yet, if recording tools were more widely used, new and emerging technologies and industries would be able to share learnings and best practice, ensuring the industry tackles costly and potentially dangerous failures together.
Why record and replay technology is here to stay
As alluded to, recording a failed process entirely cuts out the guesswork of where a mistake occurred. The failure is 100 per cent reproducible: a reproducible test case is obtained instantly via a recording artefact, which can be replayed in a reversible debugger to step backwards as well as forwards through the code, so engineers can find their way straight to the problem. This process is vital to pinpointing, and learning from, any acute errors.
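The idea behind stepping backwards can also be sketched in a few lines. This is a toy model, not how production reversible debuggers are built: if the state after every step of a run is captured in a trace, then "reverse step" is simply a matter of looking earlier in that trace, which is exactly what makes working back from a crash to its cause so direct.

```python
def traced_run(steps):
    """Run a toy computation, snapshotting program state after every step."""
    trace = []            # the 'recording': one snapshot per executed step
    x = 0
    for i in range(steps):
        x = x * 2 + i     # the program under debug
        trace.append({"step": i, "x": x})
    return trace

trace = traced_run(6)

# Step forwards to inspect the state after step 4...
state_at_4 = trace[4]

# ...then 'reverse step' back to step 3 just by indexing earlier in the trace.
state_at_3 = trace[3]
assert state_at_3["step"] == state_at_4["step"] - 1
```

With a complete trace, moving backwards through execution is as cheap as moving forwards, which is what lets an engineer start at the crash and walk back to its root cause.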
Ultimately, obtaining a standalone test case of a failure that can be debugged quickly from anywhere offers the networking industry a way to mitigate the risk of networking outages and security breaches that could wreak havoc on customers’ sites, jeopardise customer relationships and impact revenue.
Barry Morris, CEO, Undo