“Make all the mistakes you will, just don’t make the same mistake twice.” I have tried to hold myself and others to this advice throughout my engineering career. After all, we all make mistakes; it’s learning from them that counts, right?
In a SaaS environment, there is one thing that sustains your business’s life: availability. When you’re down or have an outage, you’re losing money and infuriating your customers. None of us can prevent outages 100 per cent of the time, but we can learn from our mistakes and reduce downtime.
One of the best ways to discover what went wrong and, more importantly, what you can do to prevent it from reoccurring is a root cause analysis (RCA).
Like I said, in the SaaS world, availability is like air: neither we nor our customers can live without it. Availability is so basic that most requirements docs don’t even list it, but when there is an issue, watch out! The following table may surprise you.
|Percentage available|Downtime annually|
|---|---|
|99 per cent|88 hours|
|99.9 per cent|9 hours|
|99.99 per cent|53 minutes|
|99.999 per cent|5 minutes|
To have five “9s” of availability, you are allowed only five minutes of downtime (planned and unplanned) per year! It takes a lot of planning, money and thorough RCA to shrink that downtime.
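The figures in the table follow from simple arithmetic, sketched here in Python (assuming a 365-day year):

```python
def allowed_downtime_minutes(availability_pct: float) -> float:
    """Annual downtime budget, in minutes, for a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a 365-day year
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}%: {allowed_downtime_minutes(pct):.1f} minutes/year")
```

Each extra “9” cuts the budget by a factor of ten, which is why every additional nine is so much harder to earn than the last.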
Respond first, take notes, analyse immediately
When you have a production incident that causes an outage, it’s like a medical emergency – it is all hands on deck and you do whatever you need to do to revive the patient. While responding to an incident is another topic altogether, it’s important to put in place processes that document what occurs and what is done in response. Remind your team to take note of what is tried and what works to resolve the issue, so you have this data to review. Resolving the issue is top priority, but without documentation you’ll be doomed to repeat it.
Soon afterward, whether it was a full outage (system down) or a partial one (some features unusable or some users affected), you need to conduct your analysis. RCAs undertaken too long after the incident are not effective, as the incident recedes from memory and loses its sense of urgency.
So, how do you conduct an effective RCA?
A good production-incident RCA will address questions in three buckets: detection, response and prevention.
Detection: How was the issue detected? Did your monitors correctly pick up the issue? How long after the issue started was it detected? Gather all the data around what happened, how and when it was detected and by whom. And note: hearing about it from customers is the worst way to find out.
Response: Were the right people involved? How long did it take for them to respond and resolve the issue? Continuing the medical emergency analogy, did the doctors arrive on the scene and revive the patient quickly? Did they have the tools and access they needed? What mistakes were made, and what could you do better next time?
Is there someone who should have been involved but was overlooked? How was the communication process with customer service and marketing? Did your team provide the necessary information to keep customers informed in a timely way? And the best part… now that you know what needs to be done to address the failure, automate it so it does not require a human to intervene the next time.
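As one illustration of that kind of automation, here is a minimal self-healing check: probe a health endpoint and, if it fails, run the remediation your responders performed by hand. The health URL and the `systemctl restart` command are hypothetical placeholders; substitute whatever your runbook actually prescribes.

```python
import subprocess
import urllib.error
import urllib.request


def is_healthy(health_url: str, timeout: float = 5.0) -> bool:
    """Return True if the service's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def remediate(service_name: str) -> None:
    """Run the documented fix automatically (hypothetical restart command)."""
    subprocess.run(["systemctl", "restart", service_name], check=True)


def check_and_heal(health_url: str, service_name: str) -> None:
    """The loop a human used to perform: detect the failure, apply the fix."""
    if not is_healthy(health_url):
        remediate(service_name)
```

Run on a schedule, a sketch like this turns a documented manual fix into an automatic one, which is exactly the payoff of writing things down during the incident.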
Prevention: The first two buckets are obviously important, but this is the one I am most interested in, because it ties back to never making the same mistake twice. Do you have redundancy for all the layers, physical and logical? Are there defensive programming techniques that would make the system more resilient? Even if the failure happens again, can your system handle it without impact? Are there similar issues that could affect the system? What will you do differently to prevent this from happening again?
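One common defensive programming technique of this kind is retrying transient failures with exponential backoff and jitter, so a brief dependency hiccup never becomes a customer-visible outage. A minimal sketch, where the wrapped operation is a placeholder for any flaky call:

```python
import random
import time


def with_retries(operation, max_attempts: int = 4, base_delay: float = 0.5):
    """Call operation(); on failure, back off exponentially and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delay doubles each attempt; random jitter avoids thundering herds.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

A wrapper like this absorbs the retryable failures while still escalating persistent ones, which is the resilience trade-off the questions above are probing for.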
Remember: availability is a lifeline for SaaS companies, and your root cause analysis will not only prevent the same issue from recurring, it will empower you to pre-empt new issues before they spring up.
Sunil Rajasekar, CTO, Lithium Technologies
Photo credit: Alexander Supertramp / Shutterstock