WhatsApp, what’s down? Downtime you wouldn’t expect in 2017 & what needs to change

Wednesday’s WhatsApp outage says it all: if it can happen to Facebook’s baby, then no site or application is safe. The failure of such a high-profile communications platform highlights the potential ramifications of downtime in an increasingly connected world. With ever more demanding consumers, it has never been more important to properly prepare and test sites and apps to ensure they can cope with the strain. Here are five downtime casualties so far in 2017, and a word of advice for each:

WhatsApp 

On Wednesday, the world’s most popular cross-platform instant messaging application went down, leaving many of its 1.2 billion users unable to send or receive messages. It’s believed that an update to the app caused the downtime. 

What needs to change? 

Testing is a vital process in the development and release of apps. Sending a product to market that doesn’t function correctly means missed business opportunities and potential extra costs for your company. Therefore it’s important to establish an effective test process, allowing you to deliver application updates on time with no dip in quality. 

Automation is key here: if you commit too many man-hours to testing an app manually, updates will arrive late and you are more likely to skip a critical test because of release deadlines. The end result: a broken product that causes widespread problems for your users.

Ryanair 

On March 22nd, Ryanair went down for 8 hours for scheduled maintenance. This left users unable to access the airline’s site to purchase tickets, or to check in online. 

This is the price you pay if you have a site design and mindset that allows you to miss out on 8 hours of high volume e-commerce simply because your site needs an upgrade. Apparently the aim of the upgrade was to improve the site as it attempts to become ‘the Amazon of travel’, but it is unlikely Amazon would shut down for even a minute in order to do a system update, not least because it would upset customers.

What needs to change? 

Most online services today upgrade by running two separate systems side by side, switching traffic transparently once the new system is ready to take over. Ryanair didn’t follow this rule, most likely because of old backend and database dependencies. Hopefully they have now fixed the problem permanently, so that this dilemma never arises again.
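The two-system switch described above can be sketched in a few lines of Python. This is an illustration only, not a description of how Ryanair or any other company deploys: the class `BlueGreenRouter`, the two stand-in sites and the health check are all hypothetical names invented for the example.

```python
class BlueGreenRouter:
    """Sends all traffic to one of two environments (hypothetical sketch)."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"  # assumption: blue starts as the live system

    @property
    def idle(self):
        # The environment that is not currently serving users.
        return "green" if self.live == "blue" else "blue"

    def handle(self, request):
        # Users only ever reach the live environment.
        return self.environments[self.live](request)

    def switch_if_healthy(self, health_check):
        """Promote the idle environment only if it passes the health check."""
        candidate = self.idle
        if health_check(self.environments[candidate]):
            self.live = candidate
            return True
        return False


# Stand-ins for the old and the upgraded site (both made up).
def old_site(request):
    return f"v1 handled {request}"


def new_site(request):
    return f"v2 handled {request}"


router = BlueGreenRouter(blue=old_site, green=new_site)
before = router.handle("/checkin")

# The upgraded system is checked while the old one keeps serving users.
switched = router.switch_if_healthy(lambda env: env("/health").startswith("v2"))
after = router.handle("/checkin")
```

The point of the design is that the old system keeps answering requests until the new one has proven itself. If the health check fails, nothing changes for users.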

Leading SaaS providers have turned constant service into an art form; they never have to perform hard shutdowns. Even when they schedule a maintenance window, they upgrade on the fly while the service stays available – something Ryanair could certainly learn from.

Virgin Money Giving 

The recent website crash before and during the 2017 London Marathon experienced by Virgin Money Giving has already proven to be a very expensive mistake, with the company adding 10 per cent to donations to make up for the site’s unavailability. 

In the short term, it prevented friends, colleagues, and relatives from personally supporting those who were running, which no amount of Virgin Money Giving contribution can replace. Going forward, such an outage will create murmurs of distrust about whether the website will stand up in the future.

What needs to change? 

The fact is, the London Marathon is one of the most in-demand and widely supported events in the world and having a surge of web traffic should come as no surprise. 

On the contrary, it should be expected. Virgin Money Giving should have looked at last year’s peak numbers, estimated the likely growth, and load tested the site against the expected traffic to ensure that it could withstand the pressure when it was needed most. At the very least, there should have been a defined action plan for overload.
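As a rough illustration of that planning step, the sketch below turns last year’s peak into a load test target and a stepped ramp. Every figure in it is made up for the example (a peak of 10,000 concurrent users, 20 per cent growth, 50 per cent headroom); none comes from Virgin Money Giving, and the function names are hypothetical.

```python
def target_concurrent_users(last_year_peak, growth_pct, headroom_pct):
    """Load test target: last year's peak, plus growth, plus safety headroom.

    Percentages are whole numbers so the arithmetic stays exact.
    """
    scaled = last_year_peak * (100 + growth_pct) * (100 + headroom_pct)
    return -(-scaled // 10_000)  # integer division, rounded up


def ramp_steps(target, steps):
    """Evenly spaced load levels ending at the target."""
    return [target * i // steps for i in range(1, steps + 1)]


# Hypothetical inputs, chosen only to show the calculation.
target = target_concurrent_users(last_year_peak=10_000, growth_pct=20, headroom_pct=50)
plan = ramp_steps(target, steps=4)

print(target)  # 18000
print(plan)    # [4500, 9000, 13500, 18000]
```

Ramping up in steps rather than jumping straight to the target shows at which level response times start to degrade, which is more useful than a simple pass or fail.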

Reddit

The internet went crazy a few weeks back when Reddit fell over for a few hours due to scheduled fixes, with users taking to social media to voice their disappointment. Downtime is actually quite a frequent occurrence on Reddit, with the site typically unavailable for very brief periods of maintenance. Other, unexplained, outages could well be due to sudden high traffic peaks.

What needs to change? 

Like many sites, Reddit could do with investing in load testing, to pinpoint performance concerns before they begin to impact end users. Unlike sites dedicated to handling known periods of high volume, Reddit experiences variable and unpredictable volume, so the key is concurrent user testing. 

This determines what it takes to make the site crash. Sites can then plan how to bounce back from an overload situation. Implementing some kind of queue functionality that protects the site from going down is a popular solution.
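One way to picture such a queue is a simple waiting room that caps the number of active users and holds everyone else in line. The sketch below is a minimal, hypothetical example: the class `WaitingRoom` and the capacity of two are assumptions made for illustration, and it does not describe how Reddit or any other site handles overload.

```python
from collections import deque


class WaitingRoom:
    """Admits users up to a fixed capacity and queues the rest (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit found through load testing
        self.active = set()
        self.waiting = deque()

    def arrive(self, user):
        if len(self.active) < self.capacity:
            self.active.add(user)
            return "admitted"
        # Over capacity: hold the user in line instead of overloading the site.
        self.waiting.append(user)
        return f"queued at position {len(self.waiting)}"

    def leave(self, user):
        self.active.discard(user)
        # A freed slot goes to whoever has waited longest.
        if self.waiting and len(self.active) < self.capacity:
            self.active.add(self.waiting.popleft())


room = WaitingRoom(capacity=2)
results = [room.arrive(name) for name in ("anna", "ben", "cara")]
room.leave("anna")  # cara moves from the queue into the site
```

In practice the capacity would come from the crash point measured in the concurrent user test, set with some margin below it.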

British Airways 

Less than a month ago, the British Airways website went down for eight and a half hours due to a crash caused by an overnight database upgrade. Customers were unable to check in or access any flight details via the site. 

What needs to change? 

Planning essential site maintenance for non-peak times is a smart move, but in most cases outages don’t ‘just happen’ - there are symptoms of infrastructure failure that companies should always be looking out for. 

Test processes and failover should be implemented and, more importantly, tested under real conditions. Website and application monitoring plays an active role in gauging performance and can alert companies to problems with their platforms before they become major issues.
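To make the monitoring idea concrete, here is a small sketch of a check that raises an alert when too many recent responses are slow or failing. The class name `ResponseMonitor`, the window of five checks, the 1,000 ms limit and the sample timings are all assumptions chosen for the example, not measurements from British Airways or any real service.

```python
from collections import deque


class ResponseMonitor:
    """Flags trouble when too many recent checks are slow or failed (sketch)."""

    def __init__(self, window=5, slow_ms=1000, max_bad=2):
        self.samples = deque(maxlen=window)  # True means a bad check
        self.slow_ms = slow_ms
        self.max_bad = max_bad

    def record(self, response_ms, ok=True):
        """Store one check result and report whether an alert is due."""
        self.samples.append((not ok) or response_ms > self.slow_ms)
        return self.should_alert()

    def should_alert(self):
        # Wait for a full window so a single slow start does not page anyone.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) > self.max_bad


monitor = ResponseMonitor()
# Hypothetical response times in milliseconds, degrading over time.
alerts = [monitor.record(ms) for ms in (200, 250, 300, 1200, 1500, 1800)]
```

Here the alert fires on the sixth check, when three of the last five responses exceed the limit: the site is still up, but the trend is visible before users see an outage.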

The modern consumer environment is an incredibly demanding one for businesses. Due to the ‘uberisation’ of services, whereby users expect service when and where they want, there is no allowance for a breakdown in supply. 

Downtime just isn’t an option for any company wishing to maintain a strong market reputation, and industry practices and approaches have to change to meet this imperative. Instead of scheduled maintenance, which drives away users and their accompanying revenue, providers must upgrade on the fly. And when it comes to monitoring and testing, their mantra should be one of frequent, rigorous examinations of how their websites or apps will perform under pressure. 

To maintain a competitive edge and keep their customers happy, everyone has to realise that knowledge really is power, and in terms of performance issues, they have to act sooner rather than later.

Sven Hammar, founder and Chief Strategy Officer, Apica