Lessons learned from the AWS outage

Multiple outages continue to prove that the internet is not impervious to failure. The most recent of these, involving Amazon’s S3 storage solution, was caused by something we are all familiar with in computer operations: operator error. The three-hour outage began with symptoms of increased error rates for requests in the US-EAST-1 region. It eventually affected prominent websites and applications like Slack, Trello, Netflix, Reddit, Quora and others.

While no one has achieved the holy grail of 100 per cent uptime, there are best practices that application developers can follow to greatly reduce the likelihood of a business-impacting outage.

Redundancy reduces the risk

Cost savings and scalability versus the traditional data centre are just two of the significant advantages of cloud computing. Major industry players like AWS and Microsoft entice customers to bring an existing application stack to their bundled cloud suite of apps, compute, storage, network and analytics. The convenience of this single-provider solution is outweighed by the potential risk of having a single point of failure in your application stack.

Recently, due to mergers and acquisitions, cloud service providers have been consolidating, creating a potential single point of failure as distributed vendor solutions suddenly become single-vendor solutions. As a result, it is becoming more difficult to diversify the technology stack and move away from a single-threaded setup.

The past few months have made plain the fact that major outages are no longer a rare occurrence. Organisations cannot ignore the financial and reputational impacts of outages; they are well documented. Gartner’s Andrew Lerner suggests that enterprises lose, on average, £240,000 for each hour of unavailability. Using this estimate, the AWS S3 outage cost each affected enterprise roughly £720,000 for three hours of downtime. Last year, Dyn’s 18-hour outage, due to the massive Mirai botnet, likely cost the company’s larger customers approximately £4.3 million each.
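The arithmetic behind those figures is simple enough to rerun with your own numbers. The sketch below assumes a flat hourly loss, which is a simplification; the function name and the constant are illustrative only, and the hourly figure is the Gartner estimate quoted above.

```python
# Average hourly loss quoted above (Gartner estimate), in pounds.
HOURLY_LOSS_GBP = 240_000


def outage_cost(hours: float, hourly_loss: float = HOURLY_LOSS_GBP) -> float:
    """Estimated loss for one enterprise, assuming a flat hourly rate."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * hourly_loss


s3_cost = outage_cost(3)    # the three-hour S3 outage
dyn_cost = outage_cost(18)  # the 18-hour figure quoted for the Dyn outage

print(f"S3 outage:  GBP {s3_cost:,.0f}")
print(f"Dyn outage: GBP {dyn_cost:,.0f}")
```

Swapping in your own hourly revenue or support cost gives a rough first estimate of what an outage of a given length would mean for your business.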

Making high availability work

In light of cost and reputational damage, and because companies want to offer an excellent experience to their customers, the goal is to operate and maintain a highly available application. There are four core components involved:

  • Redundancy – a secondary system waiting in the wings to take over the job if the primary system that performs the same function goes down.
  • Scale – the ability to automatically provision new infrastructure based on load.
  • Routing – directing traffic to the best endpoint, and redirecting it to another endpoint if the primary endpoint is unavailable.
  • Backup and Recovery – the ability to restore data, configuration and other functions to a pre-event state.

Six keys to a highly available application

1. Determine which components require a high-availability setup. 

There are significant technical, operational and financial challenges to building high availability into the application stack. This makes some components better candidates than others. Application developers should develop scoring mechanisms to determine which components to address first based upon: 

  • Cost, time and ease to complete the project.
  • Impact that a failure will have on users, the application and the business.
  • Likelihood that the component will fail.
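One possible scoring mechanism built on the three criteria above is sketched here. The 1-to-5 scales, the formula (impact times likelihood, divided by effort) and the component names are all assumptions made for illustration; a real scheme would weight the criteria to suit the business.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    effort: int      # cost, time and difficulty: 1 (low) to 5 (high)
    impact: int      # effect of a failure: 1 (minor) to 5 (severe)
    likelihood: int  # chance of failure: 1 (rare) to 5 (frequent)


def priority(c: Component) -> float:
    """Risk (impact x likelihood) per unit of effort; higher means sooner."""
    return c.impact * c.likelihood / c.effort


def rank(components):
    """Return components ordered from highest to lowest priority."""
    return sorted(components, key=priority, reverse=True)


# Hypothetical components and scores, for illustration only.
components = [
    Component("dns", effort=2, impact=5, likelihood=2),
    Component("object-storage", effort=4, impact=5, likelihood=3),
    Component("image-resizer", effort=1, impact=2, likelihood=2),
]

for c in rank(components):
    print(f"{c.name}: {priority(c):.2f}")
```

Dividing by effort favours quick wins, which is a design choice; a team more concerned with worst-case impact might rank on impact and likelihood alone.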

2. Consider the risk third parties introduce.

It’s almost a Catch-22: your cloud application is reliant on other cloud applications in order to service your customers. While you may architect your application with the proper routing, scale, redundancy, and backup and recovery systems, if your application leverages a cloud service (and it is likely that it does), that cloud service also needs to follow established best practices for highly available cloud applications.

Risks inherited from cloud service providers (CSPs) that impact your application are, from your customer's perspective, third-party risks. Managing third-party risk, in many ways, comes down to trust. When your application goes offline due to your CSP, you lose trust in your provider – and your customers lose trust in you. Invest in network and application monitoring solutions that can help you evaluate your CSP’s performance.

3. Figure out which data and components go where. 

Not all systems and data should be deployed onto a cloud solution. Your business may have compliance or regulatory considerations, sensitive data or the plain need for more control over your data. This may mean that parts of your application need to be set up in a more traditional or hybrid model.

4. Be prepared for failures. 

When the AWS outage occurred, a disruption happened in only one AWS service in a single region. However, the impact was widespread. Preparing for failures at different levels of your architecture can help you avoid an outage.

  • Preparing for cloud failures can be the most challenging. If your cloud solution is a single point of failure for your network and it goes down, you go down. Where possible, implement a hybrid IT set-up for critical services and, where technically and financially feasible, have a backup cloud service provider.
  • In preparation for zone failures, make sure you have at least two zones and that you are replicating data across those zones. In addition, global load balancing can help you automatically route traffic away from a zone that is down and to a zone that is up.
  • Even with proper maintenance, servers can go down. Ensure you have auto-scaling, internal load balancing and database mirroring in place.

5. Look for advanced DNS capabilities.

To save users from having to memorise IP addresses, DNS uses domain names like “google.com” to send users and application traffic to the proper endpoint. Intelligent DNS solutions can be used to dynamically shift traffic based on real-time user, network, application, and infrastructure telemetry – including if a component of your infrastructure, like AWS S3, goes down. Intelligent DNS will ingest the telemetry and automatically reroute the application’s traffic to a healthy IP address.

Your technology stack should include DNS with health checks and automated traffic management capabilities. Additionally, make sure your DNS does not become a single point of failure. To truly have a highly available cloud application, you need to architect your application in a redundant DNS setup.
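The decision an intelligent DNS service makes at query time can be sketched roughly as follows. This is a simplified illustration and not any provider's actual algorithm: the Endpoint type, the pick_endpoint function and the lowest-latency rule are assumptions, and the IP addresses come from the ranges reserved for documentation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Endpoint:
    ip: str
    healthy: bool      # result of the most recent health check
    latency_ms: float  # recently measured latency


def pick_endpoint(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest latency, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)


# Hypothetical telemetry: the fastest endpoint is failing its health check.
pool = [
    Endpoint("192.0.2.10", healthy=False, latency_ms=20.0),
    Endpoint("198.51.100.20", healthy=True, latency_ms=45.0),
    Endpoint("203.0.113.30", healthy=True, latency_ms=80.0),
]

answer = pick_endpoint(pool)
print(answer.ip if answer else "no healthy endpoint")
```

Here traffic moves to the second endpoint because the first one, although fastest, is marked unhealthy. A production service would also weigh factors such as geography, load and cost.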

6. Get redundant DNS to avoid the single point of failure.

Though next-generation managed DNS systems offer significant built-in redundancy and fault tolerance, all managed DNS providers have experienced problems to some degree, and those problems affect their customers. While it rarely happens, providers can experience a complete loss of service. For this reason, it is vital for enterprises to consider secondary DNS.

Though managed DNS provider availability normally exceeds 99.999 per cent uptime (about five minutes of downtime per year), this top-line number does not provide the detail needed to properly assess the business risk associated with relying on a sole-source provider. It is not clear, for example, what the probabilities and impact are of degraded performance in certain regions or of a system-wide outage of various durations.
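The "five minutes" figure follows directly from the uptime percentage. A minimal sketch of the conversion, assuming an average year of 365.25 days (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Convert an availability percentage into expected downtime per year."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_pct) / 100 * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.999):
    print(f"{pct} per cent -> {downtime_minutes_per_year(pct):.2f} minutes per year")
```

Note that the result is an annual total. It says nothing about whether the downtime arrives as many brief blips or as one long outage, which is exactly the detail the top-line number hides.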

Enterprises can look at this scenario from their own perspective. Think about what a 30-minute loss of DNS would cost your business in terms of revenue, reputation damage, support costs and recovery. Compare that with the cost of a secondary DNS provider. For enterprises whose online services are mission-critical, one outage typically costs roughly 10 times the annual cost of a second service. That would put the break-even point at about one major DNS outage every 10 years.
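The break-even reasoning can be written out as a short calculation. The sketch assumes, for simplicity, that a secondary provider would fully prevent the outage; the monetary amounts are hypothetical and chosen only to match the 10-to-1 ratio described above.

```python
def break_even_years(outage_cost: float, annual_secondary_cost: float) -> float:
    """Years of secondary-DNS fees that add up to the cost of one outage."""
    if annual_secondary_cost <= 0:
        raise ValueError("annual cost must be positive")
    return outage_cost / annual_secondary_cost


def secondary_pays_off(outage_cost: float, annual_secondary_cost: float,
                       years_between_outages: float) -> bool:
    """True if outages are expected at least as often as the break-even interval."""
    return years_between_outages <= break_even_years(outage_cost, annual_secondary_cost)


# Hypothetical figures: one outage costs 10 times the annual secondary fee.
outage = 500_000
annual_fee = 50_000

print(break_even_years(outage, annual_fee))       # years until fees equal one outage
print(secondary_pays_off(outage, annual_fee, 5))  # outage expected every 5 years
```

If you expect a major DNS outage more often than the break-even interval, the second service pays for itself on these assumptions.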

Moving forward with cloud migration

To keep your business up and running online, it is essential to avoid single points of failure in your cloud implementation, your application stack or your DNS service. There are no guarantees, but there are best practices. Keep these in mind as you move forward with a migration to cloud technology:

  • Remember that cloud solutions should employ an architecture designed for routing, scale, redundancy, and backup & recovery.
  • Consider system and network-wide redundancy to avoid any single points of failure.
  • Prepare for an outage at any level of your organisation or cloud implementation.
  • Be aware of potential areas for third-party risk and mitigate them.

Alex Vayl, co-founder and vice president of demand generation, NS1
Image Credit: Amazon