Is your AWS application built for reliability?

(Image credit: Gil C / Shutterstock)

It’s a misconception that the most reliable systems can somehow sidestep all disruptions, misconfigurations, and network issues. In reality, those hiccups will happen – but reliable systems are those architected to be self-healing, resilient, and able to quickly recover from failures. While building such systems is challenging, the AWS Well-Architected Framework’s Reliability Pillar offers particularly valuable best practices and guidelines for creating architectures with strong foundations, consistent change management, and proven processes for failure recovery.

AWS breaks down the work of achieving reliable systems into three segments: 1) understanding your availability needs, 2) designing applications for availability, and 3) operational considerations. Let’s take a look at each, and what the best practices look like.

1. Understanding availability requirements

Availability needs vary: an ETL or batch processing system might only require two-nines availability (allowing for around three-and-a-half days of disruption per year) to be effectively reliable for its purposes, while a critical financial system might demand five-nines availability (allowing just five minutes of disruption annually). Gauging these needs accurately is critical, since meeting higher availability demands can significantly increase service and development costs.
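To make the "nines" concrete, the downtime budget behind each target can be worked out directly. The sketch below is illustrative only: the function name is hypothetical and a 365-day year is assumed. Under that assumption, two nines works out to about 3.65 days per year and five nines to about 5.26 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the budget for two, three, four, and five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} min/year ({minutes / 1440:.3f} days)")
```

Each extra nine cuts the budget by a factor of ten, which is why the cost of meeting a target climbs so quickly.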

Most applications don’t have a singular availability target, but instead are made of several components that each have differing availability requirements. For example, consider an ecommerce application demanding high availability for its customer-facing order system, but with lower requirements for its processing and fulfilment components. AWS recommends carefully evaluating these variables within your application and differentiating your availability goals as appropriate.

AWS divides services into a data plane, which delivers the service itself, and a control plane, which handles less critical configuration activity. Data plane operations include DynamoDB reads and writes, RDS connectivity, and EC2 instance connectivity. Control plane operations include launching EC2 instances, creating S3 buckets, and provisioning RDS instances. Drawing the same distinction within your own application enables your business to focus efforts on the components with the most critical availability needs.

2. Application design for availability

AWS has identified these five key practices for improving application availability:

1) Fault isolation zones.

AWS provides different fault isolation zone constructs (including availability zones and regions) useful for keeping systems available even when individual components fail. Availability zones within a region are close enough to one another to suit workloads that need low latency between components, while separate regions offer stronger isolation at the cost of higher latency between them.

2) Redundant components.

Physical infrastructure must also be designed to avoid single points of failure. Fault isolation zones enable deployment of redundant components, which operate in parallel to increase availability. Where possible, systems should also tolerate failures within an availability zone, through deployment to multiple compute nodes, storage volumes, etc.
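The effect of redundancy can be estimated with a simple probability model. The sketch below is a back-of-the-envelope illustration, not a formula taken from this article: it assumes the redundant copies fail independently and that any single healthy copy can serve the full load, and the function name is hypothetical. Under those assumptions, two copies that are each 99% available yield 99.99% combined.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one of them suffices.

    Assumes failures are independent, which real deployments only approximate
    (for example, by spreading copies across availability zones).
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1 - (1 - component_availability) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} copies at 99%: {parallel_availability(0.99, n):.6f}")
```

The independence assumption is the important caveat: copies that share a power supply, a deployment pipeline, or a bad configuration push will fail together, and the real figure will be lower.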

3) Microservice architecture.

Splitting your application into microservices makes it simpler to focus attention on those with greater availability needs. For this reason, microservices should publish their availability targets. Three key benefits to microservices delineated by AWS include:

  • Designing a particular service to serve a concise business problem owned by a small team allows for better organisational scaling.
  • The team can deploy the service at any time (as long as API and other requirements are met).
  • The team can use any technology stack (again, while meeting all requirements).

Using the distributed compute architecture that comes with microservices does increase operational complexity and the challenge of debugging and achieving low latency. Following the best practices of the Well-Architected Framework and using AWS X-Ray for debugging can help overcome these challenges.

4) Recovery-oriented computing.

Recovery-Oriented Computing (ROC) is a systematic approach to improving failure recovery by utilising isolation and redundancy, the ability to roll back changes system-wide, monitoring and health determinations, diagnostics, automated recovery, modular design, and the ability to restart. In this way, ROC champions rapid failure detection and automated recovery, and avoids creating special cases or rarely-tested recovery paths.

5) Distributed systems best practices.

Following the Well-Architected Framework means building distributed systems, which themselves must be designed to recover from disruptions. To achieve that, follow these best practices:

  • Throttling. Once you understand the capacity of your microservices to handle incoming requests, implement appropriate throttling to make sure the service remains responsive during high load periods. This means users will receive a message to try again later, or the request will fail over to a redundant component (rather than simply failing). Support for request throttling is built into Amazon API Gateway.

  • Retry with exponential backoff. To avoid overwhelming web services with retries, pause after each failed attempt, increasing the length of the pause each time.
  • Fail fast. Instead of queueing up requests during a disruption to later work through (delaying recovery), simply fail fast and return errors when possible.
  • Idempotency tokens. Idempotency means “repeatable without side effects.” Because retries and timeouts are common within distributed systems, issuing API requests with idempotency tokens lets a microservice recognise work it has already completed and return the same response to a repeated request, with no problematic effects.
  • Constant work. Distributing load over time improves resiliency. Commonly, idle time will be occupied with “filler work” to keep loads constant.
  • Circuit breaker. Hard dependencies are dangerous to availability. Avoid them by using a “circuit breaker” to control the flow of requests to a downstream dependency, which is monitored in a loop and toggled if the service has an issue. The dependency is then either ignored and requests are attempted later, or data is replaced from a locally-available store.
  • Bi-modal behaviour and static stability. To avoid cascading failures (driven by systems with different behaviours under normal and failure modes), design components to function effectively in isolation.
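Two of the practices above, retry with exponential backoff and the circuit breaker, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a production implementation: the names `call_with_retries` and `CircuitBreaker`, the thresholds, and the timings are all hypothetical, and the `sleep` and `clock` parameters exist only so the behaviour can be exercised without real waiting.

```python
import time


def call_with_retries(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation`, doubling the pause after each failure (capped at `cap` seconds)."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:  # a real client would catch only retryable errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(min(cap, base * 2 ** attempt))


class CircuitBreaker:
    """Stop calling a failing dependency for `reset_after` seconds and use a fallback."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed and requests flow

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # circuit open: skip the dependency entirely
            self.opened_at = None  # cool-down elapsed: let one trial request through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result
```

In this sketch the fallback stands in for the "locally-available store" mentioned above, for example a cached copy of the last good response. Clients that retry in large numbers often also add random jitter to each pause so that they do not all retry at the same instant; that refinement is left out here for brevity.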

3. Operational considerations for availability

In addition to architectural best practices, operational best practices supporting reliability should be adhered to carefully. These include:

1) Automate deployments to eliminate impact.

Minimise risks during deployment by leveraging automation. This can be done through:

  • Canary deployment. This technique pushes changes to a small number of users and monitors the impact before deploying widely.
  • Blue-green deployment. This involves deploying two complete application versions in parallel and sending a segment of traffic to each deployment.

Most importantly, the deployment process ought to be fully automated, with changes performed through continuous integration pipelines.
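As an illustration of the canary technique described above, one common way to pick "a small number of users" is to hash each user into a stable bucket and route only the lowest buckets to the new version. The sketch below is an assumption about how such routing might be done, not a description of any AWS service; the function name and user IDs are hypothetical.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Return True if this user should be served by the canary release.

    Hashing keeps the assignment stable: the same user always lands in the
    same bucket, so raising the percentage only ever adds users.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    share = sum(in_canary(u, 5) for u in users) / len(users)
    print(f"share of users on the canary at 5%: {share:.1%}")
```

A pipeline would typically start with a small percentage, watch error rates and latency, and then either raise the percentage in steps or drop it back to zero to roll back.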

2) Testing.

Testing how your systems perform under strain is key to understanding your true availability. Unit, load, and performance tests – as well as failure simulations to test your procedures – will provide insightful report cards.

3) Monitoring and alarming.

For effective monitoring, AWS outlines a process of generating data that tracks key components and metrics across every layer of your application. You can then aggregate and store that data with Amazon CloudWatch and Amazon S3, put it to work to enable real-time analytics and alarming, and perform further processing and analytics with services like Amazon Athena, Amazon Redshift Spectrum, and Amazon QuickSight.

4) Operational readiness reviews.

AWS recommends operational readiness reviews (ORRs) to ensure applications are ready for production. ORRs begin with operational requirements, but should also incorporate lessons learned over time, and should be repeated at least annually.

5) Auditing.

Regularly audit your monitoring systems to ensure they are effective.

Conclusion

By following the best practices detailed in the Reliability Pillar of the AWS Well-Architected Framework, you can be confident that you’re implementing battle-tested approaches to designing reliable, highly available applications.

Jonathan LaCour, CTO, Mission