Why DevOps needs ‘Chaos Days’

After years of to-ing and fro-ing, 95% of enterprises are beginning to embrace the cloud. Much of the earlier reluctance came from the boardroom, driven by uncertainty and anxiety over the perceived ‘lack of security’ of the cloud. Since then, engineers have been designing and implementing cloud systems built to withstand huge amounts of pressure and, combined with knowledgeable and experienced CloudOps teams, this has made the cloud safer. Even so, management still needs to see its teams cope when it all goes wrong.

A pioneering method, recently trialled by The Met Office, is the controlled destruction of cloud infrastructure in order to test the capability of a CloudOps team. Over the course of a working day, The Met Office IT teams were allowed to break some critical parts of their application infrastructure (albeit very carefully) to see how their CloudOps teams investigate and fix problems, and what they still have to improve.

The training and app development method is based on Netflix’s Chaos Monkey, an automated tool that randomly breaks small and large parts of Netflix’s cloud environments so that its CloudOps teams can ensure systems are fully redundant and designed to handle outages and other problems. I tweaked the name of the session and the ‘Cloud Chaos Day’ was born.

Why is Chaos needed? 

The Chaos Day grew out of one question: “How ready is our CloudOps team to handle our application in production, before it’s being used by customers?” The way to find out was to break it. The aim was to see what the business could learn from doing so. Simply put, a Chaos Day identifies gaps in knowledge, training or documentation, which can then be resolved before real problems happen.

Teams undertaking a ‘Cloud Chaos day’ should use the same principle as Chaos Monkey, but it is best to simplify the process: manually select areas of the applications to break, tell the CloudOps team about the resulting problems, and let them debug those problems in conjunction with the application development teams.

The overall goals are to identify problem areas and to develop an improvement plan. Whether the problem areas are found in cloud systems knowledge, product experience, product documentation or even correct access, the team needs to know. 

What happened on The Met Office Chaos Day?   

The first step is to break things manually; in the long term, as the CloudOps team develops and improves, it can begin to automate this process. The second step is to draw up a list of potential items to break, involving a range of stakeholders including application developers, cloud engineers and system architects.

These stakeholders can then work closely to identify which parts of the target applications can be broken without production impact. In this instance it is vital to have a tool which produces fully working copies of applications in different accounts, enabling testing to be done without adversely affecting production services.   

The final stage is to plan the day – estimate what will be ‘broken’, how long it might take to fix, what the user report could look like, and more. 
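As a rough illustration of what such a plan might look like when written down, here is a minimal Python sketch. The class name, field names and the two example breakages are all hypothetical and are not taken from The Met Office day; they simply mirror the three estimates mentioned above (what is broken, time to fix, and what the user report could look like).

```python
from dataclasses import dataclass


@dataclass
class PlannedBreakage:
    """One entry in a hypothetical Chaos Day plan."""
    what_is_broken: str         # the change the organisers will make
    estimated_fix_minutes: int  # rough guess at how long the fix might take
    user_report: str            # symptoms as a user might describe them


def total_estimated_minutes(plan):
    """Add up the estimated fix times to check the plan fits into one day."""
    return sum(item.estimated_fix_minutes for item in plan)


# Illustrative entries only; real items come from the stakeholder list.
plan = [
    PlannedBreakage(
        what_is_broken="Remove the HTTPS rule from the web tier firewall",
        estimated_fix_minutes=30,
        user_report="The site times out when I try to log in.",
    ),
    PlannedBreakage(
        what_is_broken="Stop the application service on one instance",
        estimated_fix_minutes=20,
        user_report="I get an error page on roughly every other request.",
    ),
]

planned_minutes = total_estimated_minutes(plan)
```

Keeping the user report alongside each breakage means the symptoms are ready to hand over on the day, and the running total gives an early warning if the plan is too ambitious.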

On the day, get the teams together in a meeting room and present each breakage as a set of symptoms. As the teams work through each problem, make sure the discoveries and learnings from each system or cloud service are noted down. This helps build a fuller picture of the unknown.

With The Met Office, after the first few problems were carefully unravelled, identified and fixed by the team, it was clear that the team were learning a lot and everyone was enjoying the experience!   

One main highlight of the process was the discovery that dividing the team is vital. For example, have your networking specialist attack a problem from one direction and your support specialist from the other, so that they meet in the middle and eventually track down the root cause.

By the time the day is finished, the teams should have some good outputs, including extensive notes, a better understanding of the system and a more cooperative approach to app development.

Learnings from the Chaos: 

Some of the key points that need to be considered before undertaking a Cloud Chaos day are: 

Know your team: If your team is mainly networking specialists, it’s going to be easy for them to find networking problems. If you’ve got a mix, do a range of things so everyone gets a chance to share their knowledge.

Break with care!: In the cloud you can create copies of environments to test these things with. So spin one up. Don’t risk your production data if you don’t need to.   

Make backups that you can quickly restore from: You might break something you didn’t intend to. Make sure you have a rollback and restore plan for every ‘breakage’ you make, so that you can fix any unintended consequences quickly!   

Start simple: In real breakages or accidental changes, simple stuff happens as well. As you see how the team responds, you can increase the difficulty, for example by breaking multiple things at once.

Don’t be tempted to be too clever too early: Remember, the goal is to find areas for improvement, not to defeat your CloudOps team!

Timebox the breakages: Typically, limiting each breakage to about 30-45 minutes helps keep people engaged and focused.

Audit tools are sort of ‘cheating’: Audit tools such as AWS CloudTrail can be your undoing with a clever team looking for changes. You can avoid this somewhat by using different users, or by having something such as a Lambda function or a script on an instance trigger the changes. However, ultimately you’ll probably have to stop your teams from jumping straight to CloudTrail or it will get pretty boring fast!

Make it real: Try and present your problems to the CloudOps team as users would - an email with screenshots, error messages etc. 

Rehearse your breaks: Test your breakages out beforehand - no need to keep people waiting for you to break things. 

Break more than you want: Plan to break more things than you do - just in case some don’t work or are solved really easily!   

Bring snacks!: It helps keep things relaxed and energy levels up. Our preference is for doughnuts. 

Take regular breaks!: After every 2-3 problems is ideal.   
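Two of the tips above, having a rollback for every breakage and timeboxing each one, can be combined in a small helper. The following is a minimal Python sketch under stated assumptions: the `breakage` helper and the in-memory `env` dictionary are hypothetical stand-ins for a copied test environment, and the rollback function is assumed to be safe to call even if the breakage was only partly applied.

```python
import time
from contextlib import contextmanager


@contextmanager
def breakage(apply, rollback, timebox_minutes=45, clock=time.monotonic):
    """Apply a breakage and always run its rollback afterwards.

    Hypothetical helper: `apply` and `rollback` are plain callables
    supplied by the organisers, not part of any cloud SDK.
    """
    deadline = clock() + timebox_minutes * 60
    try:
        apply()
        # Hand back a function reporting the minutes left in the timebox.
        yield lambda: (deadline - clock()) / 60
    finally:
        # Runs on success, on error, or if the session is abandoned.
        rollback()


# A dictionary stands in for a copied (non-production) environment.
env = {"https_allowed": True}


def block_https():
    env["https_allowed"] = False


def allow_https():
    env["https_allowed"] = True


with breakage(block_https, allow_https, timebox_minutes=30) as minutes_left:
    state_during_session = env["https_allowed"]  # environment is broken here
    remaining = minutes_left()                   # used to call time on the exercise

state_after_session = env["https_allowed"]       # rollback has restored it
```

Putting the rollback in a `finally` clause means the environment is restored even if something unexpected goes wrong part way through a session.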

Brainpower + Chaos Day = better apps 

Cloud Chaos days are a great way to help your team know what they don’t know. As with Netflix’s Chaos Monkey, this method not only tests your systems but your team as well, helping alleviate the fears of the C-suite by showing them the winning combination of smart technology and smart people. For example, The Met Office day showed that the fastest time to fix a problem was under five minutes. By implementing Cloud Chaos days you can help your IT team learn more, develop faster and be prepared for any turbulence along the way.   

James Wells, Systems Developer, Cloudreach 
