You wouldn’t drive a car with untested airbags, so why do you use systems without Chaos Engineering?
Daydream with me. Imagine the slow-motion view of shattered glass flying in the air, while the 1.5 tonnes of steel that used to be a car seconds ago is slowly hitting a concrete block. Notice the multitude of high-speed cameras, measurement devices and the test dummy.
Virtually every new car goes through this process to verify that all of the components work together to achieve its most important objective: protecting your life. Of course, before reaching this stage, all of the pieces are tested in isolation. But until they’re all put together, it’s almost impossible to predict the behavior of the finished product during an accident. And that’s why the automotive industry spends millions destroying their own products every year.
It sounds counterintuitive at first, but it really isn’t. In any sufficiently complex system, there are emergent properties that are notoriously difficult to predict - unexpected interactions that only manifest themselves in certain conditions. So, the only way to ensure your system continues working as expected is to create these conditions, observe the system and measure the results.
This deliberate, controlled, scientific method of experimenting on a production system to increase confidence in it working as expected is what we call Chaos Engineering.
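To make that loop of "create the conditions, observe, measure" a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy dependency, the 20 percent injected failure rate, the three-attempt retry policy and the 95 percent success threshold are hypothetical, and the whole thing runs in memory rather than against a real system. It only shows the shape of an experiment: measure a baseline, inject failure, measure again and compare against a hypothesis.

```python
import random
from typing import Callable


def flaky(dependency: Callable[[], str], failure_rate: float,
          rng: random.Random) -> Callable[[], str]:
    """Wrap a dependency so that a fraction of calls raise (the injected failure)."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return dependency()
    return wrapped


def call_with_retries(dependency: Callable[[], str], attempts: int = 3) -> str:
    """The behavior under test: retry a failing dependency a few times."""
    last_error = None
    for _ in range(attempts):
        try:
            return dependency()
        except ConnectionError as err:
            last_error = err
    raise last_error


def success_rate(handler: Callable[[], str], requests: int = 1000) -> float:
    """Observe the system: fraction of requests that complete without error."""
    ok = 0
    for _ in range(requests):
        try:
            handler()
            ok += 1
        except ConnectionError:
            pass
    return ok / requests


def run_experiment(failure_rate: float = 0.2, threshold: float = 0.95,
                   seed: int = 42) -> dict:
    """Hypothesis: the success rate stays at or above `threshold` under failure."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable

    def healthy() -> str:
        return "ok"

    # 1. Measure the baseline without any injected failure.
    baseline = success_rate(lambda: call_with_retries(healthy))
    # 2. Create the failure conditions and measure again.
    degraded = flaky(healthy, failure_rate, rng)
    observed = success_rate(lambda: call_with_retries(degraded))
    # 3. Compare the observation with the hypothesis.
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": observed >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment())
```

In a real experiment the failure would be injected into actual infrastructure and the measurement would come from your monitoring, but the structure stays the same.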
Let’s be clear - Chaos Engineering doesn’t replace your customary ways of testing software - it adds to them. Testing that the airbag triggers when you apply pressure is the equivalent of unit testing. Verifying that hitting the car bumper with a big enough hammer deploys the airbags is like integration testing. And verifying that a test dummy comes out undamaged after smashing the entire car into a concrete block at 50 km/h is Chaos Engineering. In a way, it’s like end-to-end testing, but with failure scenarios, rather than the happy path.
It’s not about reckless testing in production either; that’s a common myth. Chaos Engineering can - and should - be applied at all stages of development and deployment. Yes, the holy grail is to be so confident in your testing that you introduce failure in production, reasonably expecting that the system will handle it gracefully. But in order to get there, you need to put in the work and apply the same principles you would when releasing any other piece of software.
It’s not only reserved for the large, mature projects with 100 percent test coverage and perfect documentation. Some people, when first hearing about Chaos Engineering, are wary that their systems might not be stable enough for them to start adding failure scenarios of their own. It’s a false assumption. Chaos Engineering is not there to add more chaos to your systems. It’s there to reduce the amount of chaos that’s already there by scientifically verifying that your assumptions about the systems’ behavior are correct. So the good news is that you can - and should - get started with Chaos Engineering as soon as possible, rather than waiting for the elusive point of stability.
The scale does matter, though. If your personal computer has worked without any issues for the last 5 years, it’s easy to assume that it’s essentially immortal. But even if we assume an amazing mean time to failure of 10 years, you only need to be running 3,650 servers to average one failure every day. So even if you’ve never written a bug in your entire life, and all of your software works flawlessly, and all the other software that you’re relying on works flawlessly, you are still subject to the same laws of physics as all matter in the universe, and you’re going to have to deal with failures. So why wait for problems to find you, if you can proactively find them?
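The arithmetic behind that number is easy to check. The sketch below treats the 10-year figure as an average time to failure and assumes that servers fail independently at a constant rate, with 365 days in a year; these are simplifying assumptions for illustration, not a reliability model, and the function names are made up for this example.

```python
DAYS_PER_YEAR = 365  # ignoring leap years, as the 3,650 figure does


def expected_failures_per_day(servers: int, mttf_years: float) -> float:
    """Average daily failures for a fleet of independent, identical servers."""
    return servers / (mttf_years * DAYS_PER_YEAR)


def servers_for_one_failure_per_day(mttf_years: float) -> int:
    """Fleet size at which the average reaches one failure per day."""
    return round(mttf_years * DAYS_PER_YEAR)


if __name__ == "__main__":
    print(servers_for_one_failure_per_day(10))
    print(expected_failures_per_day(3650, 10))
```

Under the same assumptions, a fleet ten times that size would average around ten failures a day.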
It’s also worth pointing out that Chaos Engineering is a methodology that works with all stacks, languages and technologies. So, it’s not only for shiny, new technologies. It works with systems of all sizes, shapes and forms - and you can benefit from it whether you’re a Fortune 100 company or a solo founder.
And if none of these arguments convinced you, let me try one more. Think back to the last time you were called outside of working hours because something went wrong with your system. Think about how it felt to wake up in the middle of the night, log in, try to gather context from all the noise and panic. Remember heavy eyelids, hot coffee and the cold keyboard. Would you like to try a technique that can help prevent these moments from happening? I thought so too.
Pure Cloud - Go native!
The primary driver for Chaos Engineering is the ever-increasing complexity of the systems we build. Empowered by the cloud, businesses can now grow bigger and faster than ever before - and with smaller engineering teams. New startups build from the ground up as Cloud Native - leveraging technologies like Kubernetes and service mesh and methodologies like microservices.
But in this Cloud Native, distributed and high-speed environment, the extra complexity, if left unchecked, can be devastating. For example, remember when your team celebrated getting rid of that legacy server? Migrating from a monolithic application to a series of microservices increased scaling capacity, visibility and fault-tolerance; but it also introduced a service mesh, exposure to networking issues (errors and slowness), retries and back-off algorithms and a brand-new set of debugging tools and techniques. Shifting responsibility from application teams to platform teams also means that the platform teams now have to handle large numbers of clients.
Chaos Engineering turns out to be an invaluable tool for handling this complexity. It lets you test the system as a whole and verify that it continues to work as expected during realistic outage scenarios. On top of this, Chaos Engineering experiments offer the best ROI you will ever get. With just a small, inexpensive experiment, you can detect a massive problem before your users do. What’s not to like about that?
Mikolaj Pawlikowski, engineering lead, Kubernetes, Bloomberg