IT complexity and the art of high reliability

There’s no doubt that IT has been a major productivity driver for businesses of all sizes. Businesses today operate at unprecedented scale and speed, and many regularly rely on IT to deliver products and services that would have seemed unimaginable only a few years ago.

The IT systems that support these businesses are more complex than ever, and they are embedded in more parts of the business. Marc Andreessen’s popular Wall Street Journal essay “Software Is Eating the World” captures the shift: IT now enables businesses to move faster while, at the same time, those businesses’ demands on it keep growing.

This has driven a massive increase in the size and scope of IT departments. Since failures in IT systems cannot be avoided entirely, modern IT operations teams strive to minimise the impact of failure by increasing the responsiveness of both systems (e.g. automatic failover) and people (e.g. incident response) when problems arise.

Often, a disruption in the infrastructure is no longer just an internal matter – it’s a customer service event that can erode confidence both inside and outside an organisation. Even when not directly customer-facing, IT systems have become so important that their unavailability can cripple internal business processes. In severe cases, the valuation of the business as a whole could very well be at stake.

Businesses must now make certain that processes are in place to minimise the impact of IT disruptions. More importantly, IT teams must invest in software automation and mature practices that allow them to scale proactively, support the speed the business demands and reduce the impact of failures.

IT Automation Introduces New Benefits and Risks

To enable scalability, organisations have been investing in processes that enable them to manage IT more effectively.

First came a wave of virtualisation that increased infrastructure efficiency and utilisation rates. Now organisations are moving towards a new software-defined data center era in which the same number of IT administrators, or even fewer, will be needed to programmatically manage orders of magnitude more IT infrastructure.

In a world where IT Operations infrastructure is highly automated, however, a single error can propagate instantly across a global network of data centers and bring the business to a screeching halt. Without a well-defined set of policies and processes for dealing with failures in that automation, one error or misconfiguration can set off a chain of cascading catastrophic events across an entire digital ecosystem.

As investors or key stakeholders realise how much of the business is dependent on IT and that failure will happen, they will become more aware of the level of real risk involved in managing modern IT Ops systems. It’s only a matter of time before they start demanding to know what kind of incident response plan the organisation has in place to mitigate those risks.

Legacy IT Ops Systems Aren’t Cutting It

Because of legacy processes, most organisations today don’t have an incident management process or plan in place. As evidenced by a recent survey conducted by Forrester Consulting, the primary expectation the business has of IT Ops is to maintain system availability and uptime.

At the same time, however, business executives also make it very clear that they want the same groups to deliver additional innovative services rapidly. And therein lies a great paradox: many businesses err on the side of caution and test systems exhaustively, but this approach doesn’t cut it anymore. We must be willing to deploy, learn in production, and then fix problems fast.

In an ideal world, IT Ops systems should be engineered to absorb failure. Past a certain base amount of testing, the ability to move quickly – to notice the problem, fix it, and redeploy – is what makes a business “safe.” Although IT environments are more complex than ever, the distributed virtual and physical servers deployed within them provide a high degree of resiliency when properly architected. When one server goes down, clients should, in theory, automatically be shifted to another server to avoid disruption in service.
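To make that failover idea concrete, here is a minimal sketch in Python of a client that shifts to another replica when one server stops responding. The endpoint names, timeout and retry behaviour are illustrative assumptions, not a description of any particular product.

```python
# Minimal sketch: client-side failover across replicated servers.
# Endpoints and timeouts below are hypothetical examples.
import urllib.request
import urllib.error

REPLICAS = [
    "https://app-1.example.internal",
    "https://app-2.example.internal",  # standby in another data center
]

def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each replica in turn; shift to the next one if a server is down."""
    last_error = None
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # this replica is unavailable; move on
    raise RuntimeError(f"all replicas failed: {last_error}")
```

In practice the same shifting logic usually lives in a load balancer or service mesh rather than in each client, but the principle is identical: detect the failure quickly and route around it automatically.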

Unfortunately, most IT environments today rely on manual change management processes that either exist in one person’s head or have been poorly documented in a spreadsheet or, worse yet, on a piece of paper. Without automated change management processes, IT architectures cannot remain resilient because recovery will take too long.
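As a sketch of what automated change management can mean in practice, the snippet below records each change as a structured, machine-readable entry instead of a spreadsheet row, so responders can see what changed, who changed it, and how to roll it back. The field names and the example rollback command are assumptions for illustration only.

```python
# Minimal sketch: changes recorded as structured data, not tribal knowledge.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ChangeRecord:
    system: str        # which system was changed
    description: str   # what was changed
    author: str        # who made the change
    rollback_cmd: str  # how to undo it quickly if recovery is needed
    timestamp: str = ""

def log_change(record: ChangeRecord, path: str = "changes.jsonl") -> None:
    """Append the change to an audit log that tooling and responders can read."""
    record.timestamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")

# Hypothetical example entry.
log_change(ChangeRecord(
    system="payments-api",
    description="raised connection pool size from 50 to 100",
    author="alice",
    rollback_cmd="kubectl rollout undo deployment/payments-api",
))
```

The point is not the format itself but that every change becomes queryable: during an incident, the question “what changed recently, and how do we undo it?” can be answered in seconds rather than hours.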

All it takes is a few key missteps to create a disruption that leads to massive revenue loss. Downtime worries investors and stakeholders, and pushes customers to consider taking their business to a competitor.

Modern IT Ops Systems Deliver Uptime and Agility

Achieving true IT resiliency requires leaving legacy ideals behind and employing the right people, processes and systems. Efficiently managing the people and processes is key. Even more critical is having a system that can identify sources of potential issues long before they become problems.

In short, the days when IT departments operate in a perpetual crisis mode are coming to an end. Now that businesses are more dependent on IT than ever, business leaders need to know that the IT applications and processes they rely on are stable.

Given the complexity of modern IT environments, IT Ops systems must now keep track of what changes need to be made and by whom. This policy-driven approach helps mitigate the impact of the inevitable factors that lie beyond the IT organisation’s control.

The difference between a great IT organisation and a merely good one is its crisis response. In a great IT Ops organisation, the response should not be created once the crisis occurs. Rather, it should be a well-defined set of processes that is second nature to all concerned and codified in automated systems that help guide the responders through the incident management process.
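One way such codification can look, sketched below, is an escalation policy expressed as data and walked level by level until someone acknowledges the incident. The responder names, acknowledgement windows and notification stubs are hypothetical, not a description of any specific incident management product.

```python
# Minimal sketch: an escalation policy as data, walked level by level.
# Responders, timings, and the notify/ack mechanics are illustrative stubs.
import time

ESCALATION_POLICY = [
    {"responder": "on-call-primary",     "ack_window_s": 300},
    {"responder": "on-call-secondary",   "ack_window_s": 300},
    {"responder": "engineering-manager", "ack_window_s": 600},
]

def notify(responder: str, incident: str) -> None:
    print(f"paging {responder} about: {incident}")

def acknowledged(responder: str) -> bool:
    return False  # stub: a real system would check an incident-tracking API

def run_escalation(incident: str) -> None:
    """Page each level in order; stop as soon as someone acknowledges."""
    for level in ESCALATION_POLICY:
        notify(level["responder"], incident)
        deadline = time.time() + level["ack_window_s"]
        while time.time() < deadline:
            if acknowledged(level["responder"]):
                return
            time.sleep(30)
    print("no acknowledgement received; escalating to leadership")
```

Because the policy lives in a system rather than in anyone’s head, the response is the same at 3 a.m. on a holiday as it is during business hours, which is exactly the “second nature” quality described above.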

Instead of measuring system and application uptime metrics in isolation, where they mean little, the focus of the IT Ops organisation must shift to the actual level of business innovation that IT enables through the consistent delivery of IT services. The real challenge is achieving that uptime in a way that doesn’t come at the expense of business agility.

Of course, there’s always going to be tension between speed and availability. The role of the technology leader in a digital economy is to make certain that when those eventual IT missteps occur, they don’t become an event from which the business cannot fully recover.

Tim Armandpour, VP of Engineering, PagerDuty