
DevOps by the numbers - what metrics should you be keeping an eye on?


Companies are using more cloud services, and they are adopting more agile development practices as they do. According to Gartner, public cloud spending rose by more than 40 percent in 2020 to $64.3 billion worldwide. Meanwhile, the State of Agile report estimated that agile adoption increased from 37 percent in 2020 to 86 percent in 2021, while 75 percent of teams said DevOps was essential to how they worked. Both cloud and agile adoption grew massively over the previous year, and the pandemic was probably a big factor in those choices.

However, how prepared are teams to track their environments when everything is distributed, running in the cloud, and based on fast response to change rather than long-term development strategies? Metrics are essential to keep any new project on track, even when that track meanders more than it used to as requirements change. So what metrics should you track?

Looking at the whole approach to development and DevOps

To start with, it is worth looking at what DevOps is there to achieve. The main goal for DevOps since it first started has been to make the process of getting software into production easier and more efficient. This involved reducing the organizational silos that would get in the way of moving code through to production, but it also produced some best practices along the way.

The first of these is that change should be implemented gradually, rather than taking place in big steps. By making small changes, the overall impact should be reduced and the risk of any change taking down systems should come down as well. Conversely, this should also make it easier to make more changes over time - rather than waiting for monthly or quarterly release windows to make updates, these can be rolled out more regularly. According to the benchmarking results in the DevOps Research & Assessment report, high-performing software teams can release updates between once per day and once per week, while elite level teams release updates to production multiple times per day.

Alongside this ability to release more often, the DevOps process should help releases be more consistent. Making this happen puts more emphasis on how software goes through the continuous integration/continuous deployment pipeline, and the tools used as part of the process. Automating steps where possible helps to improve that consistency, and also provides more data on how things operate.

This data is essential to show where things are working well, and where things are either in need of a fix or where improvements can be made. This data-led approach is more common across businesses today as everything goes “data-driven” and software development is no exception. The big change is that this role for data should be less reactive - looking for issues in logs, say - and more proactive. This also links into the other big shift in mindset that DevOps calls for around problems or failures that might arise.

Developers are human, and they will make mistakes. With so many changes taking place, the chance of something affecting performance exists and inevitably something will happen that affects the application. Ideally, this gets caught in testing, but it may make its way through to production. With smaller and more discrete changes planned, it should be possible to roll back any problem and fix the issue quickly. Thorough post-mortem discussions will still take place, but with the goal to prevent similar problems in the future rather than seeking to blame individuals.

What should we measure?

So, we have a different approach in mind - more automation in place, measurement of all infrastructure and processes, more frequent releases, and less blame for failure. How should we translate those lofty goals into things that we can measure ourselves against and see how we are progressing?

The first item to measure should be Change Lead Time, as this tracks the time it takes for a project to go from inception to implementation. This involves looking at each specific project or set of features that you have to work on, and how long it takes for that project to get into production. As larger projects will generally be broken up into smaller sets of features or changes, this metric doesn't mean you have to track the delivery of an entire software product in one go.

Change Lead Time measures how quickly you can gather requirements and then turn those business requests into software updates. By understanding how long it takes to put together code, test it, and then deliver that feature to production, you can see how effective your team is in practice.

A shorter Change Lead Time on average should indicate that your team is working well and able to deliver new features to customers quickly. It should also indicate how well your team can adapt to feedback from those customers over time. This metric also helps to demonstrate how well the team is doing at implementing smaller, more gradual changes in line with DevOps best practices. Tracking this over time gives you a good idea of how far ahead you can plan and how well you are performing.
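Tracking this over time can be as simple as averaging the gap between when work started and when it reached production. The sketch below assumes hypothetical change records with `started` and `deployed` timestamps; real teams would pull these from an issue tracker and a deployment log.

```python
from datetime import datetime, timedelta

# Hypothetical records: when work on each change began and when it reached production.
changes = [
    {"started": datetime(2021, 6, 1, 9, 0), "deployed": datetime(2021, 6, 3, 15, 0)},
    {"started": datetime(2021, 6, 2, 10, 0), "deployed": datetime(2021, 6, 4, 10, 0)},
    {"started": datetime(2021, 6, 7, 9, 0), "deployed": datetime(2021, 6, 8, 9, 0)},
]

def average_lead_time(changes):
    """Mean time from inception to production across a set of changes."""
    total = sum((c["deployed"] - c["started"]).total_seconds() for c in changes)
    return timedelta(seconds=total / len(changes))

print(average_lead_time(changes))  # 1 day, 18:00:00
```

A falling average across sprints suggests the team is breaking work into smaller pieces and shipping them faster; a rising one is an early warning worth investigating.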

Alongside Change Lead Time, Deployment Frequency is another good measure of how quickly your team can deliver new features to production. While Change Lead Time tracks the speed to fulfill requirements, Deployment Frequency looks at how often those updates get released successfully. Smaller, more rapid deployments keep feature releases small, and they also reduce the likelihood that bugs will creep into production. As with Change Lead Time, this metric helps satisfy the DevOps pillar of implementing gradual yet rapid change.

A very productive team in an agile organization would ideally be able to deliver several releases per week, sometimes as often as several times per day. This normally indicates that the organization's automation and tooling (such as CI/CD pipelines) are up to the task of rapid software iteration and delivery.
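Deployment Frequency is just a count of successful releases normalized to a time window. A minimal sketch, assuming a hypothetical list of production deployment dates over a two-week window:

```python
from datetime import date

# Hypothetical production deployment dates for one team over a 14-day window.
deployments = [
    date(2021, 6, 1), date(2021, 6, 1), date(2021, 6, 2),
    date(2021, 6, 3), date(2021, 6, 7), date(2021, 6, 8),
    date(2021, 6, 8), date(2021, 6, 9), date(2021, 6, 10),
]

def deploys_per_week(deployments, window_days=14):
    """Average number of production deployments per 7-day period."""
    return len(deployments) / window_days * 7

print(deploys_per_week(deployments))  # 4.5 deployments per week
```

In practice the input would come from your CI/CD pipeline's deployment history rather than a hand-written list.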

Alongside these two metrics, it is also worth looking at the quality of all those changes coming through. The ability to deliver code quickly is one thing, but if those changes then need rework all the time then the result will be less impressive. To measure this in metric terms, Change Failure Rate (CFR) is the percentage of all changes that were delivered to production, but either failed or had bugs or defects. Changes with severe defects frequently result in rollbacks, which can hurt other metrics such as uptime and revenue.

Delivering smaller releases rapidly is an indication of healthy DevOps processes as well as a healthy DevOps culture. However, frequent failures can be indicative of poor code quality, poor testing, too much pressure on teams to deliver features too quickly, or other problems within the software development process.

In a perfect world, your CFR metric should be zero. In reality, good teams implementing software well will see their CFR in the very low single digits. If you start experiencing double-digit CFR, then you should look at your testing and software quality stages to see what is going wrong. While you may have problems with software that is very buggy, the other reasons for high CFR are that your software release cycles are not working effectively. In these circumstances, look at how you can break your projects down further or allow more time for testing to improve your success rates.
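As a formula, CFR is simply failed deployments divided by total deployments, expressed as a percentage. A minimal sketch with hypothetical numbers:

```python
def change_failure_rate(total_deploys, failed_deploys):
    """Percentage of production changes that failed or needed remediation."""
    if total_deploys == 0:
        return 0.0  # avoid division by zero when no deployments happened
    return failed_deploys / total_deploys * 100

# Hypothetical quarter: 200 deployments, 6 of which caused incidents or rollbacks.
print(change_failure_rate(200, 6))  # 3.0 -> low single digits, a healthy range
```

The hard part is not the arithmetic but agreeing on what counts as a "failed" change; teams typically include any deployment that triggered a rollback, hotfix, or incident.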

What else to track? 

Alongside these three elements, there are some other metrics that are useful for the whole IT team to consider. The first is Mean Time To Detection, or MTTD, which describes how long it takes for a problem to arise and be detected, either by humans or by automated tracking of service levels. If you are not able to spot these issues quickly, then it is more likely that they will lead to interruptions in service or poor performance. The longer it takes to detect a problem, the more likely it is that other changes will be piled on top of it, making it more difficult to sort out which specific change caused the problem in the first place. You can reduce your MTTD by implementing solid monitoring and observability best practices, looking at your tooling, and improving your service level indicators.

MTTD is also a common metric for security teams to track - while security operations center (SOC) analysts look for potential issues in the enterprise network to investigate, developers will look out for issues with availability or performance. These two definitions do cross over and analysis can be based on the same set of data, so collaborating on this can be very helpful over time. 

Alongside MTTD, Mean Time To Recovery, or MTTR, is another metric that should be tracked over time. MTTR is measured from the moment a problem is detected to the moment the system reverts to a baseline state. Recovery is typically accomplished through rollbacks to a 'known-good' state. Rapid and easy rollbacks result in a lower MTTR, and ideally, your MTTR will be measured in seconds.

If your MTTR is higher than this - for example, if it takes longer than a few minutes to go back to a previous state - then it means that your approach to DevOps will need to be updated so that you can improve your processes. A poor MTTR will also be a leading indicator for impact on any Service Level Agreements that you have in place. In order to improve your MTTR, you can improve your automation steps across your CI/CD pipelines and look at your incident response and rollback procedures. Again, talking to your security team can help as they should already have solid recovery processes in place that you can either learn from or adapt.
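Both MTTD and MTTR fall out of the same incident records: MTTD averages the gap between a fault occurring and being detected, while MTTR averages the gap between detection and recovery. A sketch using hypothetical incident timestamps:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when the fault occurred, was detected, and was resolved.
incidents = [
    {"occurred": datetime(2021, 6, 1, 10, 0),
     "detected": datetime(2021, 6, 1, 10, 4),
     "resolved": datetime(2021, 6, 1, 10, 9)},
    {"occurred": datetime(2021, 6, 5, 14, 0),
     "detected": datetime(2021, 6, 5, 14, 10),
     "resolved": datetime(2021, 6, 5, 14, 25)},
]

def mean_delta(incidents, start, end):
    """Average time between two timestamps across all incidents."""
    total = sum((i[end] - i[start]).total_seconds() for i in incidents)
    return timedelta(seconds=total / len(incidents))

mttd = mean_delta(incidents, "occurred", "detected")  # detection lag
mttr = mean_delta(incidents, "detected", "resolved")  # recovery time
print(mttd, mttr)  # 0:07:00 0:10:00
```

In production, the "occurred" timestamp is often inferred after the fact (from logs or the offending deployment), which is one reason good observability tooling matters for keeping MTTD honest.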

DevOps aims to improve how software development teams perform over time based on a mix of better tooling, more automation and good data. To demonstrate this in action, getting the right metrics approach in place is essential. Using continuous intelligence data from your software development operations and from your CI/CD pipelines can help you fulfill your potential.

Iain Chidgey, Vice President EMEA, Sumo Logic