Earlier this year, PagerDuty shared global research into the impact of unplanned IT work on companies’ abilities to deliver on their business priorities and ensure the wellbeing of their IT responders. The research highlighted risks associated with unplanned work as well as some of the strategies companies are using to mitigate them.
Since then, those strategies have been put to the test in ways few of us could have anticipated. As companies mobilised to enable the new ways of working required by the coronavirus pandemic, their systems felt the strain.
We saw a doubling of incidents across all sectors and a massive 11x increase for those sectors most impacted by the pandemic. Incidents arose from both a surge in traffic and a higher than usual number of code and infrastructure changes.
The good news is that, by moving quickly into hypercare mode, IT responders were able to prevent the majority of issues from escalating. Resolution was as much as 62 per cent faster in some sectors.
For most, hypercare meant meeting unplanned work with an unrelenting focus on reliability, customer experience and real-time resolution. Hypercare works but it can be challenging to maintain in the long run.
Now is a good time to take stock and look at what we need to do, from an operational perspective, as we start to move towards a ‘new normal’. Here are some things companies may want to consider:
First, a word about IT employee wellbeing
Many IT professionals have been operating in crisis mode for weeks now – often from home and often with a child or two in tow. I am sure I am not alone in wanting to extend a heartfelt thanks to all those who have worked so hard on the digital front line.
Unplanned work can impact the wellbeing of IT responders. Our research linked it directly to a rise in anxiety – which can lead to absenteeism and even employee churn. In fact, almost a third of respondents said they had considered leaving a job because of unplanned work in the past.
If companies are not already doing so, it’s worth monitoring when and why individual IT responders receive alerts and making sure workloads are distributed fairly. In the longer term, it’s also worth benchmarking the health of individual teams against the health of teams in similar organisations in order to assess and hone wellbeing strategies.
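As a rough illustration of what monitoring alert distribution might look like, here is a minimal Python sketch. It assumes alerts have already been exported as simple (responder, timestamp) pairs; the sample data, the 09:00 to 18:00 working day and the 1.5x threshold are all hypothetical choices for illustration, not figures from our research or part of any PagerDuty API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert export: (responder, ISO 8601 timestamp in local time).
ALERTS = [
    ("asha", "2020-05-04T02:15:00"),
    ("asha", "2020-05-04T23:40:00"),
    ("asha", "2020-05-05T10:05:00"),
    ("asha", "2020-05-06T03:30:00"),
    ("ben", "2020-05-05T11:00:00"),
    ("chloe", "2020-05-06T19:20:00"),
]


def is_out_of_hours(timestamp, start_hour=9, end_hour=18):
    """Return True if the alert fired outside the assumed working day."""
    hour = datetime.fromisoformat(timestamp).hour
    return hour < start_hour or hour >= end_hour


def summarise_load(alerts):
    """Count total and out-of-hours alerts for each responder."""
    total = Counter()
    out_of_hours = Counter()
    for responder, timestamp in alerts:
        total[responder] += 1
        if is_out_of_hours(timestamp):
            out_of_hours[responder] += 1
    return {
        name: {"total": total[name], "out_of_hours": out_of_hours[name]}
        for name in total
    }


def overloaded(summary, factor=1.5):
    """List responders whose alert count exceeds `factor` times the team mean."""
    mean = sum(s["total"] for s in summary.values()) / len(summary)
    return sorted(name for name, s in summary.items() if s["total"] > factor * mean)


if __name__ == "__main__":
    summary = summarise_load(ALERTS)
    for name, stats in sorted(summary.items()):
        print(name, stats)
    print("Above threshold:", overloaded(summary))
```

A count like this only shows when the load falls unevenly. Why a particular responder receives more alerts still needs a conversation with the team.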
Walk a mile in your customers’ shoes - but let technology take the strain
Our research showed that even under normal circumstances almost all EMEA IT responders find it difficult to deliver great customer experience. A massive 73 per cent said they were more likely to find out about an IT incident from a dissatisfied customer than from their own systems. With so many IT responders focused on unplanned work right now, that figure is likely to have grown. Technologies such as automation and machine learning can help take the strain – identifying potential incidents and alleviating their impact before they reach customers.
Be ready to communicate more
An increasingly dispersed and technology-reliant leadership team, workforce and customer-base makes good communication more important than ever in the event of an incident. A lack of communication fosters mistrust, slows resolution and can further damage customer experience.
It’s worth reviewing IT communications strategies in the light of new ways of working. If increased strain on systems results in an outage, the leadership team, HR, customer support and other key business functions will all need to be engaged.
In the middle of a major incident or change, the last thing responders want is to have people from across the company getting in touch to ask what’s going on. Plan to make information available internally for anyone who wants it whenever they want it. That might be via company-wide email updates or a status page on the company website. Key stakeholders might warrant further measures. For example, APIs can be used to create a custom dashboard displaying information like the number of incidents open, their severity, and contact information for on-call engineers.
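To make the dashboard idea concrete, here is a minimal Python sketch of the summarising step. It assumes incident and on-call records have already been fetched from whatever incident management API is in use; the field names, sample records and contact address below are hypothetical and do not represent a real API response.

```python
from collections import Counter

# Hypothetical records, standing in for data fetched from an incident API.
INCIDENTS = [
    {"id": "INC-1", "status": "open", "severity": "high"},
    {"id": "INC-2", "status": "resolved", "severity": "low"},
    {"id": "INC-3", "status": "open", "severity": "low"},
    {"id": "INC-4", "status": "open", "severity": "high"},
]
ON_CALL = [{"name": "Asha", "contact": "asha@example.com"}]


def build_dashboard(incidents, on_call):
    """Summarise open incidents by severity alongside on-call contacts."""
    open_incidents = [i for i in incidents if i["status"] == "open"]
    by_severity = Counter(i["severity"] for i in open_incidents)
    return {
        "open_count": len(open_incidents),
        "by_severity": dict(by_severity),
        "on_call": [f"{p['name']} <{p['contact']}>" for p in on_call],
    }


def render(dashboard):
    """Render the summary as plain text for an email update or status page."""
    lines = [f"Open incidents: {dashboard['open_count']}"]
    for severity, count in sorted(dashboard["by_severity"].items()):
        lines.append(f"  {severity}: {count}")
    lines.append("On call: " + ", ".join(dashboard["on_call"]))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(build_dashboard(INCIDENTS, ON_CALL)))
```

Keeping the summary separate from the rendering means the same figures could feed an email update, a status page or a stakeholder dashboard without being recalculated.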
Where customers are directly impacted, it’s important to let them know what’s happening and to do so as quickly as possible. They might have anxious customers of their own. Being proactive will help them to better manage onward communications.
Ensure your incident recovery plans are fit for purpose
Current events have thrown the need for an effective incident recovery plan into sharp relief. The sad truth is that even before the outbreak the majority of plans were not fit for purpose and, our research suggested, those in EMEA were worse than those in other regions. A massive 70 per cent of IT responders said they had experienced issues their response plans failed to account for. Now is the time to reassess incident recovery plans to ensure they effectively reflect both existing and newly implemented ways of working.
Find areas for improvement
It’s unlikely that the issues we are facing are going to go away any time soon. It’s to everyone’s benefit that we learn to respond to them more effectively. One way to do so is to use an incident post-mortem to identify areas for improvement and strategies to address them. Tensions may be running high right now but it’s important to encourage everyone to speak up and to avoid attributing blame.
Accentuate the positive
Finally, don’t forget that there are also opportunities to build on the many things that worked well in the scramble to respond to changes brought about by the virus.
New ways of working created a need for IT teams to forge closer and more empathetic working relationships with those in other parts of the organisation – with human resources, for example.
Maintaining such relationships will make responding to whatever unplanned work is thrown at us in the future just a little bit easier on everyone. And we could all do with that.
Steve Barrett, Vice President EMEA, PagerDuty