How to mitigate the risk and consequences of IT failures

(Image credit: Den Rise/Shutterstock)

Rapid advances in IT systems mean that people today live in a world radically different from that of 20 years ago. Almost everything we interact with on a daily basis is connected to technology, from travelling to work using contactless payments and checking our bank balances through mobile banking apps to doing our jobs online. All of that is possible thanks to the seamless integration of IT systems. However, when those IT systems fail, the consequences are huge: major disruption can be caused, money can be lost and a business’s reputation can be damaged.

Just this month, the FCA reported a 138 per cent rise in technology failures at finance firms between January and October 2018. Crucial errors are being made with simple computer updates, and the situation is made all the worse by a reported 18 per cent rise in cyber-attacks. Businesses need to put practical measures in place to ensure that IT systems remain available because, as Megan Butler, Director of Supervision at the FCA, states, IT failures are becoming “an increasing threat to UK customers.”

IT failures – The knock-on effects

Just this year we have seen a number of IT outages with serious knock-on effects for customers. In April, for example, an IT failure at air traffic control centre Eurocontrol grounded over 500,000 passengers, and a widespread computer failure at NHS Wales prevented GPs from accessing medical test results and resulted in a backlog of patients. The cost of these failures is no small amount either; a router failure at Southwest Airlines in the US led to an estimated $54 million to $82 million in lost revenue and increased costs. In fact, recent research from Gartner revealed that IT downtime costs organisations $5,600 per minute, so businesses must make sure they are implementing an airtight IT strategy to prevent such damaging events from happening.
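To put the per-minute figure in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the quoted $5,600 per minute applies at a constant rate for the whole outage, which real incidents rarely do, and the function name and the outage durations are hypothetical.

```python
# Per-minute downtime cost quoted in the article (attributed to Gartner).
COST_PER_MINUTE_USD = 5_600


def estimated_downtime_cost(minutes: float,
                            cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Linear estimate: assumes a constant cost for every minute of the outage."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Hypothetical outage durations, chosen only to show the scale involved.
for label, minutes in [("1 hour", 60), ("8 hours", 8 * 60), ("48 hours", 48 * 60)]:
    print(f"{label}: ${estimated_downtime_cost(minutes):,.0f}")
```

On this simple model a one-hour outage costs $336,000 and a 48-hour outage just over $16 million, which is why even short interruptions justify investment in prevention.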

Businesses don’t just pay for these events in lost revenue but also in reputational damage, and nowhere is this clearer than in the fallout from the TSB IT failure. In April 2018, 1.9 million TSB customers were locked out of their accounts after an IT upgrade led to an online banking outage. While TSB had planned the upgrade and warned customers in advance that its internet and banking services would be disabled for one weekend, the IT failure actually resulted in months’ worth of disruption. Needless to say, customers were furious. But so was TSB’s parent corporation, Banco Sabadell, whose board reportedly discussed selling TSB following the botched IT upgrade due to the reputational damage.

Success is failure turned inside out

However, as Richard Branson said, “you don’t learn to walk by following rules. You learn by doing, and falling over.” There is a lot that can be learnt from these IT failures to help future-proof organisations, and progress is being made on this front. For example, in reaction to the TSB incident, a joint initiative from the Bank of England and the Financial Conduct Authority required banks to report their exposure to risks and their response measures for outages by 5 October this year. The findings have yet to be reported, but needless to say many banking corporations are waiting with bated breath for the insight this analysis will provide. As well as this, in mid-August, Britain’s five biggest banks (Barclays, Lloyds Banking Group, HSBC, Santander, and the Royal Bank of Scotland) stated that they suffered 64 payment outages in Q2 alone, and it has been suggested that a maximum outage time of two days will be introduced in banking. Other businesses would be wise to follow this lead, as new standards for all sectors are likely not far away.

Practical practices

In the meantime, there are a variety of practices and principles that organisations can implement now in order to limit IT failures and downtime. Firstly, when going live with a new solution or upgrading a system (as was the case with TSB), a staggered implementation is a proven way to limit widespread risk. As well as this, an organisation can test the project by running a pilot programme, which affects only a select group of customers for a limited amount of time. Communication is key in this scenario, as the pilot is likely to take time away from users, who should be told to expect some issues. As such, the IT department needs to be prepared to keep an open line of communication with those who are affected.

When executed well, pilots are a valuable activity and greatly reduce the risks associated with production system rollouts. Organisations can also spin this to their advantage and present it as a benefit to the end-user. We have seen success with a similar strategy at Apple, which rolls out developer versions of system updates to select users who want to try out the latest features for their phone or Mac.
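One common way to implement a staggered rollout or pilot of this kind is to assign each customer to a stable bucket and widen the enabled range stage by stage. The sketch below is a minimal illustration under stated assumptions, not a description of how any organisation named here does it: the function names, the salt, the stage percentages and the customer IDs are all hypothetical.

```python
import hashlib


def rollout_bucket(customer_id: str, salt: str = "upgrade-pilot") -> int:
    """Map a customer to a stable bucket from 0 to 99 by hashing their ID.

    The same customer always lands in the same bucket, so nobody flips
    between the old and new system as the rollout widens.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(customer_id: str, percent: int, salt: str = "upgrade-pilot") -> bool:
    """Return True if this customer is included at the given rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(customer_id, salt) < percent


# Hypothetical stages: a small pilot first, then progressively wider groups.
STAGES = [1, 5, 25, 100]
customers = [f"customer-{n}" for n in range(1000)]

for percent in STAGES:
    enabled = sum(in_rollout(c, percent) for c in customers)
    print(f"stage {percent:>3}%: {enabled} of {len(customers)} customers on the new system")
```

Because inclusion depends only on the bucket, every customer in the 5 per cent stage is also in the 25 per cent stage, and setting the percentage back to zero acts as a simple rollback switch if the pilot uncovers problems.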

Preparing for the worst-case scenario is also critical. Regularly testing continuity management processes, to ensure there is always a seamless disaster recovery or failover process in place, is just as important as any preventative measure. Then, if there is an IT failure, it can be resolved quickly and its consequences mitigated. Crisis planning should involve all stakeholders in the business in order to identify anything that could possibly go wrong, even cases that are highly unlikely, so that a relevant plan can be put into place. This process creates an effective secondary line of defence.
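A regular failover test can be as simple as deliberately treating the primary system as down and checking that a standby takes over. The following sketch illustrates the idea only: the service names and health checks are hypothetical stand-ins, and a real drill would exercise actual infrastructure rather than in-memory functions.

```python
from typing import Callable, Optional, Sequence, Tuple

# A service is a name plus a health check that returns True when healthy.
Service = Tuple[str, Callable[[], bool]]


def pick_active(services: Sequence[Service]) -> Optional[str]:
    """Return the first service, in priority order, whose health check passes."""
    for name, is_healthy in services:
        try:
            if is_healthy():
                return name
        except Exception:
            continue  # a health check that crashes counts as unhealthy
    return None


def failover_drill(services: Sequence[Service]) -> bool:
    """Simulate losing the primary and confirm that a standby takes over."""
    if len(services) < 2:
        return False  # nothing to fail over to
    primary_name, _ = services[0]
    degraded = [(primary_name, lambda: False), *services[1:]]
    active = pick_active(degraded)
    return active is not None and active != primary_name


# Hypothetical setup: both systems currently report healthy.
services = [("primary", lambda: True), ("standby", lambda: True)]
print("active now:", pick_active(services))
print("drill passed:", failover_drill(services))
```

Running a drill like this on a schedule surfaces a broken standby during a test rather than during a real outage.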

In the case of a serious IT failure, reacting in the correct way is vital. Some organisations can be tempted to make hasty personnel decisions for PR benefits, but improvement comes from knowledge and knowledge comes from learning. It is a waste of time and money to punish an engineer in a reactionary manner, because they’ll be less likely to give the necessary details on how and why the failure happened. Furthermore, an IT failure is rarely the fault of one individual. Following any incident, an organisation must analyse its cause and apply changes to prevent it from happening again. This process demands an increasingly mature model, and one that organisations should invest in over time. In fact, many aspects of IT Service Management apply here: from major incident management, through service continuity, to problem, knowledge and change management, and then into continual service improvement.

Much of our daily lives revolves around technology, and the seamless experiences that we demand are facilitated by IT. The threat that IT outages pose to businesses is critical and needs to be proactively mitigated to ensure that our ‘always on’ world is, in fact, always on. Organisations need to collectively invest in order to reduce risks and implement policies that strengthen IT and, in turn, strengthen the business.

Kevin J Smith, SVP, Ivanti

Kevin J Smith is the SVP at Ivanti.