ComputerWeekly’s Cliff Saran writes that ‘AWS Outage Shows Vulnerability of Cloud Disaster Recovery’ in his article of 6th March 2017. He cites the S3 outage suffered by Amazon Web Services (AWS) on 28th February 2017 as an example of the risks posed by running critical systems in the public cloud. “The consensus is that the public cloud is superior to on-premise datacentres, but AWS’s outage, caused by human error, shows that even the most sophisticated cloud IT infrastructure is not infallible”, he says.
Given that the AWS outage was caused by human error, the first question I’d ask is whether blaming the public cloud for the outage is fair. The second question I’d like to pose is: could this incident have been prevented by using a data acceleration solution that deploys machine intelligence to reduce the potential calamities caused by human error? In the case of the AWS S3 outage, a simple typographical error wreaked havoc to the extent that the company couldn’t – according to The Register – “get into its own dashboard to warn the world.”
With human error at fault, it doesn’t seem fair to blame the public cloud. What organisations like your own need, whether you are an AWS customer or not, are several business continuity, service continuity and disaster recovery options. They must be supported by an ability to back up and restore your data to any cloud in real-time. This means that your data and your resources should neither be concentrated in just one data centre, nor focused on one means of cloud storage. So when disaster strikes, your data is ready and your operations can switch to another data centre or to another disaster recovery site without damaging the ability of your business to operate.
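The principle of never keeping your only copy in one place can be sketched in a few lines. This is a minimal, illustrative example (the target names and dictionary "stores" are stand-ins for real cloud or data centre endpoints, not any particular provider's API): every backup is written to multiple independent targets with a checksum, so a restore can come from any intact copy.

```python
import hashlib

def backup_to_all(data: bytes, targets: dict) -> dict:
    """Write the same backup to every independent target and
    record a checksum so each copy can be verified on restore."""
    digest = hashlib.sha256(data).hexdigest()
    for name, store in targets.items():
        store["blob"] = data          # stand-in for a different cloud/data centre
        store["sha256"] = digest
    return {"copies": len(targets), "sha256": digest}

def restore_from_any(targets: dict) -> bytes:
    """Restore from the first target whose copy passes its checksum,
    so losing one provider does not mean losing the data."""
    for name, store in targets.items():
        blob = store.get("blob")
        if blob is not None and hashlib.sha256(blob).hexdigest() == store.get("sha256"):
            return blob
    raise RuntimeError("no intact copy available")

# Two independent stand-ins for separate providers or data centres
targets = {"cloud_a": {}, "dc_b": {}}
receipt = backup_to_all(b"critical records", targets)

# Simulate one provider suffering an outage: the other copy still restores
targets["cloud_a"].clear()
print(restore_from_any(targets))  # b'critical records'
```

The point of the checksum is that a surviving copy is only useful if it is verifiably intact; a real implementation would also verify copies periodically rather than only at restore time.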
Some experts believe that British Airways (BA) could have avoided its recent computer failure, which is expected to have cost £150m, if it had had the right disaster recovery strategies in place. The worldwide outage on 27th May 2017 left its passengers stranded at airports, and it has no doubt damaged the airline’s brand image, with newspaper reports predicting the demise of the company.
A blogger for The Economist wrote on 29th May 2017: “The whole experience was alarming. The BA staff clearly were as poorly informed as the passengers; no one in management had taken control. No one was prioritising those passengers who had waited longest. No one was checking that planes were on their way before changing flight times. BA has a dominant position, in terms of take-off slots, at Heathrow, Europe's busiest hub. On the basis of this weekend's performance, it does not deserve it.”
A couple of days later Tanya Powley and Nathalie Thomas asked in their article for The Financial Times: “BA’s computer meltdown: how did it happen?” They point out that “Leading computer and electricity experts have expressed scepticism over British Airways blaming a power surge for the systems meltdown that led to travel chaos for 75,000 passengers worldwide over the weekend.” Although BA has denied that human error was at fault, most experts don’t agree with the company’s stance.
Powley and Thomas explain: “Willie Walsh, chief executive of IAG, BA’s parent company, attributed the problem to a back-up system known as an “uninterruptible power supply” — essentially a big battery connected to the mains power — that is supposed to ensure that IT systems and data centres can continue to function even if there is a power outage.” Experts said that UPS systems rarely fail, and even when they do they shouldn’t affect the ongoing mains supply to a data centre. BA claimed that the incident had also caused damage to its IT infrastructure.
It now transpires that the power outage was due to a contract maintenance worker inadvertently turning off the power. Some commentators suggest that this might not be true; after all, BA will want to avoid the huge cost of any potential litigation, making it convenient to pass the buck. Yet as well as human error from a technical perspective being blamed, news agency Reuters points out that the company had engaged in cost-cutting exercises to enable it to compete with low-cost airline rivals Ryanair and EasyJet. This led many commentators to suggest that BA took too many short cuts to achieve this aim, resulting in an inability to keep going in the face of a systems failure.
The question still unanswered is: why didn’t the second synchronised data centre that BA has a kilometre away kick in like it should have done? The whole point of running two data centres that close together is to cope with exactly this situation. That’s the elephant in the room, and it’s a question no one seems to be asking.
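What “kicking in” should look like can be shown with a toy failover monitor. This is a simplified sketch, not BA’s actual setup: the class name, the health-check loop and the three-failure threshold are all illustrative assumptions, but they capture the basic logic a secondary site relies on, i.e. promote the synchronised secondary once the primary stops answering.

```python
class FailoverMonitor:
    """Toy failover monitor: after N consecutive failed health checks
    on the primary site, traffic is switched to the synchronised secondary."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.active = "primary"

    def record_health_check(self, primary_ok: bool) -> str:
        if primary_ok:
            self.failures = 0                 # primary recovered: reset the count
        else:
            self.failures += 1
            if self.failures >= self.threshold and self.active == "primary":
                self.active = "secondary"     # automatic promotion
        return self.active

monitor = FailoverMonitor(threshold=3)
for ok in [True, False, False, False]:        # primary goes dark
    site = monitor.record_health_check(ok)
print(site)  # secondary
```

In practice the hard parts are ensuring the secondary’s data is truly synchronised and that the promotion itself doesn’t depend on the failed infrastructure, which may be where BA’s arrangement fell down.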
Speaking about the AWS incident, David Trossell, CEO and CTO of Bridgeworks, comments: “Artificial intelligence (AI) is no match for human stupidity. Why do people think that just because it is “in the cloud”, they can devolve all responsibility to protect their data and their business continuity to someone else? Cloud is a facility to run your applications on – it is still up to you to ensure that your data and applications are safe and that you have back-up plans in place.” Without them you’ll have to suffer the consequences.
So the weakness doesn’t necessarily lie in the public cloud. “Someone made a mistake – someone can make the same error on premise: The difference is that one storage method has a wider impact, but for the individual company the effect is the same”, he says before asking, “Where was their DR plan?” He also points out that: “Companies invest in dual data centres to maintain business continuity, so why do they think that only having one cloud provider gives the same level of protection?” It quite clearly doesn’t. With data driving most businesses today, uptime must be maintained and prioritised.
So the fault whenever an outage occurs often lies with us humans, from poorly configuring a network to poor software development. In wide area network (WAN) terms it can lie in how the network and the interconnecting elements are managed. So to reduce human error there is a need to deploy machine learning. After all, machines don’t make typographical errors. Instead they can support us and enable us to focus on more strategic business and IT activities by automating, for example, the configuration of a network to reduce the impact of data latency and to reduce packet loss. In other words, machine learning and AI can make us more efficient.
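The kind of automated tuning described above can be sketched with a simple feedback rule. This is a deliberately naive illustration, not any vendor’s algorithm: the thresholds and the number of parallel streams are assumptions, but the shape of the loop is the point — measure the path, adjust, and remove the human from the knob-turning.

```python
def tune_streams(streams: int, loss_rate: float, rtt_ms: float) -> int:
    """Naive self-tuning rule for parallel WAN transfer streams:
    back off when packet loss appears, ramp up on a clean high-latency path."""
    if loss_rate > 0.01:          # more than 1% loss: reduce pressure on the path
        return max(1, streams // 2)
    if rtt_ms > 50:               # long path but no loss: parallelism hides latency
        return min(64, streams * 2)
    return streams

streams = 4
streams = tune_streams(streams, loss_rate=0.0, rtt_ms=120.0)   # clean link: doubles to 8
streams = tune_streams(streams, loss_rate=0.02, rtt_ms=120.0)  # loss detected: halves to 4
print(streams)
```

A real system would learn these thresholds from observed behaviour rather than hard-coding them, which is where the machine learning comes in.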
As for public clouds, they have become more popular in spite of previous concerns about security. Trevor James’s article headline in TechTarget, for example, says ‘Healthcare’s public cloud adoption highlights [the] market’s maturity’. This is a market that often lags behind in the adoption of newer IT, and its perception of cloud computing has allegedly been no different. Yet James says that the use of the public cloud in this sector has accelerated. One of the providers serving it is AWS.
Over the last few years cloud providers have addressed their customers’ security concerns by adding a number of tools to permit the encryption of data both at rest and in transit. In the US, for example, once these issues were addressed it became possible to talk about uptime, disaster recovery and security, and to ask whether there is a need to develop in-house expertise to run first-class data centres. If the latter isn’t feasible, then it’s a good choice to outsource to a data centre that already has the skills and resources to help you to maintain both your organisation’s data security and uptime.
However, organisations still need to take a step back. Trossell warns: “Public cloud has seen a massive expansion lately, but companies are throwing out the rules, processes and procedures that have stood the test of time and saved many organisations from ruin.” He says backing up is about recovery point objectives (RPOs) and recovery time objectives (RTOs). RPOs refer to the amount of data that’s at risk: they consider the time between data protection events, and so the amount of data that could be lost during a disaster recovery process. RTOs refer to how quickly data can be restored during disaster recovery, to ensure your business remains operational.
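Both objectives reduce to simple arithmetic, which is worth making concrete. The figures below are illustrative assumptions, not measurements: a worst-case data loss equal to the gap between backups (the RPO test), and a restore time driven by data volume divided by effective WAN throughput (the RTO test).

```python
def meets_rpo(backup_interval_min: float, rpo_min: float) -> bool:
    """Worst-case data loss equals the gap between protection events,
    so the backup interval must not exceed the RPO."""
    return backup_interval_min <= rpo_min

def restore_time_hours(data_gb: float, throughput_mbps: float) -> float:
    """How long a full restore takes at a given effective WAN throughput."""
    seconds = (data_gb * 8 * 1000) / throughput_mbps   # GB -> megabits, then / Mbps
    return seconds / 3600

# Backing up every 15 minutes comfortably meets a one-hour RPO:
print(meets_rpo(backup_interval_min=15, rpo_min=60))   # True

# But 10 TB restored over a 1 Gbps link that latency has throttled
# to an effective 200 Mbps takes days, not hours:
print(round(restore_time_hours(10_000, 200), 1), "hours")  # 111.1 hours
```

This is why Trossell’s point matters: a copy in the cloud that takes four and a half days to pull back does not meet a same-day RTO, however safe the data is.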
“It is no good having the data in the cloud if you can’t recover it quickly enough to meet your RTO and RPO requirements”, says Trossell, before adding: “Too many organisations are turning a blind eye to this, and one copy in one place is not a level of protection that most auditors would agree with.” Using this situation as an example, he asks: “What would you do if AWS lost your data?”
Without having the ability to restore your data from several sources, your organisation would suffer downtime. This can in some cases lead to financial and reputational damage, which should be avoided at all costs. He therefore advises you to work with cloud providers that have service level agreements in place that guarantee that your data will always be recoverable whenever you need it.
Furthermore, with regard to whether the public cloud is superior to on-premise data centres, he says: “Clouds provide an invaluable service, but it is not the right answer for every circumstance. They are not efficient and cost effective for large long term use.” The concerns over cloud security haven’t gone away either, particularly because there is a shortage of people with cyber-security skills. Shadow IT is another issue holding back cloud adoption, even though some types of cloud are more secure than others. Organisations also equate the cloud with a loss of control over their IT. These factors therefore need to be considered when you decide what should or shouldn’t go into a public cloud, a private cloud or a hybrid cloud.
According to Sharon Gaudin’s article for Network Asia of 28th April 2019, ‘IT leaders say it’s hard to keep the cloud safe’, cloud adoption is slowing down rather than accelerating because of these security concerns. In spite of this, a recent survey by Intel of 2,000 IT professionals working in different countries reveals that 62 per cent of companies are storing sensitive customer data in the public cloud. You could therefore argue that the public cloud, and cloud adoption overall, creates a number of contradictions: on one hand recent trends have seen an uptake in the public cloud, but on the other hand the same issues still arise. These can have the effect of slowing down adoption, and not just of the public cloud.
Yet in the case of the AWS S3 outage, blame needs to lie with human error and not with the public cloud. Trossell therefore concludes that you should consider data acceleration solutions such as PORTrockIT, the DCS Awards’ ‘Data Centre ICT Networking Product of the Year’, to remove the human risk associated with, for example, the manual configuration of WANs. They can also help your organisation to maintain uptime by enabling real-time back-up at speed, mitigating the effects of latency and reducing packet loss. They can permit you to send encrypted data across a WAN too, and your data centres needn’t be located next to each other within the same circles of disruption, because they can be many miles apart.
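Why distance and latency matter so much here comes down to a well-known single-stream TCP limit: throughput is capped at window size divided by round-trip time, whatever the raw bandwidth of the link. The sketch below illustrates that arithmetic with assumed figures (a default 64 KB window and an 80 ms transatlantic path), not any product’s measured performance.

```python
def tcp_throughput_ceiling_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Classic single-stream limit: throughput <= window / RTT,
    regardless of the link's raw bandwidth."""
    bits = window_bytes * 8
    return bits / (rtt_ms / 1000) / 1_000_000

# A default 64 KB window across an 80 ms transatlantic path caps a
# single stream at well under 7 Mbps, even on a 10 Gbps link:
print(round(tcp_throughput_ceiling_mbps(64 * 1024, 80), 2), "Mbps")  # 6.55 Mbps
```

This is the gap that data acceleration techniques, whether larger windows, parallel streams or other mitigation, are trying to close, and it is why separating data centres by many miles normally costs so much back-up throughput.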
So, with these solutions in mind, the public cloud certainly isn’t ruled out for disaster recovery. It can still play an invaluable role. With data acceleration supported by machine learning you will be able to securely back up and restore your data to any cloud with improved RPOs and RTOs. You also won’t have to suffer downtime caused by a simple typographical error: the network will be configured for you by machine learning to mitigate data latency, network latency and packet loss.
Graham Jarvis, business and technology journalist
Image Credit: Everything Possible / Shutterstock