An electrical storm in Dublin has knocked cloud computing platforms run by both Microsoft and Amazon offline, highlighting one of the issues at the heart of the cloud computing revolution: reliability.
Amazon's Elastic Compute Cloud - also known as EC2 - and Microsoft's Business Productivity Online Suite were both taken out this weekend following a severe electrical storm in Dublin, which plays host to major data centres for a range of cloud computing services for the European market thanks to a cool climate and friendly tax breaks.
Amazon has been the most forthcoming about the outage, blaming the issue on a lightning strike that overloaded a transformer and caused an explosion which knocked out power to the company's facility. "Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators," the company explained to customers, "but the transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronises the backup generator plant, disabling some of them."
As a result, the backup generators couldn't be brought online until they were manually synchronised to the facility's phase, meaning that the facility was left without power and Amazon's customers without access to their European instances.
While Microsoft has been tight-lipped on the exact nature of its own Dublin outage, it has confirmed via its Online Services Twitter feed that "data centre power issues" disconnected its users for a period of some hours, suggesting a problem similar to Amazon's lightning strike.
The two large-scale outages have highlighted one of the biggest issues facing companies looking to transition to cloud computing: reliability. While a flexible, instance-based cloud computing model should, in theory, increase availability by allowing users to quickly spawn additional instances in different regions in response to load and localised issues, passing responsibility to a cloud computing provider leaves businesses with less overall control over how facilities are managed.
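To make the idea of spawning instances in a different region concrete, here is a minimal sketch of the decision a customer's own tooling might make when one region goes dark. It is illustrative only: the `Region` class, the `pick_region` function, the region names and the 80 per cent load threshold are all hypothetical, and nothing here calls a real Amazon or Microsoft API.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical view of one hosting region (not a real provider API)."""
    name: str
    healthy: bool   # False if the region is suffering an outage
    load: float     # fraction of capacity in use, from 0.0 to 1.0


def pick_region(regions, max_load=0.8):
    """Return the name of the least-loaded healthy region, or None."""
    # Ignore regions that are down or already close to capacity.
    candidates = [r for r in regions if r.healthy and r.load < max_load]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.load).name


# Assumed example data: the Dublin region is offline, two others are up.
regions = [
    Region("eu-dublin", healthy=False, load=0.0),
    Region("us-east", healthy=True, load=0.55),
    Region("us-west", healthy=True, load=0.30),
]

print(pick_region(regions))  # prints "us-west"
```

Even in this simplified form, the sketch shows the catch: the fallback only helps if the customer has already arranged for data and capacity to be available outside the affected region.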
Worse, as Amazon's outage has shown, a single problem can knock a whole swathe of systems offline: Amazon's Dublin data centre hosts thousands of EC2 instances for European companies, and its recent acquisition of 240,000 square feet of space for additional expansion will soon have the company hosting thousands more. A single occurrence such as a lightning strike, while rare, can potentially affect thousands of customers.
At the time of writing, Microsoft's BPOS is once again fully operational while Amazon's EC2 is still encountering problems. "We know many of you are anxiously waiting for your instances and volumes to become available and we want to give you more detail on why the recovery of the remaining instances and volumes is taking so long," Amazon told customers in a statement.
"Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process. We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours until the process is completed."