Amazon has offered a detailed explanation of a Christmas Eve outage that took down the services of clients like Netflix.
In a nutshell, a developer accidentally deleted some data from the Amazon Elastic Load Balancing Service (ELB). It took Amazon some time to figure that out, and when it did, an initial recovery effort failed, prolonging the Netflix outage.
"We want to apologize," Amazon said in a note posted on its AWS website. "We know how critical our services are to our customers' businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service."
Netflix users started reporting problems with the company's Watch Instantly streaming service on 24 December. The partial outage affected "some, but not all devices that can stream from Netflix," the company said at the time. Service was restored by Christmas Day.
Alas, it appears the downtime was the result of human error. According to Amazon's calculations, a developer deleted a portion of ELB data at 12:24 PST (20:24 GMT) on 24 December.
"This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer)," Amazon said.
"Unfortunately, the developer did not realize the mistake at the time," Amazon continued. "After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers."
Since Amazon didn't realise that the data had been deleted, its team initially focused on the API errors. "The team was puzzled as many APIs were succeeding (customers were able to create and manage new load balancers but not manage existing load balancers) and others were failing," Amazon said.
As a result, it took Amazon several hours to figure out that data had been deleted. When it did, around 17:00 PST (01:00 GMT), the team disabled several of the ELB control plane workflows and recovered some data. They tried to restore the system to the state it was in just before the data deletion, but that "failed to provide a usable snapshot of the data."
Amazon then "began slowly re-enabling the ELB service workflows and APIs," the majority of which were back by 8:15 PST (16:15 GMT) on 25 December. Everything was mostly back to normal around 10:30 PST (18:30 GMT) on Christmas Day, but "the team continued to closely monitor the service before communicating broadly that it was operating normally at 12:05 a.m.," Amazon said.
To make sure the same error doesn't happen again, Amazon will now require developers to get approval before deleting data.
"The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated," Amazon said. "This access was incorrectly set to be persistent rather than requiring a per access approval."
"We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval," Amazon continued. "This would have prevented the ELB state data from being deleted in this event."
Amazon said it has also modified its data recovery process to make a repeat of the failed restore less likely.
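The difference between persistent access and per-access approval, as Amazon describes it above, can be illustrated with a short sketch. This is not Amazon's implementation: the `AccessPolicy` class, the `delete_state` function and the dictionary standing in for ELB state data are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


class ApprovalRequired(Exception):
    """Raised when a destructive operation lacks a valid approval."""


@dataclass
class AccessPolicy:
    # Hypothetical policy object. persistent=True models the misconfiguration
    # described in the article: one approval stays valid indefinitely.
    persistent: bool
    grants: set = field(default_factory=set)

    def approve(self, developer: str) -> None:
        """Record an approval for this developer."""
        self.grants.add(developer)

    def consume(self, developer: str) -> None:
        """Check the approval; under per-access rules it is used up."""
        if developer not in self.grants:
            raise ApprovalRequired(f"{developer} needs approval for this access")
        if not self.persistent:
            self.grants.discard(developer)


def delete_state(store: dict, key: str, developer: str, policy: AccessPolicy) -> None:
    """Delete one entry from the (illustrative) state store, if approved."""
    policy.consume(developer)
    store.pop(key, None)


# Illustrative state data: load balancer name -> backend hosts.
state = {"lb-1": ["host-a"], "lb-2": ["host-b"]}
per_access = AccessPolicy(persistent=False)
per_access.approve("dev")

delete_state(state, "lb-1", "dev", per_access)  # allowed: approved once
try:
    delete_state(state, "lb-2", "dev", per_access)  # no fresh approval
    second_delete_blocked = False
except ApprovalRequired:
    second_delete_blocked = True
```

With `persistent=False`, each approval covers a single access, so the second deletion is refused until someone signs off again. With `persistent=True`, the same approval would have allowed both deletions, which is the kind of standing access Amazon says it has now reverted.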
Meanwhile, Netflix experienced some connection problems on its Netflix.com website this week, with some users seeing error messages when navigating to the site. Streaming was unaffected.