When a large portion of the internet went offline earlier this week, no one could have guessed that the cause would be a simple typo. Yet that's exactly what happened, according to Amazon's explanation of the incident.
A number of big websites (and an even greater number of smaller ones) went offline for five hours, among them Trello, Lonely Planet, Medium, IFTTT, Quora, and pretty much every site built on Wix.
All of these sites are hosted on Amazon Web Services, and that's where things went haywire. Apparently, Amazon employees in northern Virginia were investigating a problem that was slowing down the S3 billing process. As a fix, they intended to remove a small number of servers. Instead, they removed a large number of servers, and that's when our headaches started.
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” Amazon said. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
It took Amazon a while to get things back on track.
“Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”
Amazon apologized for the incident and promised to “do everything we can to learn from this event.”