Facebook has had a bad couple of days, with a brief outage on Wednesday followed by a more serious problem yesterday lasting over two and a half hours - the worst bout of downtime the company has had in over four years.
In a post on the company's blog, engineer Robert Johnson described the "worst outage we've had in over four years" as the result of "an unfortunate handling of an error condition."
Explaining that "an automated system for verifying configuration values ended up causing much more damage than it fixed," Johnson said that following a perfectly valid configuration change, "every single client saw [an] invalid value and attempted to fix it." With so many clients attempting to repair the 'damage' at once, and "because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by thousands of queries a second."
The problem was compounded by the error-handling code itself: clients that attempted to access the now-deleted key treated their failure to find it as yet another problem to be fixed, meaning that "even after the original problem had been fixed, the stream of queries continued."
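The dynamic Johnson describes can be illustrated with a small simulation - a hypothetical sketch, not Facebook's actual code, with all names invented for illustration. Every client that sees a value it considers invalid "repairs" it by querying the database cluster, but since the repair never yields a value the clients accept, the queries never stop:

```python
# Hypothetical sketch of the feedback loop described above.
# All class and variable names are invented for illustration.

VALID_VALUE = 42  # the value clients expect to see


class DatabaseCluster:
    """Stands in for the cluster that was overwhelmed by repair queries."""

    def __init__(self):
        self.query_count = 0

    def fetch_config(self):
        self.query_count += 1
        # The automated checker has already clobbered the good value,
        # so every query returns something the clients still reject.
        return None  # the "now-deleted" key


class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.cached_value = None  # looks invalid from the start

    def tick(self):
        # Each client independently notices the "bad" value and tries
        # to fix it by querying the database - the step that, multiplied
        # across every client, overwhelmed the cluster.
        if self.cached_value != VALID_VALUE:
            self.cached_value = self.cluster.fetch_config()


cluster = DatabaseCluster()
clients = [Client(cluster) for _ in range(1000)]

# Even after the original problem is fixed upstream, every cycle of
# client activity keeps hammering the cluster with repair queries.
for _ in range(5):
    for client in clients:
        client.tick()

print(cluster.query_count)  # 1000 clients x 5 cycles = 5000 queries
```

The key point of the sketch is that the loop is self-sustaining: no client ever reaches a state it considers valid, so the query rate is bounded only by the number of clients, which is why draining the traffic entirely was the only way out.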
The solution to the feedback loop that Facebook's databases found themselves in was one which will be familiar to anyone who watches The IT Crowd: they tried turning it off and back on again. Literally.
Johnson described the resolution as involving engineers having to "stop all traffic to this database cluster, which meant turning off the [Facebook] site."
Johnson has promised that the situation is now resolved for good, with the misbehaving configuration-checking system disabled until it can be rewritten to understand the concept of a feedback loop and prevent future outages.
Perhaps Facebook founder Mark Zuckerberg can spend some of his billions on a few more database engineers...