Remember the big outage that hit Azure, Office 365 and Dynamics users last week? Well, Microsoft has shed some more light on the underlying causes.
Publicly releasing its root cause analysis, Microsoft said three separate problems led to the downtime. The first two were introduced by a code update, which Microsoft had rolled out by Friday, November 16. The first was a latency issue in the communication between the front end of the Azure Active Directory Multi-Factor Authentication (MFA) service and its cache services.
The second was a race condition in processing responses from the MFA back-end server, while the third was triggered by the second. It left the MFA back end unable to process any requests from the front end, even though everything looked fine on Microsoft's monitoring.
“We sincerely apologize for the impact to affected customers,” Microsoft said in its explanation of the problem. “We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.”
Those steps include reviews of the company's update-deployment procedures, its monitoring services, and the containment process that helps stop issues from propagating to other data centres. Finally, Microsoft will update the communication process for the Service Health Dashboard and monitoring tools.
Image Credit: Dennizn / Shutterstock