Google promises more attention to patches after cloud outage

An outage which hit Google Compute Engine (GCE) over the weekend left its users with 43 minutes of downtime so Google has issued an apology, an explanation and a promise for better patch preparation in the future.

The new outage started on 7 March 2015 at 09:55 PST and was caused by "packet loss on egress network traffic."

The user impact of this intermittent packet loss depended on VM, zone, and user netblock, and ranged from no visible impact, to unusually slow responses, to timeouts attempting to contact the VM.

It took Google 43 minutes to remedy the damage. The company said virtual machines stayed up, but it was a botched patch which messed things up.

"The root cause of the packet loss was a configuration change introduced to the network stack designed to provide greater isolation between VMs and projects by capping the traffic volume allowed by an individual VM.

"The configuration change had been tested prior to deployment to production without incident. However as it was introduced into the production environment it affected some VMs in an unexpected manner,“ said Google.

The outage was minor, with no severe damages, but Google says it will pay more attention in the future.

"Google engineers are investigating why the prior testing of the change did not accurately predict the performance of the isolation mechanism in production. Future changes will not be applied to production until the test suite has been improved to demonstrate parity with behavior observed in production during this incident.

"Additionally, Google engineers are immediately amending the rollout protocol for network configuration changes so that future production changes will be applied to a small fraction of VMs at a time, reducing the exposure in the event of undetected behavior,“ the company stated.