O2 outage latest in string of major IT infrastructure failures

O2 customers faced a second day of outages after voice calls, texts, and data went down early in the afternoon on Wednesday (11 July). The outage also affected GiffGaff and Tesco Mobile subscribers, as both services run on O2’s infrastructure.

O2, which bills itself as an “award-winning network” and uses the tagline, “We’re better connected,” is the UK’s second-largest provider, with some 22 million customers, hundreds of thousands of whom are believed to have been without mobile services since yesterday.

In its first admission of the outage, O2 tweeted at 13:23 on Wednesday, “There’s a problem affecting some customers’ mobile service. Engineers are working to restore full service asap.”

The company went on to say that the problem was being dealt with “as a priority.”

But the outage seems to have worsened overnight. Some customers who were able to use their phones yesterday evening woke up to find they were unable to place calls, send texts, or use data. In our office, an iPhone 3G and an HTC One X on O2 have both experienced on-and-off outages since yesterday evening, but service to those devices appears to have been restored as of midday on Thursday.

O2 has said the problem is “due to a fault with one of our network systems, which has meant some mobile phone numbers are not registering correctly on our network.”

The operator has so far shared few details about why only some customers have been affected, saying only that the pattern was random. That is presumably why some O2 subscribers have been unable to use their mobile services while others in their immediate vicinity experienced no problems.

“We can confirm that our 2G network service has now been restored. Customers who were affected should now be able to make and receive calls,” O2 said in a statement early on Thursday morning. “Our 3G service is starting to restore and customers should expect to see a gradual return of data services as the day progresses. Customers affected may wish to try switching their mobile phones off and on as service returns.”

Indeed, switching off 3G or restarting a handset altogether managed to kickstart some customers’ access to 2G voice and text services, though it has been inconsistent and did not work for all O2 subscribers.

O2 said customers could expect a “gradual return” of data services over the course of the day and said it expected “full service to return to all affected customers this afternoon.” At 13:39, the company said that “tests show that 2G and 3G services are now back for all affected customers.”

But with the outage dragging into its second day, and some indications that the problem worsened rather than improved overnight, the incident raises questions about the soundness of large-scale IT infrastructure.

“We have extensive continuity plans which we brought into effect to restore service as quickly as we could,” the company wrote in a blog post, though that restoration of service has evidently not been quick enough for customers left unable to use their devices.

In late June, some O2 users were unable to send text messages due to a network failure, which the company has said is not related to this latest incident.

Also in June, a software glitch caused NatWest databases to go down, preventing customers from receiving or processing payments and locking them out of their balances. Weeks later, some NatWest, RBS, and Ulster Bank customers continue to be affected, apparently because of a backlog of transactions caused by the original error.

In October 2011, a global outage caused by a “core switch failure” left BlackBerry users without access to Internet services for three days.

Taken together, these incidents point to a far-reaching problem within major companies’ IT systems. While outages and system failures are inevitable, the consistent failure to address and resolve them promptly is alarming.

Considering that many aspects of day-to-day life rely on the smooth functioning of large-scale IT systems, companies like O2 and NatWest must re-evaluate the infrastructures they have in place. As was highlighted by the NatWest outage, many run on decades-old systems, onto which new functions are continuously added, making them increasingly complex and more difficult to manage.

Because such systems are often based on different components working in sync, a lot of time can be wasted trying to identify the original cause of an outage. That time would be better spent developing and implementing solutions quickly, especially when customers are left without access to essential services like banking and telecommunications.

Companies also tend to use their social media channels to send out apologies and make vague promises that a solution is forthcoming, but otherwise leave their customers out of the loop with regard to the details of the problem.

That has added to customers’ frustrations, as evidenced by the thousands of angry tweets and Facebook posts in response to the O2 outage.

O2 has not indicated how, or whether, it plans to compensate customers for the outage. In previous cases, compensation has taken the form of free apps, but it’s likely many customers would prefer to see some form of compensation reflected in their monthly bills.

There’s no word yet on whether this network failure will be sufficient grounds for users to leave their contracts without paying a penalty, though the two-day outage certainly is an incentive to switch to another provider. Some Twitter users questioned whether O2’s failure to provide mobile services since yesterday afternoon could legally be considered a breach of contract.

With the Olympics fast approaching and the influx of visitors to London certain to overburden mobile networks, the incident is not a good omen for mobile services in the weeks to come.