Google outage: A broken cloud!

On May 14th, Google experienced one of its largest public outages. Not only was the Google homepage down, but the outage also affected most of Google's services, including Gmail, Google Maps, Google Calendar, Google Apps and AdSense. Google has taken responsibility for the problem. Urs Hoelzle, SVP of Operations at Google, released the following statement:

Imagine if you were trying to fly from New York to San Francisco, but your plane was routed through an airport in Asia. And a bunch of other planes were sent that way too, so your flight was backed up and your journey took much longer than expected. That’s basically what happened to some of our users today for about an hour, starting at 7:48 am Pacific time.

An error in one of our systems caused us to direct some of our web traffic through Asia, which created a traffic jam. As a result, about 14% of our users experienced slow services or even interruptions. We’ve been working hard to make our services ultrafast and “always on,” so it’s especially embarrassing when a glitch like this one happens. We’re very sorry that it happened, and you can be sure that we’ll be working even harder to make sure that a similar problem won’t happen again. All planes are back on schedule now.

This is as bad as it seems from my perspective and could cast a real shadow on Google’s cloud strategy, i.e. Google Apps. The outage comes the day after Google announced an unprecedented partnership with Valeo, which is planning to deploy Google Apps to the company’s entire office-based workforce. We’re talking 30,000 users! This is a bit ironic, and frightening to other companies planning to move to the cloud. Additionally, what’s the financial impact on websites built on Google APIs? Can you imagine the impact on online services relying on AdWords for their paid online advertising? Scary.

After the search engine problem in January and the Gmail breakdown in February, this is another wake-up call for companies relying entirely on Google’s ‘free’ services. It looks like there is no such thing as a free lunch (TINSTAAFL!) after all …

I’d be interested in the technical details of the actual problem, which raises some questions: What type of error in their system affected web traffic routing? Did they try to implement new features for some of their services without testing them properly in a staging environment? I have a lot of respect for the innovation in testing that Google brings to the software industry. It’s very refreshing and brings a lot of motivation to everyone in this field. I sure hope their testing was not at fault on that one …

