Tuesday, September 24, 2013

Google Explains Outages As "Dual Network Failure"

Google users may have experienced delays with their email, and possibly with other Google-related services, yesterday. Things seem to be back to normal today, and we now have a bit of an explanation as to what happened. Google has blamed the outage on a "dual network failure", a rare event in which two redundant network paths fail at the same time, though what caused each failure is as yet unclear.

Google issued an official apology yesterday via the official Gmail Blog. The company explained that 29% of messages were delayed, by an average of just 2.6 seconds, while about 1.5% of messages were delayed by more than two hours. Messages started piling up at 5:54 a.m. PST; much of the backlog was cleared by 1:00 p.m. PST and the remainder shortly before 4:00 p.m. PST, about ten hours after the delays began. The company went into further detail about what had happened in this statement:

On September 24th, many Gmail users received an unwelcome surprise: some of their messages were arriving slowly, and some of their attachments were unavailable. We’d like to start by apologizing—we realize that our users rely on Gmail to be always available and always fast, and for several hours we didn’t deliver. We have analyzed what happened, and we’ll tell you about it below. In addition, we’re taking several steps to prevent a recurrence.

The message delivery delays were triggered by a dual network failure. This is a very rare event in which two separate, redundant network paths both stop working at the same time. The two network failures were unrelated, but in combination they reduced Gmail’s capacity to deliver messages to users, and beginning at 5:54 a.m. PST messages started piling up. Google’s automated monitoring alerted the Gmail engineering team within minutes, and they began investigating immediately. Together with the networking team, the Gmail team restored some of the network capacity that was lost and worked to repurpose additional capacity, clearing much of accumulated message backlog by 1:00 p.m. PST and the remainder by shortly before 4:00 p.m. PST.

The impact on users’ Gmail experience varied widely. Most messages were unaffected—71% of messages had no delay, and of the remaining 29%, the average delivery delay was just 2.6 seconds. However, about 1.5% of messages were delayed more than two hours. Users who attempted to download large attachments on affected messages encountered errors. Throughout the event, Gmail remained otherwise available — users could log in, read messages which had been delivered, send mail, and access other features.

What’s next? Our top priority is ensuring that Gmail users get the experience they expect: fast, highly-available email, anytime they want it. We're taking steps to ensure that there is sufficient network capacity, including backup capacity for Gmail, even in the event of a rare dual network failure. We also plan to make changes to make Gmail message delivery more resilient to a network capacity shortfall in the unlikely event that one occurs in the future. Finally, we’re updating our internal practices so that we can more quickly and effectively respond to network issues. We’ll be working on all of these improvements and more over the next few weeks—even including this event, Gmail remains well above 99.9% available, and we intend to keep it that way!
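To put the "99.9% available" figure in perspective, here is a minimal Python sketch of how much downtime such a target allows. It assumes availability is measured purely as uptime over a 365-day year; Google's statement does not say how it calculates the number, and the function name is only illustrative.

```python
def downtime_budget_hours(availability_pct: float,
                          period_hours: float = 365 * 24) -> float:
    """Hours of unavailability allowed by a time-based availability target.

    Assumption: availability is the fraction of the period the service is up.
    This is an illustration, not Google's stated method.
    """
    return (1 - availability_pct / 100) * period_hours


if __name__ == "__main__":
    # Compare a few common availability targets over one year.
    for target in (99.0, 99.9, 99.99):
        hours = downtime_budget_hours(target)
        print(f"{target}% available -> {hours:.2f} hours of downtime per year")
```

Under that assumption, a 99.9% target leaves room for about 8.76 hours of complete downtime per year. Yesterday's incident was a partial slowdown rather than a full outage, so it would not necessarily count hour for hour against such a budget.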

For most users a Gmail outage is a big deal and a major concern. End users and companies alike are affected, and these delays can grind a company that relies on the service to a halt. The disruption was limited to a small portion of messages and to services such as Google Docs and Google presentations, and the company has confirmed that there should be no loss of data, messages or otherwise. All services are now up and running smoothly.
