Reddit Amazon

Amazon says sorry - in 5700 words - for the cloud cock-up

Author

By The Drum Team, Editorial

May 1, 2011 | 3 min read

When a giant stumbles, the ground shakes. As our reliance on cloud computing grows, last week's misstep by Amazon in the US was a major event. Now the company has explained what went wrong and how it hopes to avoid another event like it in future.

The company said a network configuration change caused the shutdown, which one commentator called "a major stumble" for the service, and set out what it plans to do to prevent similar technical problems in future.

"We want to apologise," Amazon said on its website. "We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services."

Last week's super-glitch began on April 21 and lasted up to three days for some customers. Sites such as Foursquare, Quora and Reddit were hit. Eight days after the incident, Amazon was still restoring some of the computers brought down.

Amazon said the primary outage occurred in a data center near Dulles Airport in northern Virginia when a configuration change to shift traffic was "performed incorrectly." The network change was part of a normal scaling activity and was intended to upgrade the capacity of the primary network in the data center.

"Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another," Amazon said. An automated error-recovery mechanism then went out of control, and many computers became "stuck" in recovery mode.

Amazon's Web-based services allow users to run programs and store information remotely. Users access the applications over the Internet, which spares them the cost of operating the equipment themselves.

Amazon promised it will learn from last week's problems and do whatever it takes to prevent a similar event happening in the future. "As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes," Amazon said.