Lessons learned from Amazon S3 Outage

By: Benjamin Roussey
From: TechGenix

February 28, 2017 will be remembered as a rude reality check for enterprise IT worldwide. The reason: Amazon Web Services’ Simple Storage Service (S3) was down for nearly four hours. Forrester Research cloud analyst Dave Bartoletti has famously compared AWS S3 to “air” in the “cloud” context; that’s how pervasive it is. The downtime affected thousands of websites, with costs both quantifiable and unquantifiable.

How did it happen?

The S3 team was trying to get to the root cause of a problem in the S3 billing system. During this debugging exercise, a team member ran a command intended to remove a small number of servers from one S3 subsystem. One of the command's inputs was entered incorrectly, however, and a much larger set of servers was taken offline than intended.
