A little over two weeks ago, large sections of the internet ground to a halt. Many sites, from Netflix and Spotify to Pinterest and Buzzfeed, were unresponsive during a four-hour outage of Amazon Web Services’ (AWS) S3, the Simple Storage Service. For two of those four hours, AWS couldn’t update its service status page from “Service is operating normally” because the status page itself depended on the storage service’s operation. Should you trust your data with AWS? Why shouldn’t we start calling S3 the “Sometimes Storage Service?” How could this event be a good thing?
According to AWS’ statement on the service disruption, the problem was limited to the Northern Virginia or “US-EAST-1” region. This is the default region, one of AWS’ oldest, and the one in which Amazon rolls out new features, so it sees disproportionately heavy use compared to other regions. An operator error caused the service’s indexing subsystem to drop below a critical number of servers, and because the service had grown so much, restarting it took significantly longer than expected.
What will they do about it?
Amazon has identified a number of key changes to prevent such disasters in the future. Among them, the tool that removed the capacity has been modified to remove capacity more slowly, with safeguards that prevent capacity from dropping below critical thresholds. Other operational tools are being given similar safety checks. The service health dashboard has been changed so that it no longer depends on S3 itself to report an S3 failure. Changes have also been made to improve recovery time in the event of a catastrophic failure.
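The safeguard idea is worth making concrete. The sketch below is hypothetical, not Amazon’s actual tooling: a capacity-removal helper that refuses to drop a fleet below a critical floor and only removes servers in small batches, so a fat-fingered request can’t take out most of a fleet at once. The constants and function name are illustrative assumptions.

```python
# Hypothetical sketch of a capacity-removal safeguard: refuse to take servers
# out of service if doing so would drop the fleet below a critical minimum,
# and remove capacity gradually rather than all at once.

CRITICAL_MINIMUM = 100   # hypothetical floor for the indexing fleet
MAX_BATCH = 5            # remove at most this many servers per step

def plan_removal(active_servers: int, requested: int) -> int:
    """Return how many servers may safely be removed in one step."""
    if active_servers <= CRITICAL_MINIMUM:
        return 0  # already at or below the floor: refuse entirely
    headroom = active_servers - CRITICAL_MINIMUM
    return min(requested, headroom, MAX_BATCH)

# A mistaken request to remove 80 servers from a 103-server fleet is
# clamped to the 3 servers of available headroom.
```

The point is that the tool, not the operator, enforces the invariant: even a badly formed command can only shrink capacity slowly and never past the floor.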
What should you do about it?
For cloud customers using S3 for critical storage operations, we recommend a feature called cross-region replication. The default service already writes to multiple datacenters, or “Availability Zones” in AWS parlance, within a single region; cross-region replication would allow operations to continue in a secondary region in the event of a region-wide failure.
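As a rough illustration, cross-region replication is configured per bucket with a replication rule pointing at a destination bucket in another region. The sketch below builds such a configuration for use with boto3’s `put_bucket_replication`; the bucket names and IAM role ARN are placeholders, and both buckets must already exist, in different regions, with versioning enabled.

```python
# Sketch of an S3 cross-region replication configuration, in the shape
# expected by boto3's put_bucket_replication. Bucket name and role ARN
# are placeholder values.

def replication_config(replica_bucket: str, role_arn: str) -> dict:
    """Build a rule that replicates every object to replica_bucket."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to perform the replication
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Prefix": "",  # empty prefix: replicate all objects
                "Destination": {"Bucket": f"arn:aws:s3:::{replica_bucket}"},
            }
        ],
    }

# With AWS credentials configured, this would be applied via boto3:
#   import boto3
#   boto3.client("s3").put_bucket_replication(
#       Bucket="my-primary-bucket",
#       ReplicationConfiguration=replication_config(
#           "my-replica-bucket",
#           "arn:aws:iam::123456789012:role/s3-replication-role"),
#   )
```

Replication is asynchronous, so a failover may still miss the most recent writes; it protects against region loss, not against every lost write.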
It’s also worth keeping in mind that big service disruptions aren’t unique to AWS. Customers who are interested in disaster-proofing their systems should consider a cross-cloud storage solution, one that might include S3 even if their primary cloud provider isn’t AWS.
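A minimal sketch of what cross-cloud storage means in practice: every write goes to two independent providers, and reads fall back to the secondary when the primary is unavailable. The `Backend` class below is an in-memory stand-in for a real provider SDK (such as S3 or another vendor’s object store), not an actual API.

```python
# Hypothetical cross-cloud storage sketch: duplicate writes across two
# providers, fail over reads when the primary is down.

class Backend:
    """In-memory stand-in for one cloud provider's object store."""
    def __init__(self):
        self.objects = {}
        self.available = True

    def put(self, key, data):
        if not self.available:
            raise ConnectionError("backend unavailable")
        self.objects[key] = data

    def get(self, key):
        if not self.available:
            raise ConnectionError("backend unavailable")
        return self.objects[key]

class CrossCloudStore:
    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary

    def put(self, key, data):
        # Write to both providers; a production system would retry or
        # queue a failed write rather than lose the duplicate copy.
        self.primary.put(key, data)
        self.secondary.put(key, data)

    def get(self, key):
        try:
            return self.primary.get(key)
        except ConnectionError:
            return self.secondary.get(key)  # fail over to the second cloud
```

The design choice here is deliberate simplicity: dual writes cost more and add latency, but an outage at either provider leaves reads unaffected.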
Finally, automate every process you can in a critical system, to reduce the risk of human error.
What didn’t happen?
Ultimately, the only customer data lost in the failure came from writes made without a failover during the first two hours of the outage, before the index service was stabilized. Not bad for a critical error in a service that has "north of three to four trillion pieces of data stored in it," according to Dave Bartoletti, an analyst with Forrester.
But seriously, a good thing?
What this event demonstrates is that even the most resilient of services is subject to failure. Steps have been taken to limit the impact of operator error. System architects have studied this event carefully and learned many lessons. Overall, this brief outage will result in better, more resilient systems in the future.