Amazon S3 outage update: finger trouble
Amazon have been refreshingly frank in a release explaining the recent US East Coast outage of their S3 storage service, which left a large number of major online brands with reduced or no storage availability. Seems finger trouble with a command input during routine maintenance took out far more servers than intended, and recovery then took longer than anyone expected.
"an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
"S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes …"
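The failure mode described above is a removal command that accepted an over-large input without question. A minimal sketch of the sort of guard that would catch it (the function name, fleet sizes, and thresholds here are all illustrative assumptions, not AWS's actual tooling) might look like:

```python
# Hypothetical guard on a capacity-removal command: refuse any request that
# would take out too large a slice of the fleet in one go, or leave the
# subsystem below its minimum healthy size. All numbers are invented.

MIN_FLEET = 8                # assumed minimum servers the subsystem needs
MAX_REMOVAL_FRACTION = 0.10  # assumed cap: at most 10% of the fleet per command

def servers_to_remove(fleet_size: int, requested: int) -> int:
    """Validate a removal request and return the count actually allowed.

    Raises ValueError if the request breaches either safety limit, forcing
    the operator to stage the removal in smaller steps instead.
    """
    if requested < 0:
        raise ValueError("removal count must be non-negative")
    cap = int(fleet_size * MAX_REMOVAL_FRACTION)
    if requested > cap:
        raise ValueError(
            f"refusing to remove {requested} servers; limit is {cap} per command"
        )
    if fleet_size - requested < MIN_FLEET:
        raise ValueError(
            f"removal would leave {fleet_size - requested} servers, "
            f"below the minimum of {MIN_FLEET}"
        )
    return requested
```

The point of the sketch is that the check sits in the tool, not the playbook: a mistyped input fails loudly before any capacity is touched, rather than after.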
"the process of restarting [these] services and running the necessary safety checks to validate the integrity of the metadata took longer than expected … By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally."
Clearly even the best of the best can't be perfect, which underlines the need for diligence in contingency and failover planning in any cloud migration or installation. Top marks to AWS for their full and open information release.