The blackout lasted several hours and affected dozens of notable sites, including Foursquare, Quora, Moby and Reddit. Many large EC2 users ended up losing valuable business data. Chartbeat told its customers it had lost 11 hours of historical data, calling it “unrecoverable.”
Amazon attributed the problems to a power failure but gave no further details on what caused it.
In an email to affected customers, Amazon wrote: “A few days ago we sent you an email letting you know we were working on the recovery of an inconsistent data snapshot of one or more of your Amazon EBS volumes. We’re sorry, but ultimately, our efforts to recover the volume manually were unsuccessful. The hardware failed in such a way that we could not forensically restore the data.”
According to reports, Amazon’s data center in Ashburn, Virginia, lost power for about 30 minutes.
“We can confirm network connectivity issues for some EC2 instances in a single Availability Zone in the US-EAST-1 region,” Amazon reported in its Service Health Dashboard. “Customers may be experiencing impaired read/write access to their EBS (Elastic Block Storage) volumes. New instance launches are also delayed. We are applying mitigations to address the connectivity issues … and connectivity is beginning to recover.”
Amazon added, “We know how important our services are to our customers’ businesses, and we will endeavor to learn from this event and use it to drive improvement in our services.”
It is hard to believe that a cloud service as reliable as EC2 does not maintain a foolproof backup system. Amazon EC2, Rackspace, Google Apps and Microsoft Azure have all had their fair share of outages in the last 18 months, and some of them have been major failures (Amazon’s April 2011 outage lasted 47 hours for some customers).
A recent report from the International Working Group on Cloud Computing Resiliency (IWGCR) says customers have suffered 568 hours of downtime across 13 well-known cloud services since 2007, at an economic cost of $71.7 million.
This puts a big question mark over the reliability of the cloud and challenges the popular perception that the cloud is infallible. A system, however redundant, is not immune to failure, human error, software bugs and the like.
Recoverability becomes an important issue when such incidents occur. Many organizations do not take restoring their systems seriously and never test a restore before an incident happens. Back up data regularly and store it away from your primary cloud service provider (CSP). For example, you might use an Amazon EC2 instance to back up a Rackspace cloud installation. This mitigates a single point of failure.
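The backup-and-test advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: local directories stand in for storage at a second provider, and the function names (`sha256_of`, `backup_to_secondary`, `restore_drill`) are hypothetical, not part of any provider's API. The point it shows is that a backup is verified when it is written and again when it is restored.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_to_secondary(source: Path, secondary_dir: Path) -> Path:
    """Copy `source` into `secondary_dir` (a stand-in for a second provider)
    and confirm the copy matches the original byte for byte."""
    secondary_dir.mkdir(parents=True, exist_ok=True)
    copy = secondary_dir / source.name
    shutil.copy2(source, copy)
    if sha256_of(copy) != sha256_of(source):
        raise IOError(f"backup of {source} does not match the original")
    return copy


def restore_drill(backup: Path, restore_dir: Path, expected_digest: str) -> bool:
    """Restore a backup to a scratch location and check it against the
    digest recorded when the backup was taken."""
    restore_dir.mkdir(parents=True, exist_ok=True)
    restored = restore_dir / backup.name
    shutil.copy2(backup, restored)
    return sha256_of(restored) == expected_digest
```

Running `restore_drill` on a schedule, rather than only after an outage, is what turns a backup into a tested recovery path.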
Create a disaster management plan that also covers preparing public relations and customer service staff, establishing quality control processes, and putting an executive-level contingency plan in place, so that panic does not set in while the business is being secured.
In addition, distributing a critical application geographically across multiple data centers can protect it against a network failure confined to a single data center.
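As a rough sketch of what geographic distribution means for the application, the snippet below picks the first data center that passes a health check. The function name `pick_data_center`, the location labels and the health probe are all assumptions made for illustration; a real deployment would more likely rely on DNS failover or a load balancer.

```python
from typing import Callable, Optional, Sequence


def pick_data_center(
    centers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy data center in preference order, or None."""
    for center in centers:
        try:
            if is_healthy(center):
                return center
        except OSError:
            # A failed probe (timeout, refused connection) counts as unhealthy.
            continue
    return None


# Hypothetical example: the preferred location is down, so traffic
# falls through to the next one in the list.
down = {"virginia"}
chosen = pick_data_center(
    ["virginia", "oregon", "ireland"],
    lambda name: name not in down,
)
```

With only one entry in the list, an outage in that location leaves nothing to fail over to, which is the single point of failure described earlier.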