THE LIGHTNING STRIKE that took down many of Amazon's hosted and on-demand services wasn't lightning, according to a new statement from the bookseller and cloud vendor.
Earlier this month we reported that a lightning strike at a power utility in Ireland had knocked out power to Amazon's services, which in turn knocked over services at customers including - we can't think why anyone would want to use that mobile app - Foursquare. This, we learn, was wrong.
"What we have is preliminary, but we want to share it with you," read the earlier explanation on the Amazon Web Services status pages. "We understand at this point that a lightning strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire."
However, it now seems that something else caused the explosion and resulting fire.
"The service disruption began at 10:41 AM PDT on August 7th when our utility provider suffered a failure of a 110kV 10 megawatt transformer. This failure resulted in a total loss of electricity supply to all of their customers connected to this transformer, including a significant portion of the affected AWS Availability Zone," explained the firm in its latest advice on the subject.
"The initial fault diagnosis from our utility provider indicated that a lightning strike caused the transformer to fail. The utility provider now believes it was not a lightning strike, and is continuing to investigate root cause."
With confidence in the cloud currently mixed at best, incidents like this are not particularly reassuring. We can understand how a lightning bolt might hit a utility and hurt its infrastructure, causing services to fail and fires to start, but placing the blame elsewhere raises a lot of questions. Was it gremlins, for example?
According to Amazon, what usually happens when a power utility fails is that the datacentre load is picked up by backup generators, which are synchronised by Programmable Logic Controllers (PLCs). However, in this case one PLC did not complete its connection to a range of generators and as a result the datacentre power failed.
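For the curious, the failover Amazon describes can be sketched in a few lines of Python. This is only a toy model under our own assumptions: the `Generator` class, the function names, the `plc_completes` flag and the kilowatt figures are all hypothetical, and nothing here reflects Amazon's actual control logic.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    capacity_kw: float       # illustrative figure, not a real rating
    connected: bool = False  # True once the PLC has synchronised it


def plc_bring_online(generators, plc_completes):
    """Hypothetical PLC step: connect the generators unless the PLC
    fails to complete its task, in which case they stay offline."""
    if not plc_completes:
        return False
    for gen in generators:
        gen.connected = True
    return True


def available_power_kw(utility_kw, generators):
    """Utility supply plus whatever generator capacity is connected."""
    return utility_kw + sum(g.capacity_kw for g in generators if g.connected)


def zone_stays_up(load_kw, utility_kw, generators, plc_completes):
    """True if the supply covers the load after any failover attempt."""
    if utility_kw < load_kw:
        plc_bring_online(generators, plc_completes)
    return available_power_kw(utility_kw, generators) >= load_kw
```

With the utility supply at zero, `zone_stays_up` returns `True` when the PLC finishes its job and the generators cover the load, and `False` when it does not, which is roughly the situation Amazon describes.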
Amazon did not mention it, but abnormal PLC behaviour can be a symptom of a Stuxnet attack, as that industrial systems worm accesses the controllers' data blocks in its earlier stages. Had that happened here, it would perhaps be understandable that lightning was cited as the reason. For now, however, no one is making that suggestion.
"We currently believe (supported by all observations of the state and behavior of this PLC) that a large ground fault detected by the PLC caused it to fail to complete its task. We are working with our supplier and performing further analysis of the device involved to confirm," it added.
"With no utility power, and backup generators for a large portion of this Availability Zone disabled, there was insufficient power for all of the servers in the Availability Zone to continue operating."
Amazon has promised to try to mitigate the risk of further outages through the use of more backups and closer work with its vendors, and to keep its users better informed about problems at its sites. Which is good, especially when you recall that some of the affected big hitters in the so-called lightning strike felt that they were in as much darkness as the Amazon service.
Amazon said that it will add more redundancy and more isolation to its PLCs, in order to prevent failures from spreading, and will work with its vendors to create a new "environmentally isolated" backup PLC. Amazon added that it will deploy this addition as soon as possible.
Load balancing will also be improved, and recovery times, which the company described as "long", will be "drastically shortened". µ