ONLINE BOOKSELLER Amazon has explained how a load-balancing issue at its facilities took down the Netflix on-demand TV service on Christmas Eve.
Netflix went down in parts of the US on Christmas Eve, a time when people like to watch films.
Netflix admitted to the problems on its Twitter feed, explaining that as soon as it became aware of the issue its engineers started working on a fix.
"We're sorry for the Christmas Eve outage. Terrible timing! Engineers are working on it now. Stay tuned to @netflixhelps for updates." - Netflix US (@netflix), December 25, 2012
Judging by the time between that tweet and another one that said the problem had been fixed, Netflix service was interrupted for about four hours.
Netflix had called on Amazon's web services engineers to help it, and Amazon has issued its own explanation of the problem and how it affected Netflix and its other customers.
"While the service disruption only affected applications using the Amazon Elastic Load Balancing Service (ELB) service (and only a fraction of the load balancers were affected), the impacted load balancers saw significant impact for a prolonged period of time," it said.
"A portion of the ELB state data was logically deleted [by] a maintenance process that was inadvertently run against the production ELB state data... Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers."
Amazon added that, while this situation continued, some of its customers "began to experience performance issues with their running load balancers".
It said that it has changed the ELB service to prevent the same thing from happening again: permission must now be granted before data can be deleted. It also apologised to all of its business users.
"We want to apologize," it added at the close of its lengthy statement. "We know how critical our services are to our customers' businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service." µ