As too many of us know, Netflix crashed and burned on Christmas Eve. The outage lasted from 1 pm to after midnight Pacific time. Once again, the problems occurred at Amazon Web Services (AWS). Specifically, the “Elastic Load Balancer” started generating errors at 1:50 pm Pacific time, according to the AWS Service Health Dashboard. To Amazon’s great credit, they are transparent about problems that occur and keep their status log reasonably up to date with progress reports. But, let’s face it, a service outage of ten hours is unacceptable. Wired.com has pretty good coverage, as does Techcrunch.com. Others have covered the basic story, so all I’ll try to do here is fill in some details and add some editorial commentary. But first, some humor.
Best Comments
The best comment so far is on Wired.com:
“Daren_Gray • North Virginia. Maybe the NSA needed the server space to make sure their holiday naughty/nice list does not omit a single American. Somewhere over Whoville a drone waits for a kill order. Netflix will have to wait.”
“wow netflix depending on AWS for streaming while Amazon VOD humming awesomely. what an irony!”
Strategic Blunder by Netflix
As I’ve written elsewhere, Amazon is now a Netflix competitor. A one-year membership in Amazon Prime for about $80 gives you access to Amazon streaming video, with many titles available for free. You also get the famous free shipping on products shipped by Amazon (third-party sellers are excluded from this deal). I wrote that article in August 2012, shortly after our household signed up for Amazon Prime. Netflix has had months to develop a backup strategy. And this is far from the first AWS outage. But Netflix has apparently done nothing. Reed Hastings may be a great technologist and venture capitalist, but he’s terrible at strategy. By contrast, Amazon seems to understand a bit about game theory. What better time to demonstrate Netflix’s vulnerability than Christmas Eve?
I’m certain that all parties will deny any intent. And senior management may well not know what actually happened. Possibly some junior people took it upon themselves to crash Netflix. But the timing and nature of the outage make the probability of coincidence very low. Netflix seems to have been just about the only service affected, and on one of its busiest days of the year.
My lovely wife has proposed an alternative hypothesis. Her guess is that Netflix and AWS were confronted with a new situation on Christmas Eve: multiple streaming requests from the same household. After all, why should the kids watch White Christmas when Bad Santa is also available? (Disclaimer: I have no idea whether either movie is available on either Amazon or Netflix.) Multiple simultaneous requests from a single household may well have degraded service. If so, it seems logical that the failure would show up in the load balancing system. You, of course, are free to believe either of these stories, or to make up your own!
Outage Details
Below is the entire incident report from the AWS Service Health Dashboard:
“Dec 24, 1:50 PM PST We are currently investigating an issue regarding Elastic Load Balancer.
Dec 24, 2:18 PM PST We are investigating increased error rates for Elastic Load Balancing API calls in the US-EAST-1 region. This issue does not affect traffic on running load balancers, but the API may report that instances are out of service for some calls.
Dec 24, 3:02 PM PST We can confirm increased error rates for Elastic Load Balancing API calls in the US-EAST-1 region and continue to work towards resolution. This issue does not affect traffic on running load balancers, but the API may report outdated or incomplete information such as not reporting listeners or marking the ELB is out of service for calls to DescribeLoadBalancers.
Dec 24, 4:15 PM PST We continue to experience increased errors for Elastic Load Balancing API calls in the US-EAST-1 region and continue to work towards resolution. This issue is impacting traffic for existing ELBs. New ELBs are provisioning correctly. The ELB APIs continue to report outdated or incomplete information such as not reporting listeners or marking the ELB as out of service for calls to DescribeLoadBalancers.
Dec 24, 5:49 PM PST We continue to work on resolving issues with the Elastic Load Balancing Service in the US-EAST-1 region. Traffic for some ELBs are currently experiencing significant levels of traffic loss.
Dec 24, 8:46 PM PST We continue to work on resolving issues with the Elastic Load Balancing Service in the US-EAST-1 region. The issues are affecting both existing and newly created ELBs.
Dec 24, 10:20 PM PST We continue to work on resolving issues with the Elastic Load Balancing Service in the US-EAST-1 region. The issues are affecting both existing and newly created ELBs. While we are in the process of recovering the service, API calls for creating, modifying, or deleting ELBs will be disabled. We expect to restore full API service once the issues are resolved. We apologize for the impact of these issues on our customers.
Dec 25, 4:36 AM PST We continue to work on resolving issues with the Elastic Load Balancing Service in the US-EAST-1 region. These issues are affecting updates to both existing and newly created ELBs. A subset of ELBs that made configuration changes or changes to registered instances during the event are experiencing errors or receiving reduced traffic. We continue to work toward a full recovery of the service. We apologize for the continued impact.
Dec 25, 9:41 AM PST The Elastic Load Balancing Service APIs have been fully re-enabled, and load balancers are functioning and provisioning correctly. Customers may experience slightly elevated throttling and latency with ELB related APIs while the system completes recovery operations. We are continuing to monitor and audit all load balancers and APIs
Dec 25, 12:02 PM PST The Elastic Load Balancing Service is operating normally. We have confirmed all load balancers are functioning correctly, and API latencies and error rates for ELB related APIs have returned to normal.”
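The timestamps in the report make it easy to check how long AWS itself kept the incident open. Below is a minimal Python sketch of that arithmetic. The timestamps are copied from the entries above; the year (2012) and the helper names `parse` and `hours_minutes` are my own assumptions for illustration. Note that this measures the AWS incident window on the dashboard, which is not necessarily the same as the period during which Netflix viewers were affected.

```python
from datetime import datetime

# The dashboard entries do not show a year; 2012 is an assumption.
YEAR = 2012
FMT = "%Y %b %d, %I:%M %p"


def parse(stamp: str) -> datetime:
    """Parse a dashboard stamp such as 'Dec 24, 1:50 PM' (all entries are PST)."""
    return datetime.strptime(f"{YEAR} {stamp}", FMT)


def hours_minutes(start: datetime, end: datetime) -> tuple[int, int]:
    """Return the elapsed time between two stamps as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


# Milestones taken from the incident report quoted above.
first_report = parse("Dec 24, 1:50 PM")    # "currently investigating an issue"
traffic_impact = parse("Dec 24, 4:15 PM")  # "impacting traffic for existing ELBs"
apis_reenabled = parse("Dec 25, 9:41 AM")  # "APIs have been fully re-enabled"
all_clear = parse("Dec 25, 12:02 PM")      # "operating normally"

for label, end in [
    ("first report -> traffic impact acknowledged", traffic_impact),
    ("first report -> APIs re-enabled", apis_reenabled),
    ("first report -> all clear", all_clear),
]:
    h, m = hours_minutes(first_report, end)
    print(f"{label}: {h} h {m} min")
```

By this count, the dashboard shows nearly two and a half hours between the first entry and the first acknowledgment that traffic was affected, and a little over 22 hours between the first entry and the all-clear.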
Conclusion
Will we ever know the truth? Maybe. I’m certain that Ars Technica reporters are all over this story. I offer these thoughts to give possible guidance in pursuing the story.
In the meantime, Merry Christmas! Get back to partying.