Today, Amazon suffered a major outage of their EC2 cloud services based out of their Virginia data center. There are plenty of other blogs with more technical details on what specifically took place. Many cloud pundits are pointing to the outage as another example of the immaturity of cloud-based infrastructures. I think this is overblown.
In past missives, I outlined examples of earlier outages:
- Oct 14, 2009 Microsoft Sidekick Data Loss
- Jun 29, 2009 Rackspace Data Center Outage
- May 14, 2009 Google Outage
- Mar 21, 2009 Carbonite Storage Failure
While the Amazon dust-up is fresh, infrastructure outages are something to expect. We expect them in our own data centers. So are we back to holding cloud providers to a double standard? In the case of Amazon, is the expectation that a higher class of service is delivered for a fraction of the price compared with internally provided data center services? Really?
Outages happen all the time in cloud data centers. Most of them are never noticed by users, let alone significantly affect them. Why? In most cases, simple tiers of redundancy are used to lower the statistical probability that a failure turns into a visible outage. Yes, I said lower the statistical probability…not eliminate it. That’s all that redundancy does.
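To put rough numbers on that point, here is a minimal sketch. It assumes each redundant copy fails independently with the same probability, which real systems rarely satisfy (shared power, networks, and software bugs correlate failures). The 1% figure and the function name are made up for illustration; they are not measured Amazon numbers.

```python
def combined_failure_probability(p: float, n: int) -> float:
    """Probability that all n independent copies are down at the same time.

    p: assumed failure probability of a single copy (hypothetical value)
    n: number of redundant copies
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Independent events: multiply the individual probabilities together.
    return p ** n


# Illustrative only: one, two, and three copies at an assumed 1% each.
for copies in (1, 2, 3):
    print(f"{copies} copies: {combined_failure_probability(0.01, copies):.6%}")
```

Each extra copy shrinks the number but never takes it to zero, and correlated failures make the real figure worse than this idealized one.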
Then why did these outages happen in such a large and public cloud offering? At some point, one has to make a business decision as to how much redundancy is valuable. It’s easy to take pot shots from outside a cloud provider, looking inward. But these same challenges exist within traditional data centers too. And not all redundancy is infrastructure-based. Application architectures must consider the risks too.
I submit that it is time to consider a different approach to how we provide services. I’m not referring to IaaS services. I’m referring to application-level services (SaaS in many ways). Our application architectures have relied on redundant infrastructure at the most basic levels for some time. That includes networks, servers, storage and so on.
This may sound like a pipe dream, but application awareness needs to move much higher in the OSI stack. If you think about it, SaaS applications do this to some degree. Do you know which data center is serving data when visiting http://www.google.com/? No. But when you put that in your browser, it works. Why is that? Does that mean that Google doesn’t have infrastructure failures? Do they have application failures? Of course they do. But they’ve architected their applications and infrastructure to be resilient to failures.
In the case of the Amazon failure today, if the client applications were architected to leverage multiple Amazon data centers, would they have experienced an outage? While such a design may not have eliminated the outage entirely for clients, it most likely would have reduced the impact. From the initial reports, the outage appears to be isolated to Amazon’s Virginia data center.
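As a minimal sketch of what leveraging multiple data centers could look like from the client side, the hypothetical helper below tries a list of regions in order and returns the first successful response. The names `call_with_failover`, `AllRegionsFailed`, and `fake_request` are my own inventions, not any Amazon API, and the region labels are only illustrative. The sketch also sidesteps the data-replication problem entirely.

```python
from typing import Callable, List, Sequence


class AllRegionsFailed(Exception):
    """Raised when every region in the list failed to answer."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors: List[str] = []
    for region in regions:
        try:
            return request(region)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{region}: {exc}")
    # Nothing answered, so report every failure we collected.
    raise AllRegionsFailed("; ".join(errors))


def fake_request(region: str) -> str:
    """Stand-in for a real network call; pretends the Virginia region is down."""
    if region == "us-east-1":
        raise ConnectionError("region unavailable")
    return f"served from {region}"


if __name__ == "__main__":
    print(call_with_failover(["us-east-1", "us-west-1"], fake_request))
```

The application, not the infrastructure, decides what to do when one location stops answering. A real client would also need timeouts, health checks, and a plan for keeping data consistent across locations.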
Some will argue that data sets are the Achilles’ heel that prevents this type of redundant application architecture. I would propose that maybe we just haven’t figured out how to deal with it yet.
Bottom line: Failures are a reality in private data centers and in the cloud. We need to stop fearing failure and start expecting it. How we prepare our services and applications to respond to failure is what needs to change.