Amazon Outage Concerns Are Overblown

Today, Amazon suffered a major outage of their EC2 cloud services based out of their Virginia data center. There are plenty of other blogs with more technical details on what specifically took place. Many cloud pundits are pointing to the outage as another example of the immaturity of cloud-based infrastructures. I think this is overblown.

In past missives, I outlined examples of earlier outages:

  • Oct 14, 2009   Microsoft Sidekick Data Loss
  • Jun 29, 2009   Rackspace Data Center Outage
  • May 14, 2009   Google Outage
  • Mar 21, 2009   Carbonite Storage Failure

While the dust-up over Amazon is fresh, infrastructure outages are something to expect. We expect them in our own data centers. So are we back to holding cloud providers to a double standard? In the case of Amazon, is the expectation that it delivers a higher class of service for a fraction of the price of internally provided data center services? Really?

Outages happen all the time in cloud data centers. Most of those outages are never observed and never significantly impact users. Why? In most cases, simple tiers of redundancy are used to lower the statistical probability that an outage will reach users. Yes, I said lower the statistical probability…not eliminate it. That’s all that redundancy does.
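To make the "lower, not eliminate" point concrete, here is a minimal sketch of the arithmetic. It assumes that each redundant copy fails independently with the same probability, which real data centers only approximate (shared power, network or software faults break that assumption). The function name and the 1% figure are illustrative assumptions, not measurements from any provider.

```python
def combined_outage_probability(p_single: float, replicas: int) -> float:
    """Chance that every replica is down at the same time.

    Assumes failures are independent and equally likely (an idealization).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas must fail together, so the probabilities multiply.
    return p_single ** replicas


if __name__ == "__main__":
    # Hypothetical 1% chance that any single copy is unavailable.
    for n in (1, 2, 3):
        print(n, "replica(s):", combined_outage_probability(0.01, n))
```

Each added tier shrinks the number, but for any nonzero single-copy failure rate the result never reaches zero.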

Then why did these outages happen in such a large and public cloud offering? At some point, one has to make a business decision as to how much redundancy is valuable. It’s easy to take pot shots from outside a cloud provider, looking in. But these same challenges exist within traditional data centers too. And not all redundancy is infrastructure-based. Application architectures must consider the risks too.

I submit that it is time to consider a different approach to how we provide services. I’m not referring to IaaS services. I’m referring to application-level services (SaaS in many ways). Our application architectures have relied on redundant infrastructure at the most basic levels for some time. That includes networks, servers, storage and so on.

This may sound like a pipe dream, but application awareness needs to move much higher in the OSI stack. If you think about it, SaaS applications do this to some degree. Do you know which data center is serving data when you visit a SaaS site? No. But when you put its address in your browser, it works. Why is that? Does that mean that Google doesn’t have infrastructure failures? Do they have application failures? Of course they do. But they’ve architected their applications and infrastructure to be resilient to failures.

In the case of the Amazon failure today, if the client applications were architected to leverage multiple Amazon data centers, would they have experienced an outage? While it may not have eliminated the entire outage for clients, it most likely would have reduced the impact. From the initial reports, the outage appears to be isolated to Amazon’s Virginia data center.
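As a rough illustration of what "architected to leverage multiple data centers" could look like at the application level, here is a minimal failover sketch. Everything in it is hypothetical: the region names, the handler functions and the `call_with_failover` helper stand in for real service calls and are not part of any Amazon API. It also sidesteps the harder problem of keeping data in sync between sites.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request."""


def call_with_failover(
    regions: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each region in order and return (region_name, response)
    from the first one that answers."""
    errors: List[str] = []
    for name, handler in regions:
        try:
            return name, handler()
        except Exception as exc:  # any failure moves us to the next region
            errors.append(f"{name}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Hypothetical handlers simulating one failed site and one healthy site.
def virginia() -> str:
    raise ConnectionError("data center unreachable")


def alternate() -> str:
    return "200 OK"


if __name__ == "__main__":
    region, response = call_with_failover(
        [("virginia", virginia), ("alternate", alternate)]
    )
    print(f"served from {region}: {response}")
```

The design choice here is deliberately simple: the application, not the infrastructure, decides where to send the request next. If every region fails, the caller still gets an error, which matches the point that redundancy reduces impact rather than removing it.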

Some will argue that data sets are the Achilles’ heel and prevent this type of redundant application architecture. I would propose that maybe we just haven’t figured out how to deal with them yet.

Bottom line: Failures are a reality in private data centers and in the cloud. We need to stop fearing failure and start expecting it. How we prepare our services and applications to respond to failure is what needs to change.

Tim Crawford is ranked as one of the Top 100 Most Influential Chief Information Technology Officers (#4), Top 100 Most Social CIOs (#7), Top 20 People Most Retweeted by IT Leaders (#5) and Top 100 Cloud Experts and Influencers. Tim is a strategic CIO & advisor who works with large global enterprise organizations across a number of industries including financial services, healthcare, major airlines and high-tech. Tim’s work differentiates and catapults organizations in transformative ways through the use of technology as a strategic lever. Tim takes a provocative but pragmatic approach to the intersection of business and technology. Tim is an internationally renowned CIO thought leader on topics including Digital Transformation, Cloud Computing, Data Analytics and Internet of Things (IoT). Tim has served as CIO and in other senior IT roles with global organizations such as Konica Minolta/All Covered, Stanford University, Knight-Ridder, Philips Electronics and National Semiconductor. Tim is also the host of the CIO In The Know (CIOitk) podcast. CIOitk is a weekly podcast that interviews CIOs on the top issues facing CIOs today. Tim holds an MBA in International Business with Honors from Golden Gate University Ageno School of Business and a Bachelor of Science degree in Computer Information Systems from Golden Gate University.

5 comments on “Amazon Outage Concerns Are Overblown”

  5. Tim, why not avoid the major providers (like AWS, Azure, or SoftChoice) from the beginning? When they go down, a good majority of the internet goes with them. Why not stand apart from the bunch and stay online when everyone else is down? There are tons of other providers around the world that can offer comparable services at a better price point with better service. Why don’t we talk about diversifying our providers and redundancy, rather than just redundancy? Would like your input on this.

