Rackspace, a popular hosting provider in the cloud, suffered a significant outage on June 29, 2009. Apparently, a power interruption caused their Dallas (DFW-Grapevine) data center to go offline. Rackspace has posted a copy of the incident report (DFWIncidentReport6-29-2009.pdf).
As a consequence, Rackspace expects to issue service credits to customers in the range of $2.5 million to $3.5 million. In response, Rackspace filed a Form 8-K with the SEC.
The Rackspace outage is bound to bring questions about the stability of services in the cloud. But should it? The outage that Rackspace (and their customers) experienced could have happened to any data center owner. So, why is Rackspace being held to a different standard?
Whenever a provider fails to deliver a service, it can affect any business that relies on that service. Just as a traditional IT organization would not rely on a single data center, we should not expect the services we leverage in the cloud to rely on one either.
When working in the cloud, a change to the traditional method of redundancy is warranted. Cloud providers could potentially provide geo-diversity for customers. But the customer should really consider how to provide redundancy across providers. That way, if a failure happens with one provider, a second provider is there to pick up the demand.
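To make the idea of a second provider picking up the demand concrete, here is a minimal sketch in Python. The provider names and handler functions are hypothetical stand-ins, not any real provider's API; a real setup would also need health checks, data replication, and DNS or load-balancer changes.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails to serve the request."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], str]]]
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, response)
    from the first one that succeeds."""
    errors: List[str] = []
    for name, handler in providers:
        try:
            return name, handler()
        except Exception as exc:  # any failure means: move on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical handlers standing in for calls to two separate providers.
def provider_a() -> str:
    raise ConnectionError("data center offline")  # simulated outage


def provider_b() -> str:
    return "200 OK"


used, response = call_with_failover(
    [("provider_a", provider_a), ("provider_b", provider_b)]
)
print(used, response)
```

The ordering of the list is the design choice here: the first entry acts as the primary, and every later entry is a fallback that is only exercised when the ones before it fail.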
In some ways, redundancy across providers potentially eliminates the value of an SLA (Service Level Agreement). I will discuss SLA value further in a future blog post.
This redundancy does come at a cost, whether in a cloud-based or traditional model. A risk assessment and cost-benefit analysis should be performed to better understand the options and the path to take.
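A rough sketch of what such a cost-benefit comparison might look like follows. It assumes the two providers fail independently, which is an optimistic simplification, and every figure you pass in (availability, cost per hour of downtime, annual cost of the second provider) is your own estimate rather than data from the Rackspace incident.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR * cost_per_hour


def combined_availability(first: float, second: float) -> float:
    """Availability of two providers in parallel, assuming independent failures:
    the service is down only when both are down at the same time."""
    return 1.0 - (1.0 - first) * (1.0 - second)


def net_benefit_of_redundancy(
    first: float, second: float, cost_per_hour: float, redundancy_cost: float
) -> float:
    """Yearly downtime cost avoided by adding a second provider,
    minus the yearly cost of running that second provider."""
    single = expected_downtime_cost(first, cost_per_hour)
    redundant = expected_downtime_cost(
        combined_availability(first, second), cost_per_hour
    )
    return single - redundant - redundancy_cost


# Placeholder inputs for illustration only; substitute your own estimates.
benefit = net_benefit_of_redundancy(
    first=0.99, second=0.99, cost_per_hour=1000.0, redundancy_cost=50000.0
)
print(f"Net yearly benefit of a second provider: {benefit:.2f}")
```

A positive result suggests the second provider pays for itself under those assumptions; a negative one suggests the risk may be cheaper to accept than to mitigate.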
Most likely a “Time for a Change in Strategy”
I suspect early adopters of Cloud computing have not fully embraced the ability to leverage services such as Amazon’s availability zones, which place instances on physically distinct, independent infrastructure. If you simply move your application to the Cloud, it is still vulnerable to a single point of failure.
Cloud adoption should start with strategic planning of software architecture such that the end product is resilient to a single infrastructure outage. That may mean more than one provider, but it doesn’t have to. It is a new way of thinking for many in IT.
Ah, strategic planning 🙂
As a side note, SLAs also need to change to reflect the nature of Cloud. Well, that is, if you’re not just having an outside IT house simply host your application. But that is not really ‘Cloud’ computing from my perspective.