The recent major reboots of cloud-based infrastructure by Amazon and Rackspace has resurfaced the question about cloud instability. Days before the reboot, both Amazon and Rackspace noted that the reboots were due to a vulnerability with Zen. Barb Darrow of Gigaom covered this in detail here. Ironically, all of this came less than a week before the action took place, leaving many flat-footed.
Outages are not new
First, let us admit that outages (and reboots) are not unique to cloud-based infrastructure. Traditional corporate data centers face unplanned outages and regular system reboots. For Microsoft-based infrastructure, reboots may happen monthly due to security patch updates. Back in April 2011, I wrote a piece Amazon Outage Concerns are Overblown. Amazon had just endured another outage of their Virginia data center that very day. In response, customers and observers took shots at Amazon. However, is Amazon’s outage really the problem? In the piece, I suggested that customers were misunderstanding the problem when they think about cloud-based infrastructure services.
Cloud expectations are misguided
As with the piece back in 2011, the expectations of cloud-based infrastructure have not changed much for enterprise customers. The expectation has been (and still is) that cloud-based infrastructure is resilient just like that within the corporate data center. The truth is very different. There are exceptions, but the majority of cloud-based infrastructure is not built for hardware resiliency. That’s by design. The expectation by service providers is that application/ service resiliency rests further up the stack when you move to cloud. That is very different than traditional application architectures found in the corporate data center where infrastructure provides the resiliency.
Time to expect failure in the cloud
Like many of the web-scale applications using cloud-based infrastructure today, enterprise applications need to rethink their architecture. If the assumption is that infrastructure will fail, how will that impact architectural decisions? When leveraging cloud-based infrastructure services from Amazon or Rackspace, this paradigm plays out well. If you lose the infrastructure, the application keeps humming away. Take out a data center, and users are still not impacted. Are we there yet? Nowhere close. But that is the direction we must take.
Getting from here to there
Hypothetically, if an application were built with the expectation of infrastructure failure, the recent failures would not have impacted the delivery to the user. Going further, imagine if the application could withstand a full data center outage and/ or a core intercontinental undersea fiber cut. If the expectation were for complete infrastructure failure, then the results would be quite different. Unfortunately, the reality is just not there…yet.
The vast majority of enterprise applications were never designed for cloud. Therefore, they need to be tweaked, re-architected or worse, completely rewritten. There’s a real cost to do so! Just because the application could be moved to cloud does not mean the economics are there to support it. Each application needs to be evaluated individually.
Building the counterargument
Some may say that this whole argument is hogwash. So, let us take a look at the alternative. If one does build cloud-based infrastructure to be resilient like that of its corporate brethren, it would result in a very expensive venture at a minimum. Infrastructure is expensive. Back in the 1970’s a company called Tandem Computers had a solution to this with their NonStop system. In the 1990’s, the Tandem NonStop Himalayan class systems were all the rage…if you could afford them. NonStop was particularly interesting for financial services organizations that 1) could not afford the downtime and 2) had the money to afford the system. Consequently, Tandem was acquired by Compaq who in turn was acquired by HP. NonStop is now owned by HP as part of their Integrity NonStop products. Aside from Tandem’s solutions, even with all of the infrastructure redundancy, many are still just a data center outage away of impacting an application. The bottom line is: It is impossible to build a 100% resilient infrastructure. That is true either due to 1) it is cost prohibitive and 2) becomes a statistical probability problem. For many, the value comes down to the statistic probably of an outage compared with the protections taken.
Making the move
Over the past five years or so, companies have looked at the economics to build redundancy (and resiliency) at the infrastructure layer. The net result is a renewed focus on moving away from infrastructure resiliency and toward low-cost hardware. The thinking is: infrastructure is expensive and resiliency needs to move up the stack. The challenge is changing the paradigm of how application redundancy is handled by developers of corporate applications.
Good entry, Tim. When I was managing software development resources, we often commented that hardware was cheaper than programmers. That’s part of the reason we got into the situation we are in today: it was less expensive to just throw redundant hardware (and by implication, facilities infrastructure, network connectivity, and power) at the resilience problem than to distract the scarce software engineering resources. The “business” partners also always argued (correctly, I might add) that they needed new features, not more resilient software, and so we were rarely able to complete projects that made the systems more reliable. Not much has changed, but maybe the wobbly cloud platforms, with their compelling financial advantages, will change that.