Failing to Plan is Planning to Fail


There has been quite a bit of press coverage regarding the outage at Amazon this week.  Much of this coverage has focused on how the outage brought down many popular web sites such as Reddit, Quora and Foursquare.  The point that seems to be getting missed is that by failing to plan, these companies planned to fail.  Technology professionals know that data centers fail all the time.  Data center failures are a fact of life that must be planned for and dealt with.  While it’s true that Amazon did not live up to expectations, they did not actually violate their service level agreement (SLA), as Gartner pointed out.

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was EBS and RDS which weren’t, and neither of those services have SLAs.
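The downtime figure quoted above is simple arithmetic, and it can be checked with a minimal sketch like the one below. The function name and the 8,760-hour year (365 days, no leap day) are assumptions made for illustration; they are not taken from Amazon's SLA text.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(sla_percent: float, hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime a period can absorb without breaching the SLA."""
    return (1 - sla_percent / 100) * hours


# The 99.95% figure is the one quoted in the paragraph above.
print(f"{allowed_downtime_hours(99.95):.2f} hours per year")
```

Under these assumptions a 99.95% commitment leaves roughly 4.4 hours a year, in line with the "about 4.5 hours" cited above.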

Amazon is an infrastructure as a service (IaaS) provider, which means they provide the hardware and low-level software used to support Cloud-based applications.  The beauty of the IaaS model is that you can design and build an application any way you see fit based on your individual requirements.  If your application requires high availability and you choose not to address that requirement in your design, then you have introduced risk into your environment.  We call this design shortcoming “technical debt.”  I have blogged extensively on the subject, and those posts can be referenced if additional background is needed.

The principal amount of this technical debt is the cost of implementing the required redundancy.  The interest is the cost of the additional risk associated with not having appropriate levels of redundancy.  There are several ways to assign dollars to risk but none of them are perfect.  The most straightforward approach is to estimate the cost of a failure and then multiply by the probability it will occur.  Let’s say Foursquare estimates that the cost of their website going down for 24 hours is one million dollars.  Based on the optimal design and implementation of the application, the probability of such an outage is 0.5 percent a year.  However, because Foursquare took some design shortcuts, the probability increased to 4 percent a year.  The interest on this technical debt can be calculated as follows.

Incremental Risk: 4% - 0.5% = 3.5%
Cost of Failure: $1,000,000
Interest: 3.5% x $1,000,000 = $35,000 per year

Should Foursquare have implemented the redundancy needed to achieve the required uptime?  The answer depends on the principal of the debt.  If the cost of providing redundancy is $5,000 then it would be a very easy decision.  If you invest $5,000 and eliminate $35,000 of risk a year, you get back seven dollars for every dollar invested, a 700% return.  It would be a different story if the redundancy cost $100,000.  That would recover only 35% of the outlay in the first year, which might not be a good investment.  Not many investors would sign up to get back 35 cents for every dollar they invest.
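The interest and return figures above can be reproduced with a short sketch. The function names are hypothetical, the dollar amounts and probabilities are the illustrative ones from this post rather than real Foursquare data, and the comparison assumes a one-year horizon.

```python
def annual_interest(cost_of_failure: float, p_actual: float, p_optimal: float) -> float:
    """Expected yearly cost of the extra risk: incremental probability times cost of failure."""
    return cost_of_failure * (p_actual - p_optimal)


def return_ratio(interest: float, principal: float) -> float:
    """Risk eliminated per dollar spent on redundancy (the ratio used in this post)."""
    return interest / principal


def net_roi(interest: float, principal: float) -> float:
    """Conventional ROI: (gain - cost) / cost, over the same one-year horizon."""
    return (interest - principal) / principal


# Illustrative inputs from the post: $1,000,000 failure cost, 4% vs. 0.5% yearly probability.
interest = annual_interest(1_000_000, p_actual=0.04, p_optimal=0.005)

for principal in (5_000, 100_000):
    print(
        f"principal ${principal:,}: "
        f"return ratio {return_ratio(interest, principal):.0%}, "
        f"net ROI {net_roi(interest, principal):.0%}"
    )
```

Note that the percentages quoted above are return ratios. Under the stricter net definition, the $5,000 option comes out at 600% and the $100,000 option at a 65% loss in the first year, which leads to the same decision either way.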

The fact is that Cloud redundancy is cheap and Foursquare would have achieved an astronomical ROI by implementing it.  The culprit in this outage is not the Cloud.  It is technical debt.  Blaming the Cloud for these outages would be like blaming your hard drive manufacturer for lost data when the drive fails.  Everyone knows hard drives fail and you should always have a backup.  If your hard drive happened to be backed up by one of the many tools that leverage Amazon’s S3 service, that’s not technical debt.  That’s just bad luck!
