I read Joel On Software regularly, and I recommend it highly. Whether you’re a “software programming is an engineering discipline!” wonk or a “software programming is art!” evangelist, Joel has interesting things to say about writing software, deploying software, and occasionally stuff that software programmers never learn about in school, hardly ever think about at work, and really ought to think about more.
This post is an interesting analysis of the concept of SLAs, uptime, and what it means to be a service provider. He makes some really good points.
One of the problems with the IT field in general is that people who like to create metrics and targets have a tendency to write for Harvard Business Review. I’m not knocking HBR, per se, but it does have one problem: on the whole, Harvard Business Review is written and edited to appeal to people who read Harvard Business Review. That audience is composed mostly of executives at large corporations. Six nines uptime is a very difficult target, and not one that small to medium organizations are going to be aiming at very often.
From Continuity Central (linked in Joel’s post):
Table 1: Uptime and Maximum Downtime
| Level | Uptime | Maximum Downtime per Year |
| --- | --- | --- |
| Six nines | 99.9999% | 31.5 seconds |
| Five nines | 99.999% | 5 minutes 35 seconds |
| Four nines | 99.99% | 52 minutes 33 seconds |
| Three nines | 99.9% | 8 hours 46 minutes |
| Two nines | 99.0% | 87 hours 36 minutes |
| One nine | 90.0% | 36 days 12 hours |
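The downtime figures are just the leftover fraction of a year, so they are easy to recompute. Here is a minimal sketch, assuming a 365-day year; the function names `max_downtime_seconds` and `format_duration` are mine, not anything from Continuity Central. One thing worth noting: worked out this way, five nines comes to about 5 minutes 15 seconds rather than the 5 minutes 35 seconds quoted in the table, while the other rows agree after rounding.

```python
# Assumption: a 365-day year, which matches the rounding in the quoted table.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def max_downtime_seconds(uptime_percent: float) -> float:
    """Seconds per year a service may be down at the given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)


def format_duration(seconds: float) -> str:
    """Render a duration as days/hours/minutes/seconds, rounded to the second."""
    total = round(seconds)
    days, rem = divmod(total, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, secs = divmod(rem, 60)
    # Keep only the non-zero units so short durations stay readable.
    parts = [
        f"{n}{unit}"
        for n, unit in ((days, "d"), (hours, "h"), (minutes, "m"), (secs, "s"))
        if n
    ]
    return " ".join(parts) or "0s"


if __name__ == "__main__":
    for uptime in (99.9999, 99.999, 99.99, 99.9, 99.0, 90.0):
        print(f"{uptime}% -> {format_duration(max_downtime_seconds(uptime))}")
```

Each extra nine divides the allowed downtime by ten, which is why the jump from three nines to four takes you from most of a working day down to under an hour per year.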
“Five or Six Nines” kinds of numbers, and the amount of money required to attain them, are important to Fortune 500 companies, national infrastructure, and the military. They’re not really feasible for small or medium organizations, and really ought not to be considered. For a company that does a million dollars of business a year, a day-long outage isn’t really that big of a deal. The roughly $2,700 in sales that passes through on an average day (a million dollars divided by 365) isn’t a hell of a lot in comparison to the money they would spend moving from three nines to four nines. They’re probably not going to lose all of that money anyway, since a million-dollar-a-year company probably isn’t going to lose the entire transaction. The customer will try to log into the web site, fail, and call their customer service rep, who they probably know and talk to regularly anyway.
Compare this with a company that does hundreds of millions or billions of dollars of sales in a year, like Dell Computer. Here the customer tries to log into the web site, fails, and most likely isn’t going to attack Dell’s sales phone tree; instead they’ll move on to hp.com or toshiba.com. The customer might not end up buying elsewhere, but when you’re talking big numbers, small percentages start making a major difference.
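To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It assumes sales are spread evenly across a 365-day year, and the `lost_fraction` parameter (the share of interrupted sales that never come back) is a hypothetical knob of my own; the 10% and $1 billion figures below are illustrative, not real data for any company.

```python
# Assumption: revenue arrives evenly over a 365-day year.
HOURS_PER_YEAR = 365 * 24


def revenue_at_risk(annual_revenue: float, outage_hours: float,
                    lost_fraction: float = 1.0) -> float:
    """Estimate the sales lost to an outage.

    lost_fraction is a hypothetical share of interrupted sales that are
    never recovered (1.0 means every sale during the outage is lost).
    """
    return annual_revenue * (outage_hours / HOURS_PER_YEAR) * lost_fraction


if __name__ == "__main__":
    # Small company: $1M a year, one full day down, worst case.
    print(f"${revenue_at_risk(1_000_000, 24):,.0f}")
    # Same outage if most customers simply phone their rep (illustrative 10%).
    print(f"${revenue_at_risk(1_000_000, 24, lost_fraction=0.10):,.0f}")
    # Hypothetical $1B-a-year retailer losing the same illustrative 10%.
    print(f"${revenue_at_risk(1_000_000_000, 24, lost_fraction=0.10):,.0f}")
```

The same small percentage is pocket change in the first case and serious money in the last, which is the whole argument for spending on extra nines only once the business is big enough.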
For small to medium organizations, the loss isn’t in the actual downtime, it’s in the perception of the customer. Joel’s solution here is aimed squarely at solving that problem, and it’s a really nice idea.