One of the most important (and frequently overlooked) jobs of an organization’s IT house is planning for disasters. Power outages, loss of internet connectivity, fire, flood, hardware failure, security intrusions, hurricanes, tornadoes, earthquakes, power spikes, pandemics (yes, pandemics!): the list goes on. In my career I’ve been part of a few disaster plans. Fairly often, the plans don’t survive first contact with the enemy, the insidious Mr. Murphy. Sometimes this is because Mr. Murphy is an extremely clever fellow, but often it is because the plans confuse Disaster Prevention, Disaster Mitigation, and Disaster Recovery, and exception scenarios therefore fall through the cracks.
A real Disaster Plan should begin with an inventory of services and commodities – everything that your organization requires to function. For the purposes of the plan, a “service” is an engineered, technical system providing a set of functionality to a set of users, and a “commodity” is any basic utility required for operations. Examples of services include email and internet service, whereas commodities are facility-level utilities such as power, air conditioning, and water. Assign a level of criticality to each service, based upon its impact upon the organization in the event of failure. My own classification system is: Enterprise Critical, Mission Critical, Enterprise Enabling, Mission Enabling, and Cosmetic. An Enterprise Critical service is one upon which the entire organization relies completely in order to continue operations. A Mission Critical service is one upon which an organizational or business unit relies completely in order to continue operations. An Enterprise Enabling service is one upon which the entire organization relies heavily, and a Mission Enabling service is one upon which an organizational or business unit relies heavily. Finally, Cosmetic services are ones which provide utility to an organization, but which the organization can do without and still function with no significant loss of efficiency.
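As a rough sketch, the five levels can be read as the answers to two questions: who relies on the service (the whole organization or a single unit), and how much (completely, heavily, or only as a convenience). The `classify` function and the `scope` and `reliance` labels below are my own illustrative names, not part of any standard.

```python
from enum import Enum


class Criticality(Enum):
    ENTERPRISE_CRITICAL = "Enterprise Critical"
    MISSION_CRITICAL = "Mission Critical"
    ENTERPRISE_ENABLING = "Enterprise Enabling"
    MISSION_ENABLING = "Mission Enabling"
    COSMETIC = "Cosmetic"


def classify(scope: str, reliance: str) -> Criticality:
    """Map two answers onto a criticality level.

    scope:    "enterprise" (entire organization) or "unit" (one business unit)
    reliance: "complete", "heavy", or "minor"
    These labels are hypothetical, chosen only for this sketch.
    """
    if reliance == "minor":
        # The organization can carry on without the service.
        return Criticality.COSMETIC
    table = {
        ("enterprise", "complete"): Criticality.ENTERPRISE_CRITICAL,
        ("unit", "complete"): Criticality.MISSION_CRITICAL,
        ("enterprise", "heavy"): Criticality.ENTERPRISE_ENABLING,
        ("unit", "heavy"): Criticality.MISSION_ENABLING,
    }
    return table[(scope, reliance)]


# Example inventory entries (made up for illustration).
inventory = {
    "email": classify("enterprise", "complete"),
    "payroll system": classify("unit", "complete"),
    "lobby display": classify("enterprise", "minor"),
}
```

Writing the inventory down in a form like this makes it easy to re-run the classification whenever a service is re-engineered.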
Each service should be fully analyzed for dependencies, both upon other services and upon commodities. Oftentimes, this analysis is not performed, or is performed once and never reviewed, and as a result unplanned exception scenarios can crop up. For example, your original analysis may determine that your email service is Enterprise Critical -> if your email service goes off-line, your organization is heavily crippled. Your centralized file service, on the other hand, may originally be regarded as Enterprise Enabling, because your organization uses offline file storage or some similar technology and thus people can still access their files when the file server is down. Time passes, however, and your IT budget leads you to re-engineer your central file service. You move to a SAN solution, which provides file service not only to your desktop clients, but enables you to increase your mail storage and simultaneously cut some costs by reducing your backup complexity… you just need to remove the local storage on your email server and make it a SAN client. Suddenly, your file service has become Enterprise Critical -> if your SAN goes down, your mail service goes down as well.
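The SAN example can be checked mechanically. Here is a minimal sketch that raises every dependency to at least the level of whatever depends on it. The numeric ranking follows the order in which I listed the levels, which is an assumption (you may rank Mission Critical and Enterprise Enabling differently), and the service names are illustrative.

```python
from enum import IntEnum


class Criticality(IntEnum):
    # Ranking follows the order the levels are listed in; an assumption.
    COSMETIC = 1
    MISSION_ENABLING = 2
    ENTERPRISE_ENABLING = 3
    MISSION_CRITICAL = 4
    ENTERPRISE_CRITICAL = 5


def effective_criticality(assigned, depends_on):
    """Raise each dependency to at least the level of its dependents.

    assigned:   {service: Criticality} from the original analysis
    depends_on: {service: [services or commodities it requires]}
    Anything not in `assigned` (e.g. a commodity) starts as COSMETIC.
    """
    effective = dict(assigned)
    changed = True
    while changed:  # repeat until no level rises; levels are bounded
        changed = False
        for service, deps in depends_on.items():
            level = effective.get(service, Criticality.COSMETIC)
            for dep in deps:
                if effective.get(dep, Criticality.COSMETIC) < level:
                    effective[dep] = level
                    changed = True
    return effective


assigned = {
    "email": Criticality.ENTERPRISE_CRITICAL,
    "file_service": Criticality.ENTERPRISE_ENABLING,
}

# Before the re-engineering: email has its own local storage.
before = effective_criticality(assigned, {"email": []})

# After: email is a SAN client, and the SAN needs building power.
after = effective_criticality(
    assigned,
    {"email": ["file_service"], "file_service": ["building_power"]},
)
```

Re-running a check like this after every architecture change is one way to catch a silent promotion such as the file service becoming Enterprise Critical.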
Once you have created a baseline dependency tree for your services, you have the beginnings of a Disaster Plan.
The next step in creating a Disaster Plan is identifying what constitutes a “disaster”. Here a hierarchical, double-loop approach is useful. Too often, planners identify a particular disaster (here in California, the most commonly imagined IT disaster is of course “the earthquake”, whereas IT shops in the Gulf states are probably more focused on “hurricanes”), and write their plan entirely around mitigating that particular disaster (e.g., “seismically mount the racks in the server room!”). Although it is of course a good idea to identify particular disasters, it makes the most sense to begin with a “specific result from a more general root cause” classification – this allows a planner to easily identify solutions that apply across many disasters. For example, “loss of building power” is a specific result that can have many root causes (fire, flood, earthquake, citywide power outage, meteor strike, terrorist activity, nuclear accident, etc.). Activities designed to protect your organization from “loss of building power” may or may not be applicable to each of those root causes (hence the double-loop) – taking the approach which effectively protects against many of the root causes, however, will generally be the most practical solution. Start with the specific results, then list the root causes behind each one; those root causes will in turn help you identify further results that you missed. Beginning with specific results ensures that your Disaster Plan will always be grounded in activities that mitigate the *actual* effects of root disasters upon your organization.
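To make the double loop concrete, here is a small sketch that takes one specific result, lists its root causes, and ranks candidate activities by how many of those root causes they still work under. The activities and their coverage sets are invented for illustration; your own analysis would supply the real ones.

```python
# Specific result -> root causes that can produce it (subset of the list above).
ROOT_CAUSES = {
    "loss of building power": {
        "fire", "flood", "earthquake", "citywide power outage",
    },
}

# Candidate activity -> root causes under which it remains effective.
# These coverage sets are hypothetical.
ACTIVITIES = {
    "basement generator": {"citywide power outage"},
    "rooftop generator": {"citywide power outage", "flood"},
    "off-site replica": {
        "fire", "flood", "earthquake", "citywide power outage",
    },
}


def rank_activities(result, root_causes, activities):
    """Return (activity, causes covered) pairs, widest coverage first."""
    causes = root_causes[result]
    scored = [
        (name, len(covered & causes))
        for name, covered in activities.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_activities("loss of building power", ROOT_CAUSES, ACTIVITIES)
for name, count in ranking:
    print(f"{name}: covers {count} root cause(s)")
```

Coverage count is only a first filter; cost and the criticality of the protected services still have to be weighed.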
There are three major classes of activities that should be included in any Disaster Plan: Disaster Recovery, Disaster Mitigation, and Disaster Prevention. We’ll begin with Disaster Prevention.
Disaster Prevention activities are those which are designed to prevent a disaster from occurring entirely. By their very nature, Disaster Prevention strategies usually apply only to root causes – one cannot effectively design a Disaster Prevention activity for “loss of building power” without also taking into account all of the specific root causes. As an example, acquiring a backup generator for the purpose of *preventing* “loss of building power” will not be effective if the generator is not adequately protected against all of the disasters that are root causes of “loss of building power”. Disaster Prevention activities are therefore most often both extremely expensive and incredibly complex, which makes them unsuitable for all but the most critical services.
Disaster Mitigation activities are those which are designed to maintain services in some acceptable state in the event of a disaster scenario. Properly designed, mitigation activities can easily cover many root causes without significant complexity -> rather than preventing “loss of building power”, for example, a mitigation strategy may simply involve replicating services at an off-site facility, which has the added advantage of being effective in many of the root-cause scenarios (if the off-site facility is located far enough away, geographically, you may even cover county- and state-wide disasters such as earthquakes or hurricanes). Mitigation activities, for the most part, should be designed to cover *active* disaster scenarios -> that is, a Mitigation activity goes into effect at the onset of a disaster, remains in effect for the duration of the disaster, and is terminated at some finite (planned) point after the disaster is over.
Finally, Disaster Recovery activities are those which are designed to restore services to an active state, with an acceptable (or agreed upon) degradation and/or downtime, after a disaster is over. Regardless of your Prevention and/or Mitigation activities, Disaster Recovery plans are critical. First, because of the “unknown unknowns” – your Disaster Plan will not cover every contingency. Second, because your Mitigation activities are by design temporary in nature, and at some point they ought to terminate. In order for this to occur, the base service needs to be restored to its normal operational state.
When designing these activities, the IT professional must keep in mind the following two major principles: (a) never design a Disaster Plan that includes activities that are more expensive than the value of the services they protect and (b) accept the fact that there are disaster scenarios that *will exist outside the scope of the plan*. This second point seems like common sense, but it is often overlooked. In a reductio ad absurdum illustration, I have yet to see a Disaster Plan that includes “Sol goes Nova” or “Asteroid the Size of Rhode Island Destroys all Civilization on Earth”. There will always be disaster scenarios that (a) are significantly unlikely; (b) produce effects that are fiscally unreasonable to attempt to prevent; (c) have such a drastic effect upon your organization’s structure that recovery involves re-engineering the organization essentially from scratch; or finally (d) eliminate the purpose of the organization altogether.
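Principle (a) is simple arithmetic once you have put a value on each service. A minimal sketch, with entirely made-up dollar figures and activity names, that drops any activity costing more than the combined value of the services it protects:

```python
def worthwhile(activities, service_value):
    """Keep activities whose cost does not exceed the value they protect.

    activities:    list of (name, cost, [protected services])
    service_value: {service: estimated value to the organization}
    """
    keep = []
    for name, cost, protected in activities:
        value = sum(service_value[s] for s in protected)
        if cost <= value:
            keep.append(name)
    return keep


# Hypothetical figures for illustration only.
service_value = {"email": 500_000, "intranet wiki": 20_000}
activities = [
    ("off-site email replica", 120_000, ["email"]),
    ("hot standby for wiki", 60_000, ["intranet wiki"]),
]

plan = worthwhile(activities, service_value)
```

Here the wiki standby is dropped because it costs three times what the wiki is worth. The hard part in practice is agreeing on the values, not the comparison.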
Edited to add (08/15/2007): Pingback for IT Toolkits. Any template that has 189 pages and 14 job descriptions… well, there are scalability issues. However, one link deserves another…