The Risk and Impact on Business
The risk analysis process when planning for disaster recovery provides the foundation for the entire disaster recovery planning effort. This analysis involves identifying the most probable threats that may lead to disastrous outcomes for an organization and minimizing the exposures to them. By reasoning through the possibilities, the business gets a better idea of what is important for disaster recovery. The organization as a whole also gains a valuable understanding of the mechanism of disaster, resulting in more useful plans. This is in contrast to the "be ready for anything" philosophy espoused by some planners.
In practice, most planners do prioritize planning on the basis of at least some rough estimate of the likelihood and costs associated with possible disasters. One might, quite rationally, choose to dispense with earthquake planning in a seismically inactive area of New England. For businesses in Southern California, on the other hand, earthquakes are a major concern.
As a complement to risk analysis, the business impact analysis (BIA) determines the effect that each type of potential threat, identified in the risk analysis, has on various functions or departments within the organization. Types of criteria that can be used to evaluate this impact include:
Customer service
Internal operations
Legal/statutory
Financial
The Northrop Grumman organization characterizes the major activities involved in BIA strategy as:
1. Defining criteria for what is critical
2. Identifying vital business processes
3. Identifying systems and information that support vital business processes
4. Identifying vital datasets and records
5. Determining business cost impacts
6. Identifying interdependencies
7. Defining recovery windows.
Collecting the following information during a BIA may directly influence the strategies developed for system backup:
What applications are critical or vital? One task of this analysis is to rank the relative criticality of applications to business recovery following a disaster. Critical and vital applications are defined as those that facilitate key business functions and for which alternatives are unavailable. In short, these applications need to be restored within a short time following a disaster if business recovery is to be accomplished. (Keep in mind that these "critical applications" may have other applications which provide input to them. If so, they may be critical also.)
What is the minimum acceptable hardware configuration? Once application criticality is defined, the analysis goes further to identify the hardware (both CPUs and storage devices) that is used by the application in performance of the critical or vital business function. From the perspective of disaster recovery planning, it may be possible to view hardware capabilities utilized by all non-critical and non-vital applications as "spare capacity." Hence, during its emergency operations, the business may be able to settle for far less CPU/storage capacity than it normally utilizes. The ramifications of this view are twofold. First, if less capacity is needed to run critical and vital applications, recovery hardware need not match production hardware on a one-to-one basis. However, it must do so from a compatibility standpoint. Second, if critical and vital applications run on several homogeneous (or compatible) processors in normal business operations, it may be possible to replace several low-end processors with a single high-end processor. The net result of this analysis is what may be called a "minimum acceptable hardware configuration." This configuration, which must be implemented quickly in the event of a disaster, may require substantial technical assistance to design and still more assistance to develop and test, but a workable minimum configuration can drastically reduce the costs of the disaster capability.
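As a rough illustration of the "spare capacity" reasoning above, the following Python sketch totals the capacity used by critical and vital applications only. The application names, the capacity figures, and the field names (cpu_units, storage_gb) are hypothetical and are not taken from any real inventory.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    criticality: str   # "critical", "vital", "sensitive", or "non-critical"
    cpu_units: float   # hypothetical measure of processing capacity used
    storage_gb: float  # storage used by the application


def minimum_configuration(apps):
    """Total capacity needed to run only the critical and vital applications."""
    needed = [a for a in apps if a.criticality in ("critical", "vital")]
    return {
        "cpu_units": sum(a.cpu_units for a in needed),
        "storage_gb": sum(a.storage_gb for a in needed),
    }


# Hypothetical inventory, for illustration only.
inventory = [
    Application("order entry", "critical", 4.0, 200.0),
    Application("billing", "vital", 2.0, 150.0),
    Application("reporting", "non-critical", 6.0, 500.0),
]
minimum = minimum_configuration(inventory)
```

The sketch covers capacity only. Whether the recovery hardware is compatible with production hardware still has to be checked separately.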
How many users? The analysis also identifies the number of users who would need access to applications to continue business functions at emergency levels. The number of personnel needed may be far fewer than the number employed in normal business operations. Whatever the number, they will need a designated work area, PCs/terminals, communication equipment, and network connectivity.
What are the business function requirements? Besides application and user requirements, the analysis also identifies, for each critical (and vital) job or business function, what inputs are required and what outputs are produced. Through the analysis of the data, much can be learned that will aid in the identification of an appropriate system backup strategy. The business impact analysis will also identify any special preprinted forms required for work as well as any voice communications, photocopying, facsimile transmission, courier services, and U.S. mail resources needed to complete the job.
The BIA is the key to the development of objectives for many of the disaster recovery tasks that will follow. Its many activities involve interviewing company IS personnel and users, collating responses into a comprehensive view of the corporate information asset, and formulating criteria and objectives for the plans that will be created to safeguard this asset.
The four basic objectives of risk analysis and BIA are to:
1. Identify company assets and functions that are necessary for business resumption following a disaster, and prioritize them according to time sensitivity and criticality. (BIA)
2. Identify most probable threats to assets and functions. (RA)
3. Set objectives for developing strategies to eliminate avoidable risks and minimize the impact of risks that cannot be eliminated. (RA)
4. Set objectives for developing strategies to back up and/or recover those critical business functions that may be lost in the event of a disaster. (BIA)
Identify and Prioritize Assets and Functions
A BIA consists of two basic operations: data collection and data analysis. The data collected in this analysis should include a comprehensive list of computer and telecommunications hardware and a complete inventory of applications and systems software.
Typically, this data is collected from employees who use and manage systems on a daily basis. A questionnaire may be sent to every user of a specific system or application (and to those responsible for maintaining the system or application) to identify its use in the performance of normal work. In practice, however, questionnaires that are simply sent out rarely work well: if you do send them out, expect only a 30% response rate. Creating the questionnaire is still a useful starting point, because it lets you formulate the questions you need answered. A more reliable approach is to take the questionnaire to the users and talk through it with them while you write down their answers.
Technicians should be asked more specific questions regarding file layouts, operational hardware configurations, etc. A well-designed user questionnaire asks the respondent to explain how he or she would deal with an outage that was one hour in duration, 24 hours, 48 hours, 72 hours, and so forth. If no coping strategies are identified, the system is probably critical. For example, Solutions with Technologies has developed a recovery classification scheme especially for application software. It is based on estimates of recovery time objectives (RTO) and recovery point objectives (RPO).
The ultimate purpose of the questionnaire is to identify the criticality of each application or system. This is often accomplished by asking users how they would cope if the system were unavailable for a specified period of time as illustrated in the Solutions Technology strategy. As an alternative to in-house development of BIA questionnaires, a company may elect to purchase existing software that automates the BIA process and provides pre-designed forms for data collection purposes rather than invest a lot of time in designing questionnaires.
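The coping-strategy questions described above could be tabulated along the following lines. This is a minimal Python sketch, assuming each response is recorded as a mapping from outage duration (in hours) to the coping strategy the respondent named, with None or a missing entry meaning no strategy was given. The function names and the sample answers are hypothetical.

```python
OUTAGE_HOURS = (1, 24, 48, 72)  # outage durations asked about in the questionnaire


def first_uncovered_outage(coping_strategies, durations=OUTAGE_HOURS):
    """Shortest outage (in hours) with no coping strategy, or None if all are covered."""
    for hours in sorted(durations):
        if not coping_strategies.get(hours):
            return hours
    return None


def probably_critical(coping_strategies, durations=OUTAGE_HOURS):
    """True when the respondent named no coping strategy for any duration."""
    return not any(coping_strategies.get(hours) for hours in durations)


# Hypothetical responses, for illustration only.
payroll = {1: "wait", 24: "use paper time sheets", 48: None, 72: None}
phone_switch = {}  # respondent could not name any coping strategy

payroll_gap = first_uncovered_outage(payroll)
switch_flag = probably_critical(phone_switch)
```

A result such as payroll_gap gives a first rough indication of how long the function can run without the system. It is a prompt for a follow-up conversation, not a final classification.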
The ability to cope with system interruption is called tolerance. In practical terms, tolerance may be expressed as a dollar value: It is the loss of revenues to the company from system outages of specific duration. If there is a very low tolerance within the company to the loss of a piece of equipment or to the interruption of the function it provides, this low tolerance is expressed as a high dollar value or cost. If, on the other hand, the company can tolerate to a significant extent the loss or interruption of a processing function, this high tolerance is expressed as a low dollar value or cost.
It is important to recognize that the dollar value to the company of a given system may have little to do with the dollar cost of the hardware or software used in the system. A PC, using an off-the-shelf spreadsheet package and a few hundred kilobytes of corporate financial data, may provide a low tolerance business function whose loss would be far more expensive than the loss of any given application running on a mainframe.
Applications or equipment whose loss or outage would entail great costs for the company are termed critical. Conversely, high tolerance functions are referred to as non-critical. For example, the loss of a telemarketing company's telecommunications switch would represent a low tolerance or high dollar loss. For each minute that the switch is down, the company is unable to do business and loses money. The telecommunications system, therefore, would be regarded as a critical system.
If, on the other hand, the same telemarketing firm were to lose the function of a computer application used to generate random telephone numbers, the financial impact of the outage would be very different. Since the company could readily change over to a manual system of random number generation, this application would not cost the company nearly as much as a telecommunications outage of the same duration. Thus, the application could be considered non-critical.
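The idea of expressing tolerance as a dollar value can be sketched in Python as follows. The sketch assumes, purely for simplicity, that losses grow linearly with outage duration. The hourly loss figures for the two telemarketing systems are invented for illustration and do not come from any real company.

```python
OUTAGE_HOURS = (1, 24, 48, 72)  # outage durations of interest, in hours


def outage_costs(hourly_loss, durations=OUTAGE_HOURS):
    """Estimated revenue loss for each outage duration (linear assumption)."""
    return {hours: hourly_loss * hours for hours in durations}


# Hypothetical hourly losses, in dollars.
switch = outage_costs(5000)           # telecommunications switch: low tolerance
number_generator = outage_costs(100)  # random number application: high tolerance
```

Under these assumed figures, a 24-hour outage of the switch costs 120,000 dollars, against 2,400 dollars for the number generator. Real losses are rarely linear, so a table of estimates collected from users for each duration would normally replace the multiplication.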
For many users, the tolerance of an outage may be based upon the length of time that the system or application is unavailable for use. Supporting documentation for this method of reporting critical function requirements would be an inventory of those functions contained within each percentage group. This type of reporting scheme provides BRP developers with an excellent guide for prioritizing recovery strategies.
Variances in tolerance may also be linked to the time of the day (or month) an outage occurs. (Risk analysis, in general, should assume that an outage would always occur at the worst possible time.) It is not uncommon for users to identify mitigating factors, such as the timing of a disaster, when they assess the criticality of their systems. Critical systems are often defined as such because, regardless of duration of the outage or the time of month in which an outage occurs, there are no substitute methods for providing the functions of the system.
Disaster Recovery: Business Impact
How will business be impacted?
In the context of conventional data processing, applications may be classified using the following spectrum of tolerances:
Critical. These functions cannot be performed unless identical (or close to identical) capabilities are found to replace the company's damaged capabilities. Critical applications cannot be replaced by manual methods under any circumstances. Tolerance to interruption is very low and the cost of interruption is very high. Thus, for critical systems and applications, the company needs to arrange access to comparable hardware and, in an emergency, plan to transfer the system application and associated files to the "backup" hardware in order to resume processing.
Vital. These functions cannot be performed by manual means or can be performed manually for only a very brief period of time. There is somewhat higher tolerance to interruption and somewhat lower costs, provided that functions are restored within a certain time frame (usually five days). In applications classified as vital, a brief suspension of processing can be tolerated, but a considerable amount of "catching up" is needed to restore data to a current or useable form.
Sensitive. These functions can be performed, with difficulty but at tolerable cost, by manual means for an extended period of time. Sensitive applications, however, require considerable "catching up" once restored.
Non-critical. These applications may be interrupted for an extended period of time, at little or no cost to the company, and require little or no "catching up" when restored.
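The four tolerance classes above are qualitative, but a first-pass sorting rule might look like the following Python sketch. Only the five-day restoration window comes from the description of vital functions. The one-day cutoffs and the parameter names are assumptions made for illustration, and any real scheme would set its own cutoffs.

```python
from enum import Enum


class Tolerance(Enum):
    CRITICAL = "critical"
    VITAL = "vital"
    SENSITIVE = "sensitive"
    NON_CRITICAL = "non-critical"


VITAL_WINDOW_DAYS = 5  # "usually five days" in the description of vital functions
BRIEF_MANUAL_DAYS = 1  # assumed cutoff for a "very brief" manual period


def classify(manual_days, tolerable_outage_days, needs_catch_up):
    """First-pass tolerance class for one application.

    manual_days: how long the function can be performed manually (0 = not at all)
    tolerable_outage_days: how long an interruption can last at tolerable cost
    needs_catch_up: whether considerable catching up is needed once restored
    """
    # Assumed rule: no manual fallback and under a day of tolerance is critical.
    if manual_days == 0 and tolerable_outage_days < 1:
        return Tolerance.CRITICAL
    if manual_days <= BRIEF_MANUAL_DAYS and tolerable_outage_days <= VITAL_WINDOW_DAYS:
        return Tolerance.VITAL
    if needs_catch_up:
        return Tolerance.SENSITIVE
    return Tolerance.NON_CRITICAL
```

For example, classify(0, 0.1, True) returns Tolerance.CRITICAL, while classify(30, 30, False) returns Tolerance.NON_CRITICAL. The result is a starting point for discussion with users, not a substitute for their judgment.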
In addition to asking users to classify their system's criticality and identify strategies for coping with outages, another question should be asked during initial data collection: "How much would outages of the specified duration cost the company?" User departments are often able to compile compelling cost analyses of the effects of downtime. They may collect dollar-cost data during normal operations for the purpose of demonstrating departmental performance. Many times this data can be adapted to show the dollar value of the work that would be lost if an interruption occurred.
Identify Threats to Assets and Functions
Once the criticality of systems has been assessed, the second objective, involving risk analysis, is to identify what threats exist to normal information processing activity. The best method for identifying threats is to look at the phenomena, regardless of origin, that typically cause a loss of normal system function.
These may be summarized succinctly as threats to:
Environmental support systems (e.g., electricity, water, gas)
System hardware or telecommunications
Facilities
Note: Threats related to the intentional abuse of systems by persons who are or are not corporate employees are typically viewed as security threats. Security planning is the twin of disaster recovery planning, and the two functions must work together to accomplish their respective goals.
Prioritize Disaster Recovery Planning Efforts
The relative probability of a disaster occurring should be determined in conjunction with a risk analysis. Items to consider in determining the probability of a specific disaster should include, but not be limited to:
Geographic location
Topography of the area
Proximity to power sources, bodies of water, and airports
Degree of accessibility to the organization
History of local utility companies in providing uninterrupted services
History of the area's susceptibility to natural threats
Proximity to major highways which transport hazardous waste and combustible products
Proximity to nuclear power plants
Other factors
Note: To effectively address the other threats to normal operations, such as fire and flooding, the person responsible for conducting the risk analysis needs to work with individuals at practically every level within the company.
All locations and facilities should be included in the risk assessment. The analysis should provide for the "worst case" situation: destruction of the main facility. Rather than attempting to determine exact probabilities of each disaster, a general relational rating system of high, medium, and low can be used initially to identify the threats with the highest probability. Each level of probability can then be assigned points as follows:
High = 10
Medium = 5
Low = 1
The impact on business functions or facilities can be rated as:
0 = No impact or interruption in operations
1 = Noticeable impact, interruption in operations for up to 8 hours
2 = Damage to equipment and/or facilities, interruption in operations for 8-48 hours
3 = Major damage to the equipment and/or facilities, interruption in operations for more than 48 hours
To obtain a weighted-risk rating, multiply the probability points by the highest impact rating for each facility. For example, if the probability of hurricanes is high (10 points) and the impact rating to a facility is "3" (indicating that a move to alternate facilities would be required), then the weighted risk factor is 30 (10 x 3). Based on this rating method, threats that pose the greatest risk and impact (e.g., 15 points and above) can be identified.
A low total implies a low vulnerability to the specific type of emergency, while a higher total indicates the need to direct efforts to either reduce or eliminate the threat/impact of a particular emergency.
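The weighted-risk calculation described above can be sketched in a few lines of Python. The point values, the 0 to 3 impact scale, and the 15-point cutoff come from the text. The threat list, the sample ratings, and the function and variable names are hypothetical.

```python
PROBABILITY_POINTS = {"high": 10, "medium": 5, "low": 1}
ATTENTION_THRESHOLD = 15  # "15 points and above" in the rating method


def weighted_risk(probability, impact_ratings):
    """Probability points multiplied by the highest impact rating for a facility."""
    if not impact_ratings:
        raise ValueError("at least one impact rating is required")
    if any(rating not in (0, 1, 2, 3) for rating in impact_ratings):
        raise ValueError("impact ratings must be 0, 1, 2, or 3")
    return PROBABILITY_POINTS[probability] * max(impact_ratings)


# Hypothetical threats for one facility: (probability level, impact ratings).
threats = {
    "hurricane": ("high", [3, 2]),
    "flood": ("medium", [2]),
    "earthquake": ("low", [3]),
}
scores = {
    name: weighted_risk(level, impacts) for name, (level, impacts) in threats.items()
}
# Threats at or above the cutoff, highest score first.
priority = sorted(
    (name for name, score in scores.items() if score >= ATTENTION_THRESHOLD),
    key=scores.get,
    reverse=True,
)
```

With these sample ratings, only the hurricane threat (10 x 3 = 30) reaches the cutoff, so it is the one that would be addressed first.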
The practical value of threat identification is twofold: (1) it serves to point out where disaster avoidance measures (such as halon systems, security, access systems, and power protection systems) may be needed; and (2) it identifies specific vulnerabilities that plans and procedures must specifically address.
Another benefit of the threat identification process is less methodological than psychological. Focusing a group's attention on the threat potential may increase members' awareness and sensitivity. Threat identification may also serve to make participants more aware of the interdependencies that exist among them and build team unity by clarifying shared vulnerabilities.
Much disaster planning today is based on a "seat-of-the-pants" approach. Generally, informal analysis based on a planner's intuition of disaster potentials has been rather successful. The problem is that the world is becoming more complex, not less. Therefore, some authors recommend more formal methods of analysis, such as the scenario-based approach to risk analysis.
Regardless of what method is used to prioritize disaster recovery planning efforts, it is important to first eliminate exposures that can be eliminated and to minimize the effects of those that cannot be eliminated. Having classified systems criticality, assigned costs to system outages of various durations, and identified threats to systems and data, it remains to analyze this data to formulate a set of specific objectives to guide the development of the recovery capability. The strategies to recover the critical business functions identified during the BIA can then be addressed in the disaster recovery plan.
A clear set of stated objectives, identifying the conditions, tasks, and standards for each protection or recovery strategy, is often a prerequisite for justifying the strategy to those who will have to absorb the cost or modify existing procedures. Representatives of the department that will be affected and senior management will want to know how all of the objectives fit together in a comprehensive recovery strategy.
The benefits derived from performing a comprehensive business impact analysis include:
Reducing legal liability
Minimizing potential economic loss
Decreasing potential exposure
Reducing the probability of a disaster occurrence
Reducing disruption to normal operations
Ensuring organizational stability
Ensuring orderly, systematic, and timely recovery
Minimizing insurance premiums
Reducing reliance on key personnel
Increasing asset protection
Ensuring safety of personnel and customers
Complying with legal, statutory, and regulatory requirements.
Risk and business impact analyses concern the entire organization, not just the computer systems. A comprehensive business impact analysis requires the involvement of all business units and departments.
