Business continuity and disaster recovery auditing

Organizations increasingly depend on information technology (IT) to run their operations. Business continuity planning covers the entire organization, while disaster recovery focuses on IT.

Auditing the documents that cover an organization's business continuity and disaster recovery (BCDR) plans gives stakeholders third-party validation that the documentation is complete and does not contain material misrepresentations.

Overview
Although often used together, the terms business continuity (BC) and disaster recovery (DR) are distinct. BC refers to the ability of a business to continue critical functions and business processes after a disaster, whereas DR refers specifically to the IT functions of the business and is a subset of BC.

Metrics
The primary objective is to protect the organization in the event that all or part of its operations and/or computer services are rendered partially or completely unusable.



DR metrics
Success in minimizing downtime and data loss during disaster recovery is typically measured in terms of two key concepts:
 * Recovery time objective (RTO), the target time within which a system must be completely up and running again
 * Recovery point objective (RPO), a measure of the ability to recover files, expressed as the point in time to which the backup copy will restore
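
As an illustration, the two metrics can be checked against timestamps recorded during a drill or a real outage. The following is a minimal Python sketch, not a standard tool: the class and function names, the 4-hour RTO, the 1-hour RPO, and all timestamps are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerated time until the system is running again
    rpo: timedelta  # maximum tolerated age of the data restored from backup


def evaluate_recovery(objectives, outage_start, service_restored, last_backup):
    """Compare one recovery (real or drill) against its RTO and RPO."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Hypothetical drill: 4-hour RTO, 1-hour RPO.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = evaluate_recovery(
    objectives,
    outage_start=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 12, 30),
    last_backup=datetime(2024, 1, 10, 7, 15),
)
print(result["rto_met"], result["rpo_met"])
```

In this hypothetical drill the system returns within the RTO, but the last backup is 1 hour 45 minutes old at the time of the outage, so the RPO is missed.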

The auditor's role
An auditor examines and assesses whether
 * the procedures stated in the BCP and DR plan are actually consistent with real practice
 * a specific individual within the organization, who may be referred to as the disaster recovery officer, the disaster recovery liaison, the DR coordinator, or some other similar title, has the technical skills, training, experience, and abilities to analyze the capabilities of the team members to complete assigned tasks
 * more than one individual is trained and capable of performing each function during the DR exercise. Tests and inquiries of personnel can help achieve this objective.
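
The personnel-redundancy check in the last item can be sketched in Python as a scan of a training matrix. This is only an illustration: the function name, the matrix layout, and the sample functions and people are hypothetical, and a real audit would also rely on tests and inquiries of personnel.

```python
def single_points_of_failure(training_matrix, minimum=2):
    """Return DR functions with fewer than `minimum` distinct trained people."""
    return sorted(
        function
        for function, people in training_matrix.items()
        if len(set(people)) < minimum
    )


# Hypothetical training records: DR function -> people trained to perform it.
training_matrix = {
    "restore database": ["alice", "bob"],
    "fail over network": ["carol"],
    "notify vendors": [],
}
gaps = single_points_of_failure(training_matrix)
print(gaps)
```

Any function returned by the check depends on a single person (or on nobody) and would be flagged for cross-training.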

Disaster recovery plan
A disaster recovery plan (DRP) is a documented process or set of procedures to execute an organization's disaster recovery processes and to recover and protect a business's IT infrastructure in the event of a disaster. It is "a comprehensive statement of consistent actions to be taken before, during and after a disaster". The disaster could be natural, environmental or man-made. Man-made disasters could be intentional (for example, an act of terrorism) or unintentional (that is, accidental, such as the breakage of a man-made dam or "fat fingers", meaning errant commands entered on a computer system).

Types of plans
Although there is no one-size-fits-all plan, there are three basic strategies:
 * 1) prevention, including proper backups, surge protectors, and generators
 * 2) detection, a byproduct of routine inspections, which may discover new (potential) threats
 * 3) correction

Correction may include securing proper insurance policies and holding a "lessons learned" brainstorming session.

Best practices
DRPs are most effective when updated frequently, and should:
 * be an integral part of all business analysis processes,
 * be revisited at every major corporate acquisition, at every new product launch and at every new system development milestone.

Adequate records need to be retained by the organization. The auditor examines records, billings, and contracts to verify that records are being kept. One such record is a current list of the organization's hardware and software vendors. Such a list is made and periodically updated to reflect changing business practices, and is maintained as part of an IT asset management system. Copies of it are stored on and off site and are made available or accessible to those who require them. An auditor tests the procedures used to meet this objective and determines their effectiveness.
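
Part of this check lends itself to automation. The sketch below, in Python, flags a vendor list that is stale or that lacks an on-site or off-site copy. The record layout, the field names, and the 90-day review interval are assumptions made for the example, not requirements taken from any standard.

```python
from datetime import date, timedelta


def vendor_list_findings(record, today, max_age=timedelta(days=90)):
    """Return audit findings for one vendor-list record (empty list = no issues)."""
    findings = []
    if today - record["last_updated"] > max_age:
        findings.append("vendor list is older than the review interval")
    locations = set(record["copy_locations"])
    for required in ("on_site", "off_site"):
        if required not in locations:
            findings.append(f"no copy stored {required.replace('_', ' ')}")
    return findings


# Hypothetical record as it might be exported from an IT asset management system.
record = {
    "last_updated": date(2024, 1, 5),
    "copy_locations": ["on_site"],
}
findings = vendor_list_findings(record, today=date(2024, 6, 1))
print(findings)
```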

Relationship to BCPs
Disaster recovery is a subset of business continuity. Whereas a DRP encompasses the policies, tools and procedures to enable recovery of data following a catastrophic event, a BCP involves keeping all aspects of a business functioning regardless of potential disruptive events. As such, a business continuity plan is a comprehensive organizational strategy that includes the DRP as well as threat prevention, detection, recovery, and resumption of operations should a data breach or other disaster event occur. A BCP therefore consists of five component plans:
 * Business resumption plan
 * Occupant emergency plan
 * Continuity of operations plan
 * Incident management plan
 * Disaster recovery plan

The first three components (business resumption, occupant emergency, and continuity of operations plans) do not deal with the IT infrastructure. The incident management plan (IMP) does deal with the IT infrastructure, but since it establishes structure and procedures to address cyber attacks against an organization's IT systems, it generally does not represent an agent for activating the DRP; thus DRP is the only BCP component of active interest to IT.

Benefits
As with any insurance plan, there are benefits that can be obtained from proper business continuity planning, including:
 * Minimizing risk of delays
 * Guaranteeing the reliability of standby systems (even automating the failure detection and recovery in certain scenarios)
 * Providing a standard for testing the plan
 * Minimizing decision-making during a disaster
 * Reducing potential legal liabilities
 * Reducing unnecessary stress in the work environment

Studies have shown a correlation between higher spending on auditing fees and lower rates of incidents.

Planning and testing methodology
According to Geoffrey H. Wold of the Disaster Recovery Journal, the entire process involved in developing a disaster recovery plan consists of 10 steps, including:
 * Performing a risk assessment: The planning committee prepares a risk analysis and a business impact analysis (BIA) that includes a range of possible disasters. Each functional area of the organization is analyzed to determine potential consequences. Traditionally, fire has posed the greatest threat. A thorough plan provides for "worst case" situations, such as destruction of the main building.
 * Establishing priorities for processing and operations: Critical needs of each department are evaluated and prioritized. Written agreements for alternatives selected are prepared, with details specifying duration, termination conditions, system testing, cost, any special security procedures, procedure for the notification of system changes, hours of operation, the specific hardware and other equipment required for processing, personnel requirements, definition of the circumstances constituting an emergency, process to negotiate service extensions, guarantee of compatibility, availability, non-mainframe resource requirements, priorities, and other contractual issues.
 * Collecting data: This includes various lists (employee backup position listing, critical telephone numbers list, master call list, master vendor list, notification checklist), inventories (communications equipment, documentation, office equipment, forms, insurance policies, workgroup and data center computer hardware, microcomputer hardware and software, office supply, off-site storage location equipment, telephones, etc.), distribution register, software and data files backup/retention schedules, temporary location specifications, any other such lists, materials, inventories, and documentation. Pre-formatted forms are often used to facilitate the data gathering process.
 * Organizing and documenting a written plan
 * Developing testing criteria and procedures. Reasons for testing include:
   * Determining the feasibility and compatibility of backup facilities and procedures.
   * Identifying areas in the plan that need modification.
   * Providing training to the team managers and team members.
   * Demonstrating the ability of the organization to recover.
   * Providing motivation for maintaining and updating the disaster recovery plan.
 * Testing the plan: An initial "dry run" of the plan is performed by conducting a structured walk-through test. An actual test-run must be performed. Problems are corrected.

Initial testing can be done in sections and after normal business hours to minimize disruptions. Subsequent tests occur during normal business hours.

Types of tests include: checklist tests, simulation tests, parallel tests, and full interruption tests.

Caveats/controversies
Due to high cost, various plans are not without critics. Dell has identified five "common mistakes" organizations often make related to BCP/DR planning:
 * Lack of buy-in: When executive management sees DR planning as "just another fake earthquake drill" or CEOs fail to make DR planning and preparation a priority
 * Incomplete RTOs and RPOs: Failure to include each and every important business process or a block of data. Ripples can extend a disaster's impact. Payroll may not initially be mission-critical, but left alone for several days, it can become more important than any of your initial problems.
 * Systems myopia: A third point of failure involves focusing only on DR without considering the larger business continuity needs. Corporate office space lost to a disaster can result in an instant pool of teleworkers which, in turn, can overload a company's VPN overnight, overwork the IT support staff in the blink of an eye and cause serious bottlenecks and monopolies with the dial-in PBX system.
 * Lax security: When there is a disaster, an organization's data and business processes become vulnerable. As such, security can be more important than the raw speed involved in a disaster recovery plan's RTO. The most critical consideration then becomes securing the new data pipelines: from new VPNs to the connection from offsite backup services.
   * Planning for post-mortem forensics after a disaster
   * Locking down or remotely wiping lost handheld devices

Decisions and strategies
Site designation: choice of a backup site. A hot site is fully equipped to resume operations, while a cold site does not have that capability. A warm site has the capability to resume some, but not all, operations.

A cost-benefit analysis is needed.
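
One simple way to frame such an analysis is to compare, for each site type, its yearly cost plus the expected loss from downtime. The Python sketch below is a deliberately simplified illustration: the cost model, the site costs, the recovery times, and the 5% annual disaster probability are all hypothetical figures, not data from the source.

```python
def expected_annual_cost(site_cost, recovery_hours, hourly_loss, annual_probability):
    """Yearly site cost plus the expected yearly loss from downtime."""
    return site_cost + annual_probability * recovery_hours * hourly_loss


# Hypothetical figures: site type -> (yearly cost, hours needed to resume operations).
SITES = {
    "hot": (500_000, 1),
    "warm": (200_000, 24),
    "cold": (50_000, 168),
}


def cheapest_site(hourly_loss, annual_probability=0.05):
    """Return the site type with the lowest expected annual cost."""
    costs = {
        name: expected_annual_cost(cost, hours, hourly_loss, annual_probability)
        for name, (cost, hours) in SITES.items()
    }
    return min(costs, key=costs.get)


print(cheapest_site(hourly_loss=10_000))
print(cheapest_site(hourly_loss=100_000))
```

Under these assumed numbers, the preferred site shifts from cold to warm as the hourly cost of downtime rises, which is the kind of trade-off the analysis is meant to expose.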


 * Occasional tests and trials verify the viability and effectiveness of the plan. An auditor looks into the probability that operations of the organization can be sustained at the level that is assumed in the plan, and the ability of the entity to actually establish operations at the site.
 * The auditor can verify this through paper and paperless documentation and actual physical observation. The security of the storage site is also confirmed.

Data backup: An audit of backup processes determines whether (a) they are effective, and (b) they are actually being implemented by the involved personnel. The disaster recovery plan also includes information on how best to recover any data that has not been copied. Controls and protections are put in place to ensure that data is not damaged, altered, or destroyed during this process.
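
One common control for detecting damaged or altered backup data is to compare cryptographic checksums of the source and the backup copy. The following Python sketch uses the standard-library hashlib module; the helper names and the sample files are hypothetical, and a checksum match shows only that the copy is identical, not that it can be restored within the plan's objectives.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches_source(source, backup):
    """True if the backup copy is byte-for-byte identical to the source."""
    return sha256_of_file(source) == sha256_of_file(backup)


# Demonstration with temporary, hypothetical files.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "ledger.csv"
    good_copy = Path(workdir) / "ledger.good.bak"
    bad_copy = Path(workdir) / "ledger.bad.bak"
    source.write_bytes(b"id,amount\n1,100\n")
    good_copy.write_bytes(source.read_bytes())
    bad_copy.write_bytes(b"id,amount\n1,999\n")
    good_ok = backup_matches_source(source, good_copy)
    bad_ok = backup_matches_source(source, bad_copy)

print(good_ok, bad_ok)
```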

Drills: Practice drills are conducted periodically to determine how effective the plan is and what changes may be necessary. The auditor's primary concern here is verifying that these drills are being conducted properly and that problems uncovered during these drills are addressed.

Backup of key personnel: This includes periodic training, cross-training, and personnel redundancy.

Insurance issues
The auditor determines the adequacy of the company's insurance coverage (particularly property and casualty insurance) through a review of the company's insurance policies and other research. Among the items that the auditor needs to verify are: the scope of the policy (including any stated exclusions), that the amount of coverage is sufficient to cover the organization's needs, and that the policy is current and in force. The auditor also ascertains, through a review of the ratings assigned by independent rating agencies, that the insurance company or companies providing the coverage have the financial viability to cover the losses in the event of a disaster.

Effective DR plans take into account the extent of a company's responsibilities to other entities and its ability to fulfill those commitments despite a major disaster. A good DR audit will include a review of existing memoranda of agreement (MOAs) and contracts to ensure that the organization's legal liability for lack of performance in the event of disaster or any other unusual circumstance is minimized. Agreements pertaining to establishing support and assisting with recovery for the entity are also outlined. Techniques used for evaluating this area include an examination of the reasonableness of the plan, a determination of whether or not the plan takes all factors into account, and a verification of the reasonableness of the contracts and agreements through documentation and outside research.

Communication issues
The auditor must verify that planning ensures that both management and the recovery team have effective communication hardware and contact information, both for internal communication and for external parties such as business partners and key customers.

Audit techniques include:
 * testing procedures, interviewing employees, and comparing the plan against those of other companies and against industry standards
 * examining company manuals and other written procedures
 * directly observing that emergency telephone numbers are listed and easily accessible in the event of a disaster

Emergency procedures
Procedures to sustain staff during a round-the-clock disaster recovery effort are included in any good disaster recovery plan. Procedures for stocking food and water, administering CPR and first aid, and dealing with family emergencies are clearly written and tested. This can generally be accomplished by the company through good training programs and a clear definition of job responsibilities. A review of the readiness capacity of a plan often includes tasks such as inquiries of personnel, direct physical observation, and examination of training records and any certifications.

Environmental issues
The auditor must review procedures that take into account the possibility of power failures or other situations of a non-IT nature.
 * Flashlights and candles may be needed.
 * Safety procedures for gas leaks, fires, or similar hazards, along with personal protective equipment (PPE), may be needed.