Disasters can strike at any moment. While the consequences can be unexpected, how you handle these incidents determines how much your customers trust you to deliver reliably.
According to a 2023 survey by LogicMonitor, 96% of IT decision-makers have experienced at least one outage in the past three years, with many of these organizations experiencing several outages in a single year.
Responding to an outage requires a thoughtful strategy that includes planning, system redundancy, testing and employee training. This blog post details several aspects of disaster recovery (DR) to consider when creating a new strategy or updating an existing one.
Disaster Recovery Process
Responding to a disaster takes preparation. The disaster recovery procedures listed below cover activities that take place before, during, and after the event.
Obtain executive management commitment.
Senior managers within the organization must commit the people and other resources needed to develop the DR plan and to keep executing it once approved. At least one of these senior leaders should sit on a steering committee, composed of technology and data managers, that shapes the plan and defines its scope.
Create a disaster recovery team.
A disaster recovery team is a collaborative team of experts who own critical operational aspects of any response to an outage or disaster:
Response Manager: This individual owns the disaster recovery process. When an event takes place, this person leads those individuals responsible for implementing all parts of the disaster recovery plan. This person will also communicate and work with vendor partners who participate in the recovery.
Business Continuity Manager: This person ensures that the disaster recovery efforts meet business expectations and align with the anticipated business impacts. The business continuity manager will report on the progress of any response to the executive managers who are a part of the disaster recovery steering committee.
IT Impact and Recovery Managers: These individuals are experts in different technology and data domains who can quickly assess impacts and recommend any fixes required. They typically are involved in disaster recovery tasks, ensuring that systems and data are brought back online.
Perform a risk assessment.
One of the first tasks of the steering committee will be to draft a risk assessment that describes the business impact of outages resulting from different types of disasters. The risk assessment should consider risks by functional areas within the business, considering all impacts and costs. The committee should also consider the cost to minimize potential risk exposure.
Inventory and prioritize your assets.
Create a list of all the assets needed for your business to operate. This includes technology, data, documentation and other assets. Prioritize this list into several categories: critical, important, and non-essential. Prioritizing these assets lets you decide which must be recovered first and set a timeline for recovery.
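As a rough illustration of the prioritization step, the sketch below sorts an inventory by tier and then by recovery target. The three tier names come from the text above; the `Asset` fields, the numeric ranks, and the sample inventory are assumptions made up for this example.

```python
from dataclasses import dataclass

# Tier names come from the text; the numeric ranks are an illustrative assumption.
TIER_RANK = {"critical": 0, "important": 1, "non-essential": 2}


@dataclass
class Asset:
    name: str
    tier: str              # "critical", "important", or "non-essential"
    recovery_hours: float  # hypothetical target time to restore this asset


def recovery_order(assets):
    """Return assets sorted by tier first, then by shortest recovery target."""
    unknown = [a.name for a in assets if a.tier not in TIER_RANK]
    if unknown:
        raise ValueError(f"unknown tier for: {unknown}")
    return sorted(assets, key=lambda a: (TIER_RANK[a.tier], a.recovery_hours))


# Hypothetical inventory, for illustration only.
inventory = [
    Asset("marketing wiki", "non-essential", 72),
    Asset("customer database", "critical", 1),
    Asset("payroll system", "important", 24),
    Asset("order API", "critical", 2),
]

for asset in recovery_order(inventory):
    print(f"{asset.tier:>13}  {asset.recovery_hours:>5}h  {asset.name}")
```

Sorting on a tuple keeps the rule easy to audit: tier always wins, and the recovery target only breaks ties within a tier.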
Draft a disaster recovery plan.
The disaster recovery plan can be a stand-alone document or be part of a broader business continuity plan. While the sections and content will vary depending on business needs, most disaster recovery plans will contain the following information:
- Objectives
- Key employee contact information
- Key roles and responsibilities
- Communication plan
- Asset inventory – including recovery times
- Location of backups and disaster recovery sites
- Disaster response procedures
- Plan testing and maintenance
Test the disaster recovery plan.
Once the plan is drafted, it must be tested, which includes taking IT systems and services offline. As systems and services are brought back online, there may be changes needed to priority, timing, or responsibilities. Make sure the steering committee is involved in this test, as its approval should be obtained once the plan is finalized after testing.
Run disaster recovery drills.
Since the disaster recovery plan is a living document, drills should be run every year to test it. The plan should also be updated to reflect changes to personnel and/or technology.
Disaster Recovery Templates
The disaster recovery plan contains many parts that all need to be correct and current. While plans will vary depending on the needs of the business, consider the following elements to include:
Objective: State upfront that the plan exists to ensure the organization can recover quickly and that all operational policies are followed. Include any governance guidance in the objective.
Key Employee Contact Information: This list should include each employee’s name, role within the organization, and the phone number and email address they are most likely to answer in an emergency.
Key Roles and Responsibilities: This section includes individuals on the disaster recovery team and others who are responsible for participating in a disaster response. Include their names, roles in the disaster response and descriptions of their responsibilities.
Communication Plan: The communication plan contains several parts and will take time and consideration to develop.
- Internal communication: Quickly alerting your employees and contractors about a disaster is a top priority. Department leaders should all follow the same protocol to ensure consistent messaging.
- External communication: Messaging to customers, stakeholders and, in some cases, the media is critical. Develop this messaging in advance so that communication is swift and comprehensive.
Asset Inventory: An accurate inventory of hardware, software and data is critically important to help the organization understand all assets that support critical business functions. Inventories include dependencies between assets and can identify single points of failure. Asset inventories can also show links to vendors and contractors who will need to be a part of your disaster recovery planning.
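One way an inventory with dependencies can surface single points of failure is sketched below. It walks the dependency map from a business function and flags every asset it reaches that has fewer than two redundant instances. The data model (`depends_on`, `instances`) and the sample assets are hypothetical; a real inventory tool will have its own schema.

```python
def single_points_of_failure(depends_on, instances, roots):
    """Return assets reachable from `roots` that have fewer than two instances.

    depends_on: maps a business function or asset to the assets it needs.
    instances:  maps an asset to its number of redundant instances
                (assets missing from this map are treated as having one).
    roots:      business functions to start from; they are not reported.
    """
    seen, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, []))
    assets = seen - set(roots)
    return sorted(a for a in assets if instances.get(a, 1) < 2)


# Hypothetical inventory, for illustration only.
depends_on = {
    "online ordering": ["web server", "order database"],
    "web server": ["load balancer"],
    "order database": ["storage array"],
}
instances = {
    "web server": 3,
    "order database": 2,
    "load balancer": 1,
    "storage array": 1,
}

print(single_points_of_failure(depends_on, instances, ["online ordering"]))
```

Treating unlisted assets as single-instance is a deliberately conservative choice: an asset nobody has counted is reported rather than silently assumed to be redundant.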
Location of Backups and Disaster Recovery Sites: This includes physical addresses, correct phone numbers, and the names of people to contact when a disaster occurs.
Disaster Response Procedures: Procedures to bring systems, networks and applications back online will vary, but consider including recovery time objectives and recovery point objectives where needed. If utilizing backup and recovery services with an outside vendor, establish work protocols and ensure open communication.
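To make the two objectives concrete, the sketch below checks the timeline of a drill or incident against them, using the usual definitions: the recovery point objective (RPO) bounds how much recent data may be lost, and the recovery time objective (RTO) bounds how long the system may be down. The function name, the timestamps, and the targets are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup, failure, restored, rpo, rto):
    """Compare one incident timeline against RPO and RTO targets."""
    data_loss_window = failure - last_backup  # data written after the last backup is lost
    downtime = restored - failure             # time the system was unavailable
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical drill timeline and targets, for illustration only.
report = check_objectives(
    last_backup=datetime(2024, 1, 10, 2, 0),
    failure=datetime(2024, 1, 10, 9, 30),
    restored=datetime(2024, 1, 10, 10, 15),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=1),
)
print(report)
```

In this made-up timeline the system is back within the one-hour RTO, but the last backup is 7.5 hours old, so the four-hour RPO is missed. The usual remedy for that kind of gap is more frequent backups rather than a faster restore.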
Plan Testing and Maintenance: While testing the disaster recovery plan does take time, it is a critical activity. There are several types of tests to consider: a regular review of the plan, a tabletop exercise and a simulation test. Some organizations perform all three regularly.
Disaster Recovery Tools
Part of the disaster recovery plan will include the tools and types of backups and recovery that the steering committee has selected. There are several to consider:
Data Backup: This is the most common type of backup that organizations consider; it is also the easiest to implement. The tools that enable data backup and restore can be a physical drive (which can be located on or off-site) or a cloud-based backup and restore service.
Data Center Disaster Recovery: Data centers are appropriate for organizations with proprietary data and/or a business need for redundancy. A data center typically has the IT and security infrastructure in place to ensure business continuity. When the recovery data center is in a different geographic location than the primary data center, this strategy mitigates the risk of a fire or a large natural disaster.
Virtualized Disaster Recovery: Virtualized disaster recovery enables organizations to back up their data and workloads to off-site virtual machines. This strategy is beneficial because these platforms are generally available in the event of a disaster and can help organizations meet short recovery time objectives.
Disaster recovery as a service (DRaaS): DRaaS is a cloud-based disaster recovery service that allows organizations to replicate computer systems and critical business operations to the cloud in an off-site location. Typically, service-level agreements with the DRaaS vendor will determine response times and direct support in the event of a disaster.
Disaster Recovery Best Practices
Disaster recovery relies on having a well-executed plan to quickly get your applications and systems up and running. An effective disaster recovery strategy will enable an organization to address all three elements of disaster recovery:
Prevention
A critical aspect of disaster recovery is reducing the risk of a disaster impacting your business's continuity. These risks generally fall into network risks, security risks, and the risk of human error. Build risk-reducing tools and techniques into your disaster recovery plan, such as continuous backups and software that automatically checks the environment for configuration errors.
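An automated configuration check can be as simple as a set of rules run against the current settings. The sketch below is a minimal illustration, not a stand-in for any particular product; the setting names (`backup_enabled`, `backup_interval_minutes`, `replication_target`) and their limits are hypothetical.

```python
def find_config_errors(config, rules):
    """Return a list of readable problems; an empty list means every rule passed."""
    problems = []
    for key, (check, message) in rules.items():
        if key not in config:
            problems.append(f"{key}: missing")
        elif not check(config[key]):
            problems.append(f"{key}: {message}")
    return problems


# Hypothetical rules: each setting maps to (validation function, error message).
RULES = {
    "backup_enabled": (
        lambda v: v is True,
        "backups must be enabled",
    ),
    "backup_interval_minutes": (
        lambda v: isinstance(v, int) and 0 < v <= 60,
        "interval must be between 1 and 60 minutes",
    ),
    "replication_target": (
        lambda v: isinstance(v, str) and v != "",
        "must name an off-site target",
    ),
}

# Hypothetical environment settings, for illustration only.
config = {"backup_enabled": True, "backup_interval_minutes": 240}

for problem in find_config_errors(config, RULES):
    print(problem)
```

Running a check like this on a schedule, or before each deployment, catches drift such as a backup interval that someone quietly lengthened.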
Detection
To recover as quickly as possible, you will need to know when to respond. Build a recovery time objective into your disaster recovery plan that states the total time allowed for recovery (e.g., one hour). This will necessitate quick detection and notification. Modern system-wide monitoring tools can detect system anomalies in real time, examine potential impacts, and notify the right people.
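The arithmetic behind "quick detection" can be sketched as a time budget: whatever part of the recovery time objective is not consumed by notification and restoration is all that remains for noticing the problem. The one-hour objective echoes the example above; the notification and restore durations are assumed figures.

```python
from datetime import timedelta


def detection_budget(rto, notify, restore):
    """Time left for detection after reserving time to notify and restore."""
    budget = rto - notify - restore
    if budget <= timedelta(0):
        raise ValueError("notification and restore leave no time for detection")
    return budget


# Assumed figures: 5 minutes to reach the team, 40 minutes to restore service.
budget = detection_budget(
    rto=timedelta(hours=1),
    notify=timedelta(minutes=5),
    restore=timedelta(minutes=40),
)
print(budget)
```

With these assumed numbers, monitoring has to flag the outage within 15 minutes, which is a useful input when choosing check intervals and alert thresholds.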
Correction
Correction or mitigation is how an organization responds after a disaster has struck. What takes place after a disaster can be more important than prevention and detection, because it is the chance to strengthen business operations: update testing scenarios, train the right people on what has been learned, and change systems to make them more resilient.