Terry Kuxhaus | June 30, 2020 | 13 min read

How to Mature Your Disaster Recovery Testing Plan

A challenge many organizations face is understanding if and how they would recover from a disaster or malware event that takes down the production IT infrastructure or datacenter. In today’s workplace, nearly every organization is heavily reliant on IT and may not be able to conduct business without it. This was a challenge I faced while managing the IT infrastructure at a midsized community bank. Here are some guidelines to help plan, prepare, and test for the unforeseen disaster and keep your business afloat.

 

Identify Your Objectives

The first phase in the process is identifying how reliant the organization is on IT. Can you service your customers if all IT resources are down? Most likely not, which is why it’s important to set some goals, requirements, and priorities around the recovery of those resources. Management should be heavily involved in making these determinations. Your organization should already have a quality Business Impact Analysis (BIA), which is used to help make decisions and set priorities around the recovery of your most important business processes. The SBS reference library has previously covered how to create a valuable BIA, but keep in mind these recovery factors:

  • Impacts: customer, financial, legal, and recovery resources
  • Time Frames: Recovery Point Objective (RPO), Recovery Time Objective (RTO), and Maximum Tolerable Downtime (MTD)
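
As a rough illustration of how these time frames can be recorded and sanity-checked, here is a minimal Python sketch. The process names, priorities, and hour values are hypothetical, not taken from any real BIA, and the only rule enforced is the commonly used one that an RTO should not exceed the MTD.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    process: str      # business process identified in the BIA (hypothetical names below)
    priority: int     # 1 = recover first
    rpo_hours: float  # maximum acceptable data loss, in hours
    rto_hours: float  # target time to restore the process, in hours
    mtd_hours: float  # maximum tolerable downtime, in hours


def validate(objectives):
    """Return a list of problems found in the stated objectives."""
    problems = []
    for o in objectives:
        # A recovery target later than the tolerable downtime cannot be met in time.
        if o.rto_hours > o.mtd_hours:
            problems.append(f"{o.process}: RTO exceeds MTD")
    return problems


def recovery_order(objectives):
    """Order processes by priority, then by the tightest RTO."""
    ranked = sorted(objectives, key=lambda o: (o.priority, o.rto_hours))
    return [o.process for o in ranked]


# Hypothetical example data for illustration only.
objectives = [
    RecoveryObjective("Online banking", priority=2, rpo_hours=4, rto_hours=8, mtd_hours=24),
    RecoveryObjective("Core processing", priority=1, rpo_hours=1, rto_hours=4, mtd_hours=12),
    RecoveryObjective("Email", priority=3, rpo_hours=24, rto_hours=48, mtd_hours=36),
]

print(validate(objectives))
print(recovery_order(objectives))
```

Keeping the objectives in one structured list makes it easier to spot inconsistent targets before they are written into the DR plan.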

 

Planning Your DR Test

You need to determine how to meet your stated recovery objectives. Do you have a Disaster Recovery (DR) site? Are you relying on a managed service provider (MSP) to provide this capability? Are you confident your data is being properly backed up and can be recovered? Is your infrastructure able to meet these requirements?


There are many ways to structure the DR capabilities of an organization to meet the objectives and priorities. Generally, the smaller the allowed window of downtime, the more costly the solution is to build and deploy.

  • Onsite Manual Restore – Backing all of your data up locally and restoring onsite manually requires very little cost upfront, but takes a long time to procure and build the infrastructure. Additionally, onsite manual restoration is very difficult to test. It may take weeks to recover using an onsite manual restore method.
  • Cold Site – An alternate location that has some equipment available, but the equipment is not running or configured. Infrastructure build-out is required, along with data recovery and restoring to a Cold Site, which can be a lengthy recovery process.
  • Warm Site – A Warm Site has infrastructure and connectivity in place and running, ready to have data restored to it. If your organization has multiple locations, often-times one of your alternate locations can be used as a DR facility with redundant equipment. The Warm Site method does require infrastructure and software, which adds cost.
  • Hot Site – For organizations that require very small windows of downtime, a Hot Site is needed. A Hot Site requires redundant infrastructure, mirroring production with constant data replication, which must be able to take the production load with very brief downtime.
  • Disaster Recovery as a Service (DRaaS) – DRaaS is a new concept in DR planning. Organizations that don’t want to maintain a DR site, and the headaches that come with one, can partner with a DRaaS provider. DRaaS provides cloud-based infrastructure to act as the target for data replication and recovery of critical data and business applications. This can be a great solution for an organization of any size that doesn’t want to maintain its own DR site yet still needs very high uptime guarantees.
  • MSP-Provided DR – Organizations whose infrastructure is hosted by a managed service provider (Infrastructure as a Service, or IaaS) can rely on their provider to maintain a DR site. It’s very important to make sure you have a documented agreement in place with the MSP stating how the recovery process will be structured, with a Service Level Agreement (SLA) detailing the priorities and requirements. SLAs should define the frequency and type of testing performed, along with any penalties if the objectives are not met.

 

Data Backup and Replication

Another key component in DR planning is making sure you have a solid data backup process in place. Your BIA will help determine whether data replication to your DR site is needed in order to meet RTO requirements. A best practice in data backups is the 3-2-1 rule: maintain at least three (3) copies of your data, on two (2) different types of media, with one (1) copy offsite. Make sure at least one backup copy is isolated so that malware or ransomware cannot reach it. Backups should be tested monthly to verify that data can be restored and its integrity is intact.
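
The 3-2-1 rule and the isolation requirement can be expressed as a simple check against an inventory of backup copies. The sketch below is illustrative only: the `BackupCopy` fields and the sample inventory are assumptions, and it counts the production copy as one of the three copies, which is one common reading of the rule.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    name: str
    media: str      # e.g. "disk", "tape", "cloud" (labels are assumptions)
    offsite: bool   # stored away from the production location
    isolated: bool  # not reachable from the production network


def check_3_2_1(copies):
    """Return a list of findings; an empty list means every check passed."""
    findings = []
    if len(copies) < 3:
        findings.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        findings.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        findings.append("no offsite copy")
    if not any(c.isolated for c in copies):
        findings.append("no copy isolated from malware")
    return findings


# Hypothetical inventory for illustration only.
copies = [
    BackupCopy("production data", media="disk", offsite=False, isolated=False),
    BackupCopy("local backup appliance", media="disk", offsite=False, isolated=False),
    BackupCopy("cloud replica", media="cloud", offsite=True, isolated=False),
]

print(check_3_2_1(copies))
```

In this sample the inventory satisfies 3-2-1 on paper but still has no isolated copy, which is the gap ransomware tends to exploit.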

 


Backup retention schedules will vary by organization, but 14 days is a reasonable minimum. A more commonly recommended schedule is 30 days of backups plus quarterly and yearly archives.
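
A retention schedule like the recommended one can be written down as a small rule. The following Python sketch assumes, purely for illustration, that archives are taken on the first day of each quarter and year; real backup products implement retention in their own ways, and how long archives are kept is left out because it depends on your organization’s requirements.

```python
from datetime import date


def retention_class(backup_date, today, daily_days=30):
    """Classify a backup as 'yearly', 'quarterly', 'daily', or 'expire'.

    Assumption: archive backups are the ones taken on the first day
    of a year or quarter.
    """
    if backup_date.month == 1 and backup_date.day == 1:
        return "yearly"
    if backup_date.month in (1, 4, 7, 10) and backup_date.day == 1:
        return "quarterly"
    age_days = (today - backup_date).days
    if 0 <= age_days < daily_days:
        return "daily"
    return "expire"


today = date(2020, 6, 30)
for d in (date(2020, 6, 15), date(2020, 5, 10), date(2020, 4, 1), date(2020, 1, 1)):
    print(d.isoformat(), retention_class(d, today))
```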

 

Draft a Plan

An IT DR Plan is not the same as a Business Continuity Plan (BCP). The two may certainly complement each other, and for smaller organizations, the two plans may be combined. The IT DR plan should provide specific procedures for restoring the IT infrastructure, whereas the BCP is an organizational guide on how to respond to business interruptions.


The IT DR plan should include detailed procedures on the recovery process of the items identified in the BIA and follow the specified priorities. Your IT DR Plan should include procedures for restoring servers, endpoints, voice communications, network infrastructure, and any other technology your organization needs to maintain business operations. In addition to the core application, make sure other critical processes and applications are included. If you utilize an MSP for DR, you will need to coordinate with your provider on the content in the plan. Make sure the DR Plan includes provisions for potential issues, as recovery tends not to go as planned. It is very helpful to generate a checklist to be utilized during testing. This checklist will help manage the tasks that need to be accomplished during the testing. A list of all critical vendors and support resources should be included with contact information.


Your DR Plan should also include how testing will be performed to validate a successful recovery and what will be tested. Defining your DR Plan testing objectives is another very important decision that management must help make. Is it acceptable to restore servers and infrastructure and verify applications and data are accessible without performing transactions? Some larger organizations with more stringent uptime requirements may choose to go further with testing. If minimizing downtime is critical to your organization, you may choose to do a full restore of all critical business processes and run in a live environment at the DR site for a period of time. Before making that determination, you need to understand the risks associated with doing so. When doing a production test of this nature, many issues can arise, causing degraded performance, data loss, and downtime.


In addition, you need to plan for the transition back to the normal production network when testing is complete.

 

Types of DR Tests

One of the most important aspects of preparing for a disaster is testing. DR Testing will validate whether or not your organization can recover business processes that meet your recovery objectives, including recovering and restoring your data. DR Testing will also help determine if the current DR process can meet the recovery time objectives specified in the BIA. DR and BCP testing should be performed annually at a minimum. Some larger organizations may require more frequent testing.


There are several phases in the DR Testing process, which may vary depending on how the DR infrastructure is structured (Cold Site will take a lot more resources and procedures than a Hot Site that has everything built and running).

  1. DR Plan Review – Start by reviewing the plan with limited participants to make sure the plan is accurate and understood. If using an MSP, they will be heavily involved in all phases.
  2. DR Walkthrough – Once verified, schedule a walkthrough with key personnel to further vet the plan for missing information or processes.
  3. DR Tabletop Testing – Now you’re ready for a tabletop with the DR team. The tabletop test is when you will gather the troops, identify responsibilities, and make sure everyone is able to perform the required actions. This phase would include all participants needed for application testing and data validation. It’s important to have clearly defined procedures and checklists for validation testing.
  4. Mock Testing – Before conducting an actual DR test, it’s recommended to do small-scale tests of portions of the infrastructure to verify the recovery process works. For example, if you are replicating virtual servers to the DR site, you may want to spin up a few VMs at the DR site to verify that the replication is working correctly and that the infrastructure can support the servers (be careful with duplicate IPs and host names to avoid impacting the production servers).
  5. Parallel Test – In a parallel test, the DR environment is used to restore the infrastructure in a manner that has little impact on the production environment. The production network continues to function while the DR infrastructure is being tested, and no live transactions are made to the DR infrastructure. Network isolation may be required if duplicate IP addressing is being used by the recovered infrastructure. Generally, parallel testing is performed during off-hours to avoid production impacts. Many organizations choose to do parallel testing rather than full failover testing due to the impact and risks associated with processing production transactions on the DR environment.
  6. Full Failover Test – The full failover DR test is the epitome of DR testing. If you are able to fail over to your DR site, successfully process and perform business operations from the DR site, and then fail back again, your organization should be able to weather most disasters smoothly from a business standpoint.


If your organization has very strict recovery guidelines, you may be required to perform a full production failover to the DR site. In this type of test, the DR site becomes the production site, and live transactions and changes are made to the DR environment. Generally, this environment is more complex and costly to build, and very specific steps may be required to perform the switchover. As mentioned above, this may introduce significant risks and impacts. In particular, the infrastructure needs to be able to synchronize back to the normal production network without data loss or integrity issues.


The catch to full failover testing is that it typically takes numerous iterations of testing to achieve a true full failover, which may take years to achieve (depending on your time and resources). While full failover testing is a tremendous undertaking, you’ll learn a lot of valuable lessons along the way.
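
The duplicate IP and host name caution raised for mock and parallel tests can be checked before any recovered server is powered on. Below is a minimal Python sketch using the standard `ipaddress` module; the inventory format (dictionaries with `name` and `ip` keys) and the sample hosts are assumptions for illustration, not an export format from any particular tool.

```python
import ipaddress


def find_conflicts(production, recovered):
    """List recovered hosts whose IP or host name collides with production."""
    prod_ips = {ipaddress.ip_address(h["ip"]) for h in production}
    prod_names = {h["name"].lower() for h in production}
    conflicts = []
    for h in recovered:
        if ipaddress.ip_address(h["ip"]) in prod_ips:
            conflicts.append((h["name"], "duplicate IP"))
        # Host names are compared case-insensitively.
        if h["name"].lower() in prod_names:
            conflicts.append((h["name"], "duplicate host name"))
    return conflicts


# Hypothetical inventories for illustration only.
production = [
    {"name": "core01", "ip": "10.0.0.10"},
    {"name": "mail01", "ip": "10.0.0.20"},
]
recovered = [
    {"name": "core01-dr", "ip": "10.0.0.10"},
    {"name": "MAIL01", "ip": "10.1.0.20"},
]

for host, reason in find_conflicts(production, recovered):
    print(f"{host}: {reason}")
```

Any conflict in the output is a signal to isolate the DR network or re-address the recovered host before bringing it online.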

 

Communication

Communication is a key piece before, during, and after all phases of testing. It’s vitally important to make sure management is aware of the process, has been involved in the planning and scheduling, and has been informed of the potential impacts of the testing. The test should be scheduled well in advance, and all affected parties should be notified and included in the testing as needed. Don’t forget to involve critical third parties in the testing, and consider involving audit to validate results (as an alternative to audit, screenshots can be captured for validation).

 

Performing Your DR Test

Be sure to verify the test environment at least a week before the actual test, including:

  • DR Systems should be accessible and able to handle the workload of the recovery processes.
  • Patches and updates should be installed to match production.


Items needed during testing:

  • IT DR Plan
  • Checklist
  • Important Contact List including vendors
  • Network Diagram
  • Internet Access for research if needed

 

The DR plan and checklist should be your primary run book during the test. If you identify missing or inaccurate information, take the time to make notes so it can be corrected once testing is complete. It may be beneficial to designate an individual as an official note-taker to make sure all activities are documented with a timestamp. The plan should clearly state what steps should be taken and what recovery priorities exist. As the test progresses, gaps will likely be identified in the plan, especially if it is being used for the first time.


Before the testing begins, communicate with all participants that you want to keep the process very methodical and organized without any sense of panic and chaos. The participants should all know their roles and responsibilities in advance of the actual testing. Designate one person (generally the manager) as the lead during testing to step the team through each phase of the plan and guide the activities.


While the test is being performed, keep the team informed on the status of the recovery processes and where you are in the plan. If issues are being identified and addressed as you’re testing, other dependencies may be impacted and slow the overall progress. Assignments may need to be shifted so that all resources are used to resolve issues and recovery timeframes stay as short as possible. As systems are recovered, verify functionality and data integrity as the plan requires. A timestamped screenshot may be used for validation and also helps identify whether RTOs are being met.
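
If the note-taker records when each system was validated, comparing those timestamps against the RTOs is simple arithmetic. The sketch below assumes a hypothetical log of validation times and RTO values in hours; the system names and times are made up for illustration.

```python
from datetime import datetime


def recovery_report(test_start, restored_at, rto_hours):
    """Map each system to (elapsed hours, whether its RTO was met)."""
    report = {}
    for system, finished in restored_at.items():
        elapsed = (finished - test_start).total_seconds() / 3600
        report[system] = (round(elapsed, 2), elapsed <= rto_hours[system])
    return report


# Hypothetical test log for illustration only.
test_start = datetime(2020, 6, 27, 8, 0)
restored_at = {
    "Core processing": datetime(2020, 6, 27, 11, 30),
    "Email": datetime(2020, 6, 27, 20, 15),
}
rto_hours = {"Core processing": 4, "Email": 8}

for system, (hours, met) in recovery_report(test_start, restored_at, rto_hours).items():
    print(f"{system}: {hours} h, RTO {'met' if met else 'missed'}")
```

The same figures can feed directly into the lessons-learned report after the test.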


Once the testing is complete, follow the plan to revert all systems back to production mode.


Very Important: When DR Testing is complete and all changes have been reverted back to production, verify that production systems are working normally and data is accessible.


It’s a good idea to inform management that the testing is complete and provide a summary of the outcome.

 

Reporting – Lessons Learned

After the completion of testing, the team should meet to discuss key takeaways. It’s important to take this opportunity to improve the recovery process and utilize lessons learned as a guide.

  • If issues were encountered during testing, how might the same issues be avoided in a live scenario?
  • What improvements can be made to the process to improve disaster recovery timeframes or procedures in the future?
  • Were you able to meet your RTOs?
  • Did the DR Plan provide proper guidance?
  • Did you have the resources needed to recover effectively?
  • Are there technology issues that need to be addressed?


Update the DR Plan accordingly to incorporate any additions and improvements. The results of the testing and lessons learned should be formally documented and provided to management/Board of Directors.


The report should include the following topics:

  • Scope and objectives
  • Resources utilized
  • Functionality in DR and/or limitations
  • Issues identified and what improvements are needed
  • If retesting is required and when
  • Overall summary

 

Summary

The goal is to make sure your organization will be able to recover IT in a timely manner when a disaster occurs. A prolonged loss of IT functions in a disaster can lead to business failure. For example, a widely cited statistic attributed to the U.S. National Archives & Records Administration holds that approximately 93 percent of companies that lose their computer systems for 10 days or more due to a disaster file for bankruptcy within one year of the event.


It is vitally important to have open communication and management involvement when preparing your DR Plan and testing. This process should not be the responsibility of one individual, but rather a group effort.

Effective disaster recovery testing can reduce downtime and save your organization time and money in the event of an actual disaster, but getting to the point of creating a valuable DR testing process can take time. As with all things valuable, maturing your DR testing processes doesn’t happen overnight, but increasing your confidence in your ability to recover from a disaster is worth the effort.

 
 

 


Terry Kuxhaus

Terry Kuxhaus is an Information Security Consulting Team Lead at SBS CyberSecurity. He is also an instructor for the SBS Institute, leading the Certified Banking Vulnerability Assessor (CBVA) course.
