How to Mature Your Disaster Recovery Testing Plan

A challenge many organizations face is understanding if and how they would recover from a disaster or malware event that takes down the production IT infrastructure or datacenter. In today’s workplace, nearly every organization is heavily reliant on IT and may not be able to conduct business without it. This was a challenge I faced while managing the IT infrastructure at a midsized community bank. Here are some guidelines to help plan, prepare, and test for the unforeseen disaster and keep your business afloat.

 

Identify Your Objectives

The first phase in the process is identifying how reliant the organization is on IT. Can you service your customers if all IT resources are down? Most likely not, which is why it’s important to set some goals, requirements, and priorities around the recovery of those resources. Management should be heavily involved in making these determinations. Your organization should already have a quality Business Impact Analysis (BIA), which is used to help make decisions and set priorities around the recovery of your most important business processes. The SBS reference library has previously covered how to create a valuable BIA, but keep in mind these recovery factors:

  • Impacts – customer, financial, legal, and recovery resources
  • Time Frames – Recovery Point Objective (RPO), Recovery Time Objective (RTO), and Maximum Tolerable Downtime (MTD)
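
To make the time frames concrete, here is a minimal Python sketch that checks whether one process's objectives are internally consistent. The `RecoveryObjectives` class, the process names, and the hour values are illustrative assumptions, not figures from any real BIA. It applies two common rules of thumb: the RTO should not exceed the MTD, and the interval between backups (or replication cycles) should not exceed the RPO.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    """Time frames for one business process, in hours (illustrative values only)."""
    process: str
    rpo_hours: float  # Recovery Point Objective: maximum tolerable data loss
    rto_hours: float  # Recovery Time Objective: target time to restore service
    mtd_hours: float  # Maximum Tolerable Downtime


def find_objective_gaps(obj: RecoveryObjectives, backup_interval_hours: float) -> list[str]:
    """Return human-readable problems; an empty list means no gaps were found."""
    gaps = []
    if obj.rto_hours > obj.mtd_hours:
        gaps.append(
            f"{obj.process}: RTO ({obj.rto_hours}h) exceeds MTD ({obj.mtd_hours}h)"
        )
    # In the worst case, everything written since the last backup is lost.
    if backup_interval_hours > obj.rpo_hours:
        gaps.append(
            f"{obj.process}: backup interval ({backup_interval_hours}h) "
            f"exceeds RPO ({obj.rpo_hours}h)"
        )
    return gaps


if __name__ == "__main__":
    # Hypothetical process with nightly backups but a one-hour RPO.
    online_banking = RecoveryObjectives("Online banking", rpo_hours=1, rto_hours=8, mtd_hours=24)
    for gap in find_objective_gaps(online_banking, backup_interval_hours=24):
        print(gap)
```

A check like this is only as good as the numbers fed into it, so the values should come from the BIA that management has approved.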

 

Planning Your DR Test

You need to determine how to meet your stated recovery objectives. Do you have a Disaster Recovery (DR) site? Are you relying on a managed service provider (MSP) to provide this capability? Are you confident your data is being properly backed up and can be recovered? Is your infrastructure able to meet these requirements?


There are many ways to structure the DR capabilities of an organization to meet its objectives and priorities. Generally, the smaller the allowed window of downtime, the more costly the solution is to build and deploy.

  • Onsite Manual Restore – Backing up all of your data locally and restoring it manually onsite costs very little upfront, but procuring and building the replacement infrastructure takes a long time. Additionally, onsite manual restoration is very difficult to test. It may take weeks to recover using an onsite manual restore method.
  • Cold Site – An alternate location that has some equipment available, but the equipment is not running or configured. Infrastructure build-out is required, along with data recovery and restoring to a Cold Site, which can be a lengthy recovery process.
  • Warm Site – A Warm Site has infrastructure and connectivity in place and running, ready to have data restored to it. If your organization has multiple locations, one of your alternate locations can often be used as a DR facility with redundant equipment. The Warm Site method does require infrastructure and software, which adds cost.
  • Hot Site – For organizations that require very small windows of downtime, a Hot Site is needed. A Hot Site requires redundant infrastructure, mirroring production with constant data replication, which must be able to take the production load with very brief downtime.
  • Disaster Recovery as a Service (DRaaS) – DRaaS is a newer concept in DR planning. Organizations that don’t want to maintain a DR site and the associated headaches can partner with a DRaaS provider. DRaaS provides cloud-based infrastructure to act as the target for data replication and recovery of critical data and business applications. This can be a great solution for an organization of any size that doesn’t want to maintain its own DR site yet still needs very high uptime guarantees.
  • MSP-Provided DR – Organizations that have their infrastructure hosted by a managed service provider (Infrastructure as a Service, or IaaS) can rely on their provider to maintain a DR site. It’s very important to make sure you have a documented agreement in place with the MSP stating how the recovery process will be structured, with a Service Level Agreement (SLA) detailing the priorities and requirements. SLAs should define the frequency and type of testing performed, along with any penalties if the objectives are not met.

 

Data Backup and Replication

Another key component in DR planning is making sure you have a solid data backup process in place. Your BIA will help determine if data replication is needed to your DR site in order to meet RTO requirements. A best practice in data backups is the 3-2-1 rule. Maintain at least three (3) copies of your data, on two (2) different types of media, and one (1) copy offsite. Make sure to isolate a backup copy from being accessible to malware/ransomware. Backups should be tested monthly to verify data can be restored and integrity is intact.
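
As a rough illustration of the 3-2-1 rule, the following Python sketch evaluates an inventory of data copies against each part of the rule, plus the isolation recommendation. The `BackupCopy` fields and the sample locations are hypothetical, and the sketch assumes the production copy counts as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str           # hypothetical label, e.g. "hq-nas" or "vault"
    media: str              # e.g. "disk", "tape", "cloud"
    offsite: bool           # stored away from the primary site
    isolated: bool = False  # offline, air-gapped, or immutable (out of malware's reach)


def check_3_2_1(copies: list[BackupCopy]) -> dict[str, bool]:
    """Evaluate each part of the 3-2-1 rule, plus the isolation recommendation."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
        "one_isolated": any(c.isolated for c in copies),
    }


if __name__ == "__main__":
    inventory = [
        BackupCopy("production", "disk", offsite=False),
        BackupCopy("hq-nas", "disk", offsite=False),
        BackupCopy("vault", "tape", offsite=True, isolated=True),
    ]
    for rule, passed in check_3_2_1(inventory).items():
        print(f"{rule}: {'ok' if passed else 'GAP'}")
```

An inventory check like this confirms that the copies exist; it does not replace the restore tests, which are what verify that the data is actually recoverable.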

 

Figure: 3-2-1 Backup Rule

 

The retention schedule for backups will vary by organization, but 14 days is a reasonable minimum. A more commonly recommended schedule is 30 days of backups plus quarterly and yearly archives.
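
One way to express a "30 days plus quarterly and yearly archives" schedule is sketched below in Python. The function name and the number of quarterly and yearly archives to keep (`quarters=4`, `years=7`) are assumptions chosen for illustration; your own retention requirements should set those values.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_days=30, quarters=4, years=7):
    """Return the set of backup dates to retain under a simple schedule.

    Keeps every backup from the last `daily_days` days, plus the last backup
    of each of the newest `quarters` quarters and the newest `years` years.
    The archive counts are illustrative assumptions, not a standard.
    """
    dates = sorted(d for d in backup_dates if d <= today)
    keep = {d for d in dates if today - d <= timedelta(days=daily_days)}

    last_in_quarter = {}
    last_in_year = {}
    for d in dates:  # ascending order, so later dates overwrite earlier ones
        last_in_quarter[(d.year, (d.month - 1) // 3)] = d
        last_in_year[d.year] = d

    keep.update(sorted(last_in_quarter.values())[-quarters:])
    keep.update(sorted(last_in_year.values())[-years:])
    return keep


if __name__ == "__main__":
    start, today = date(2023, 1, 1), date(2024, 6, 30)
    nightly = [start + timedelta(days=i) for i in range((today - start).days + 1)]
    kept = backups_to_keep(nightly, today)
    print(f"{len(kept)} of {len(nightly)} backups retained")
```

Anything not in the returned set is a candidate for expiry, which keeps storage growth predictable while preserving the long-term archives.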

 

Draft a Plan

An IT DR Plan is not the same as a Business Continuity Plan (BCP). The two may certainly complement each other, and for smaller organizations, the two plans may be combined. The IT DR plan should provide specific procedures for restoring the IT infrastructure, whereas the BCP is an organizational guide on how to respond to business interruptions.


The IT DR plan should include detailed procedures on the recovery process of the items identified in the BIA and follow the specified priorities. Your IT DR Plan should include procedures for restoring servers, endpoints, voice communications, network infrastructure, and any other technology your organization needs to maintain business operations. In addition to the core application, make sure other critical processes and applications are included. If you utilize an MSP for DR, you will need to coordinate with your provider on the content in the plan. Make sure the DR Plan includes provisions for potential issues, as recovery tends not to go as planned. It is very helpful to generate a checklist to be utilized during testing. This checklist will help manage the tasks that need to be accomplished during the testing. A list of all critical vendors and support resources should be included with contact information.


Your DR Plan should also include how testing will be performed to validate a successful recovery and what will be tested. Defining your DR Plan testing objectives is another very important decision that management must help make. Is it acceptable to restore servers and infrastructure and verify applications and data are accessible without performing transactions? Some larger organizations with more stringent uptime requirements may choose to go further with testing. If minimizing downtime is critical to your organization, you may choose to do a full restore of all critical business processes and run in a live environment at the DR site for a period of time. Before making that determination, you need to understand the risks associated with doing so. When doing a production test of this nature, many issues can arise, causing degraded performance, data loss, and downtime.


In addition, you need to plan for the transition back to the normal production network when testing is complete.

 

Types of DR Tests

One of the most important aspects of preparing for a disaster is testing. DR Testing will validate whether your organization can recover its business processes, including recovering and restoring your data, in a way that meets your recovery objectives. DR Testing will also help determine if the current DR process can meet the recovery time objectives specified in the BIA. DR and BCP testing should be performed annually at a minimum. Some larger organizations may require more frequent testing.


There are several phases in the DR Testing process, which may vary depending on how the DR infrastructure is structured (Cold Site will take a lot more resources and procedures than a Hot Site that has everything built and running).

  1. DR Plan Review – Start by reviewing the plan with limited participants to make sure the plan is accurate and understood. If using an MSP, they will be heavily involved in all phases.
  2. DR Walkthrough – Once verified, schedule a walkthrough with key personnel to further vet the plan for missing information or processes.
  3. DR Tabletop Testing – Now you’re ready for a tabletop with the DR team. The tabletop test is when you will gather the troops, identify responsibilities, and make sure everyone is able to perform the required actions. This phase would include all participants needed for application testing and data validation. It’s important to have clearly defined procedures and checklists for validation testing.
  4. Mock Testing – Before conducting an actual DR test, it’s recommended to do small-scale tests of portions of the infrastructure to verify the recovery process works. For example, if you are replicating virtual servers to the DR site, you may want to spin up a few VMs at the DR site to verify that replication is working correctly and the infrastructure can support the servers (be careful with duplicate IPs and host names to avoid impacting the production servers).
  5. Parallel Test – In a parallel test, the DR environment is used to restore the infrastructure in a manner that has little impact on the production environment. The production network continues to function while the DR infrastructure is being tested, and no live transactions are made to the DR infrastructure. Network isolation may be required if duplicate IP addressing is being used by the recovered infrastructure. Generally, parallel testing is performed during off-hours to avoid production impacts. Many organizations choose to do parallel testing rather than full failover testing due to the impact and risks associated with processing production transactions on the DR environment.
  6. Full Failover Test – The full failover DR test is the epitome of DR testing. If you are able to fail over to your DR site, successfully process and perform business operations from the DR site, and then fail back again, your organization should be able to weather most disasters smoothly from a business standpoint.
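
The duplicate IP and host name concern raised for mock and parallel tests can be screened before anything is powered on. The Python sketch below compares two inventories and reports overlaps; the `(hostname, ip)` tuple format and the sample hosts are assumptions, and how you export those inventories depends on your own tooling.

```python
import ipaddress


def find_conflicts(production, recovered):
    """Report recovered systems whose hostname or IP duplicates a production one.

    Both arguments are lists of (hostname, ip_string) tuples. Hostnames are
    compared case-insensitively; IPs are parsed so equivalent forms match.
    """
    prod_hosts = {host.lower() for host, _ in production}
    prod_ips = {ipaddress.ip_address(ip) for _, ip in production}

    conflicts = []
    for host, ip in recovered:
        if host.lower() in prod_hosts:
            conflicts.append((host, ip, "hostname"))
        if ipaddress.ip_address(ip) in prod_ips:
            conflicts.append((host, ip, "ip"))
    return conflicts


if __name__ == "__main__":
    # Hypothetical inventories for illustration only.
    production = [("core01", "10.0.0.5"), ("file01", "10.0.0.6")]
    recovered = [("CORE01", "10.0.0.5"), ("file01-dr", "10.50.0.6")]
    for host, ip, kind in find_conflicts(production, recovered):
        print(f"{host} ({ip}): duplicate {kind}, isolate before powering on")
```

Any conflict found here means the recovered system needs an isolated network segment or a changed identity before it is started alongside production.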


If your organization has very strict recovery guidelines, you may be required to perform a full production failover to the DR site. In this type of test, the DR site becomes the production site, and live transactions and changes are made to the DR environment. Generally, this environment is more complex and costly to build, and very specific steps may be required to perform the switchover. As mentioned above, this may introduce significant risks and impacts: the infrastructure needs to be able to synchronize back to the normal production network without data loss or integrity issues.


The catch to full failover testing is that it typically takes numerous iterations of testing to achieve a true full failover, which may take years to achieve (depending on your time and resources). While full failover testing is a tremendous undertaking, you’ll learn a lot of valuable lessons along the way.

 

Communication

Communication is a key piece before, during, and after all phases of testing. It’s vitally important to make sure management is aware of the process, has been involved in the planning and scheduling, and has been informed of the potential impacts of the testing. The test should be scheduled well in advance, and all affected parties should be notified and included in the testing as needed. Don’t forget to involve critical third parties in the testing and, potentially, your auditors to validate results (as an alternative to an audit, screenshots can be captured for validation).

 

Performing Your DR Test

Be sure to verify the test environment at least a week before the actual test, including:

  • DR Systems should be accessible and able to handle the workload of the recovery processes.
  • Patches and updates should be installed to match production.
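
Checking that DR systems match production patch levels amounts to a set comparison. The Python sketch below assumes you can export, per host, the list of installed patch identifiers from each environment; the host names and patch IDs shown are made up for illustration.

```python
def patch_drift(prod_patches, dr_patches):
    """Map each production host to the patches missing on its DR counterpart.

    Both arguments map a hostname to an iterable of installed patch IDs.
    A host absent from the DR inventory is reported as missing everything.
    """
    drift = {}
    for host, patches in prod_patches.items():
        missing = set(patches) - set(dr_patches.get(host, ()))
        if missing:
            drift[host] = sorted(missing)
    return drift


if __name__ == "__main__":
    # Hypothetical inventories; real data would come from your patch tooling.
    prod = {"core01": ["P-101", "P-102", "P-103"], "file01": ["P-101"]}
    dr = {"core01": ["P-101", "P-102"], "file01": ["P-101"]}
    for host, missing in patch_drift(prod, dr).items():
        print(f"{host}: DR copy is missing {', '.join(missing)}")
```

Running a comparison like this a week ahead leaves time to patch the DR systems before the test window.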


Items needed during testing:

  • IT DR Plan
  • Checklist
  • Important Contact List including vendors
  • Network Diagram
  • Internet Access for research if needed

 

The DR plan and checklist should be your primary run book during the test. If you identify missing or inaccurate information, take the time to make notes so it can be corrected once testing is complete. It may be beneficial to designate an individual as an official note-taker to make sure all activities are documented with a timestamp. The plan should clearly state what steps should be taken and what recovery priorities exist. As the test progresses, gaps will likely be identified in the plan, especially if it is being used for the first time.


Before the testing begins, communicate with all participants that you want to keep the process very methodical and organized without any sense of panic and chaos. The participants should all know their roles and responsibilities in advance of the actual testing. Designate one person (generally the manager) as the lead during testing to step the team through each phase of the plan and guide the activities.


While the test is being performed, keep the team informed on the status of the recovery processes and where you are in the plan. If issues are being identified and addressed as you’re testing, other dependencies may be impacted and slow the overall progress. Assignments may need to be shifted to help with issues, make use of all resources, and keep the recovery timeframes as short as possible. As systems are recovered, verify functionality and data integrity as the plan requires. A timestamped screenshot may be used for validation, and the timestamp also helps identify whether RTOs are being met.
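
The timestamps captured during the test can be turned into an RTO check with very little code. The Python sketch below compares each system's recovery time against its RTO, measured from the moment the test "disaster" was declared. The system names, times, and RTO values are hypothetical.

```python
from datetime import datetime, timedelta


def rto_report(declared_at, recovered_at, rto_hours):
    """Summarize, per system, whether recovery finished within its RTO.

    declared_at  -- datetime the test disaster was declared
    recovered_at -- dict of system name -> datetime it was validated as recovered
    rto_hours    -- dict of system name -> RTO in hours (from the BIA)
    """
    report = {}
    for system, rto in rto_hours.items():
        finished = recovered_at.get(system)
        if finished is None:
            report[system] = "not recovered"
            continue
        elapsed = finished - declared_at
        status = "met" if elapsed <= timedelta(hours=rto) else "missed"
        report[system] = f"{status} ({elapsed} elapsed, RTO {rto}h)"
    return report


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 8, 0)
    recovered = {
        "core": datetime(2024, 1, 1, 11, 30),
        "email": datetime(2024, 1, 2, 9, 0),
    }
    rtos = {"core": 4, "email": 24, "phones": 8}
    for system, result in rto_report(declared, recovered, rtos).items():
        print(f"{system}: {result}")
```

The resulting summary can go straight into the lessons-learned discussion and the report to management.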


Once the testing is complete, follow the plan to revert all systems back to production mode.


Very Important: When DR Testing is complete and all changes reverted back to production, verify production systems are working normally, and data is accessible.


It’s a good idea to inform management that the testing is complete and provide a summary of the outcome.

 

Reporting – Lessons Learned

After the completion of testing, the team should meet to discuss key takeaways. It’s important to take this opportunity to improve the recovery process and utilize lessons learned as a guide.

  • If issues were encountered during testing, how might the same issues be avoided in a live scenario?
  • What improvements can be made to the process to improve disaster recovery timeframes or procedures in the future?
  • Were you able to meet your RTOs?
  • Did the DR Plan provide proper guidance?
  • Did you have the resources needed to recover effectively?
  • Are there technology issues that need to be addressed?


Update the DR Plan accordingly to incorporate any additions and improvements. The results of the testing and lessons learned should be formally documented and provided to management/Board of Directors.


The report should include the following topics:

  • Scope and objectives
  • Resources utilized
  • Functionality in DR and/or limitations
  • Issues identified and what improvements are needed
  • If retesting is required and when
  • Overall summary

 

Summary

The goal is to make sure your organization will be able to recover IT in a timely manner when a disaster occurs. A prolonged loss of IT functions in a disaster can lead to business failure. For example, a frequently cited statistic attributed to the U.S. National Archives & Records Administration holds that approximately 93 percent of companies that lose their computer systems for 10 days or more due to a disaster file for bankruptcy within one year of the event.


It is vitally important to have open communication and management involvement when preparing your DR Plan and testing. This process should not be the responsibility of one individual, but rather a group effort.

Effective disaster recovery testing can reduce downtime and save your organization time and money in the event of an actual disaster, but getting to the point of creating a valuable DR testing process can take time. As with all things valuable, maturing your DR testing processes doesn’t happen overnight, but increasing your confidence in your ability to recover from a disaster is worth the effort.

 

 


Written by: 
Terry Kuxhaus
Information Security Consultant - SBS CyberSecurity, LLC 


 

SBS Resources: 
  • {Blog} What Does a Good BIA Look Like?: When creating a BIA, there are three (3) main components you should address to get the best results: 1) Impacts, 2) Timeframes, and 3) Dependencies. This article covers each of these BIA components, along with a little information on your business processes themselves.
  • {Blog} How to Gain Additional Value from Your BIA: A lot of effort goes into building out a BIA that meets regulation, so you might as well make sure you are using that information to benefit your overall Business Continuity Plan, help you mitigate additional risk to your organization, and make better business decisions.

 


Posted: Tuesday, June 30, 2020
Categories: Blog