Disaster recovery testing is not always a favorite topic of IT professionals. For administrators, it often requires late nights or weekend work, and for managers, it can be costly and disruptive to business operations. Ironically, organizations put HA/DR solutions and plans into effect to keep applications and business services up and available, while the testing of these plans can mean downtime. Another issue is simply keeping the secondary systems or the DR site configured consistently with the primary systems or location. Simple configuration changes on the primary systems or site, if not mirrored on the secondary systems, can have major consequences for the organization's ability to fail over and recover properly when an incident occurs, whether it is a major catastrophe or a simple error. Oh, and don't forget the speed of recovery. Some applications have very stringent Recovery Time Objectives (RTO) that must be met for business purposes.
To address these challenges, organizations have several choices to help them test their disaster recovery readiness such as walkthroughs, tabletop exercises, simulations and full tests. These choices are defined as follows:
- Walkthroughs - During a walkthrough, key stakeholders in the plan meet to review the layout and contents of a plan. These aren't really "tests." They won't validate your technology or validate your recovery capabilities, but they are good exercises to familiarize stakeholders with their roles and responsibilities in the plan.
- Tabletop exercises - These rehearse a specific threat scenario. They're similar to plan walkthroughs, but they introduce a trigger event such as a pandemic, flood, or hazardous material accident so participants can discuss their response and recovery activities in the plan.
- Simulation - During a simulation, the DR manager invokes the plan in a controlled situation that does not impact business operations. A common approach to simulation involves the use of data replicas at the recovery site. IT professionals briefly suspend data replication between the production and recovery sites to create a replica of production data using storage or server-based snapshot/cloning technology. Then replication is resumed. The production replicas are then mounted to redundant servers at the recovery site, and applications and IT systems are recovered and restarted using the replicas. Business and application users perform functional tests on these alternate systems.
- Full test - During a full test, IT professionals perform an actual failover of IT systems and end-user processing to the recovery site. This truly tests the DR plan but is risky because it will impact production if the cutover fails. Plus, you have to successfully fail back once the test is complete. IT professionals will find that business owners are wary of scheduling and performing these types of tests, despite their inherent value.
All of these processes for DR testing have their pros and cons. In general, enterprises need to develop test strategies that leverage all the test types. IT professionals should conduct walkthroughs and tabletop exercises quarterly and simulations at least twice per year or whenever there is a major configuration change in the environment. IT professionals should strive to conduct full tests once per year. Full tests are easier with certain types of active-active data center configurations. For example, companies that execute planned workload rotations between data centers are very confident in their ability to execute DR plans because failover is now a regular part of IT operations.
Generally speaking, Symantec recommends the following best practices for DR testing and planning:
- Test regularly - more is better! However, in order to achieve this, IT requires a solution that is non-disruptive and transparent. You don't want to take your primary applications off-line if you can avoid it, especially if those applications are business critical.
- Test using different personnel. Make sure all of your people are familiar with the plan and know their roles if a problem occurs. It's also important to see if you can implement tools that support all of the platforms and applications you are running. That way training and knowledge are less of an issue in a crisis, as people will know what to do when it counts.
- Test after significant changes to the infrastructure. Even the most thorough IT organizations are bound to miss something when dealing with complex architectures found in large enterprise data centers. Ensure that nothing has been left to chance and use automation where you can.
- If your test fails, re-test to make sure you can meet your objectives. If you have the right tools in place, re-testing should be less painful and give you peace of mind knowing your organization is prepared if a real incident occurs.
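To decide whether a test "failed" in the sense of the last point, it helps to compare measured recovery times against each application's RTO. Here is a minimal Python sketch under stated assumptions: the application names and durations are invented for illustration, and `rto_misses` is a hypothetical helper, not part of any product.

```python
from datetime import timedelta


def rto_misses(rto_targets, measured):
    """Return the applications that need a re-test.

    rto_targets maps an application name to its Recovery Time Objective.
    measured maps an application name to the recovery time observed
    during the test. An application is reported if it exceeded its RTO
    or was never recovered at all (reported with a value of None).
    """
    misses = {}
    for app, target in rto_targets.items():
        actual = measured.get(app)
        if actual is None or actual > target:
            misses[app] = actual
    return misses


if __name__ == "__main__":
    # Hypothetical applications and objectives.
    targets = {
        "billing": timedelta(minutes=15),
        "intranet": timedelta(hours=4),
        "reporting": timedelta(hours=8),
    }
    observed = {
        "billing": timedelta(minutes=22),
        "intranet": timedelta(hours=1),
    }
    for app, actual in rto_misses(targets, observed).items():
        print(app, "needs a re-test; measured:", actual)
```

Treating an application with no measurement as a miss is a deliberate choice here: an application that was never brought up during the test has not demonstrated that it can meet its objective.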
Of course this wouldn’t be a Connect blog without mentioning how Symantec provides solutions to help you take the risk and pain out of DR testing. First, Fire Drill is a feature included in Veritas Cluster Server that fails applications over to the secondary site on a test basis. Once the applications are up and running at the secondary site, users are connected to see if the application and data are available as expected. This can all be done non-disruptively, meaning the applications are still running at the primary site and normal operations are still occurring, without an impact to performance. The test is transparent to end users and customers.
Another solution that Symantec offers is called Disaster Recovery Advisor. Disaster Recovery Advisor helps IT organizations identify any configuration gaps between the primary and secondary systems. These configuration checks work for systems used for local high availability, as well as for geographically dispersed data centers that are required for wide area disaster recovery. If there are configuration issues, Disaster Recovery Advisor will report the problems and suggest remedies to fix them. Like Fire Drill, these configuration scans and checks are non-disruptive and transparent to end users.
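To make the idea of a "configuration gap" concrete, here is a minimal Python sketch of a drift check between two sets of settings. It is only an illustration of the concept and says nothing about how Disaster Recovery Advisor works internally. The setting names and values are invented, and `config_gaps` is a hypothetical helper.

```python
MISSING = "<missing>"  # placeholder for a setting present on only one side


def config_gaps(primary, secondary):
    """Return (setting, primary_value, secondary_value) for every mismatch.

    Both arguments are flat dictionaries of setting name to value.
    Settings that exist on only one side are reported with MISSING
    on the other side.
    """
    gaps = []
    for key in sorted(primary.keys() | secondary.keys()):
        p = primary.get(key, MISSING)
        s = secondary.get(key, MISSING)
        if p != s:
            gaps.append((key, p, s))
    return gaps


if __name__ == "__main__":
    # Hypothetical settings collected from each site.
    primary = {"db_version": "11.2", "volume_size_gb": 500, "ntp_server": "10.0.0.5"}
    secondary = {"db_version": "11.1", "volume_size_gb": 500}
    for name, p, s in config_gaps(primary, secondary):
        print(f"{name}: primary={p} secondary={s}")
```

A check like this only finds differences; deciding which ones actually threaten a failover still takes knowledge of the applications involved.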
With automated solutions in place, you can test your DR plans during normal operating hours without interrupting the business, removing the need for admins to spend late nights and weekends working on DR tests. You can test and re-test whenever you want, without interruption to service, while including as many people in the procedure as you need to. With Fire Drill and Disaster Recovery Advisor you will have full results from the testing reports, so you can remedy any problems before a real crisis strikes. If knowing your applications will recover when you need them to is a concern, please see the following sites for more information: