When a business continuity plan is non-functional
Failed Business Continuity – This morning about 2:00 AM MST one of the largest providers of cloud services went down. As I write this it is 11:30 AM MST and the service is still down.
It seems that their entire network – both the east coast and west coast is down. I talked to their corporate office and at this time they have no idea as to when they will be back up and at the same time the person I talked to said he did not know what their business continuity plan was since this was a nation-wide failure in their network.
They should have followed the 10 commandments that we published earlier.
- Analyze single points of failure: A single point of failure in a critical component can disrupt well engineered redundancies and resilience in the rest of a system.
- Keep updated notification trees: A cohesive communication process is required to ensure the disaster recovery business continuity plan will work.
- Be aware of current events: Understand what is happening around the enterprise – know if there is a chance for a weather, sporting or political event that can impact the enterprise’s operations.
- Plan for worst-case scenarios: Downtime can have many causes, including operator error, component failure, software failure, and planned downtime as well as building- or city-level disasters. Organizations should be sure that their disaster recovery plans account for even worst-case scenarios.
- Clearly document recovery processes: Documentation is critical to the success of a disaster recovery program. Organizations should write and maintain clear, concise, detailed steps for failover so that secondary staff members can manage a failover should primary staff members be unavailable.
- Centralize information – Have a printed copy available: In a crisis situation, a timely response can be critical. Centralizing disaster recovery information in one place, such as a Microsoft Office SharePoint® system or portal or cloud, helps avoid the need to hunt for documentation, which can compound a crisis.
- Create test plans and scripts: Test plans and scripts should be created and followed step-by-step to help ensure accurate testing. These plans and scripts should include integration testing silo testing alone does not accurately reflect multiple applications going down simultaneously.
- Retest regularly: Organizations should take advantages of opportunities for disaster recovery testing such as new releases, code changes, or upgrades. At a minimum, each application should be retested every year.
- Perform comprehensive recovery and business continuity test: Organizations should practice their master recovery plans, not just application failover. For example, staff members need to know where to report if a disaster occurs, critical conference bridges should be set up in advance, a command center should be identified, and secondary staff resources should be assigned in case the event stretches over multiple days. In environments with many applications, IT staff should be aware of which applications should be recovered first and in what order. The plan should not assume that there will be enough resources to bring everything back up at the same time.
- Defined metrics and create score cards scores: Organizations should maintain scorecards on the disaster recovery compliance of each application, as well as who is testing and when. Maintaining scorecards generally helps increase audit scores.