Those who violate the 10 commandments of business continuity planning fail

When a business continuity plan is non-functional

Failed Business Continuity – This morning, at about 2:00 AM MST, one of the largest providers of cloud services went down. As I write this, it is 11:30 AM MST and the service is still down.

It seems that their entire network, both east coast and west coast, is down. I talked to their corporate office, and at this time they have no idea when they will be back up; the person I spoke with also said he did not know what their business continuity plan was, since this was a nationwide failure of their network.

They should have followed the 10 commandments that we published earlier.

  1. Analyze single points of failure: A single point of failure in a critical component can disrupt well-engineered redundancies and resilience in the rest of a system.
  2. Keep updated notification trees: A cohesive communication process is required to ensure the disaster recovery business continuity plan will work (see the sketch after this list).
  3. Be aware of current events: Understand what is happening around the enterprise – know if there is a chance for a weather, sporting or political event that can impact the enterprise’s operations.
  4. Plan for worst-case scenarios: Downtime can have many causes, including operator error, component failure, software failure, and planned downtime as well as building- or city-level disasters. Organizations should be sure that their disaster recovery plans account for even worst-case scenarios.
  5. Clearly document recovery processes: Documentation is critical to the success of a disaster recovery program. Organizations should write and maintain clear, concise, detailed steps for failover so that secondary staff members can manage a failover should primary staff members be unavailable.
  6. Centralize information – Have a printed copy available: In a crisis situation, a timely response can be critical. Centralizing disaster recovery information in one place, such as a Microsoft SharePoint® site, a portal, or the cloud, helps avoid the need to hunt for documentation, which can compound a crisis.
  7. Create test plans and scripts: Test plans and scripts should be created and followed step-by-step to help ensure accurate testing. These plans and scripts should include integration testing; silo testing alone does not accurately reflect multiple applications going down simultaneously.
  8. Retest regularly: Organizations should take advantage of opportunities for disaster recovery testing, such as new releases, code changes, or upgrades. At a minimum, each application should be retested every year.
  9. Perform comprehensive recovery and business continuity tests: Organizations should practice their master recovery plans, not just application failover. For example, staff members need to know where to report if a disaster occurs, critical conference bridges should be set up in advance, a command center should be identified, and secondary staff resources should be assigned in case the event stretches over multiple days. In environments with many applications, IT staff should be aware of which applications should be recovered first and in what order. The plan should not assume that there will be enough resources to bring everything back up at the same time.
  10. Define metrics and create scorecards: Organizations should maintain scorecards on the disaster recovery compliance of each application, as well as who is testing and when. Maintaining scorecards generally helps increase audit scores.
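
As an illustration of commandment 2, here is a minimal sketch of a notification tree kept in code so it can be reviewed, diffed when staff change, and exercised like any other disaster recovery asset. The Contact structure, the notify routine, and the names and numbers in it are illustrative placeholders only, not part of any published Janco template; a real implementation would page, text, or email each contact and record acknowledgements.

    # Minimal notification-tree sketch (commandment 2). All names and numbers
    # are illustrative placeholders, not part of any Janco template.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Contact:
        name: str
        role: str
        phone: str
        reports: List["Contact"] = field(default_factory=list)  # contacts this person calls next

    def notify(contact: Contact, message: str, depth: int = 0) -> None:
        # In a real plan this step would page/SMS/email the contact and record
        # whether they acknowledged within the required time window.
        print("  " * depth + f"Call {contact.name} ({contact.role}) at {contact.phone}: {message}")
        for person in contact.reports:
            notify(person, message, depth + 1)

    if __name__ == "__main__":
        tree = Contact("DR Coordinator", "Incident commander", "555-0100", reports=[
            Contact("Network Lead", "Core routing and failover", "555-0110", reports=[
                Contact("NOC On-call", "24x7 monitoring", "555-0111"),
            ]),
            Contact("Communications Lead", "Customer and status updates", "555-0120"),
        ])
        notify(tree, "Primary provider unreachable; invoke the BC/DR plan")

Because the tree is plain data, the same structure can drive an automated call-out drill as part of commandment 8, and its output is easy to print for the hard copy called for in commandment 6.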

Order Disaster Plan Template | Disaster Plan Sample

Author: Victor Janulaitis

M. Victor Janulaitis is the CEO of Janco Associates. He has taught at the USC Graduate School of Business and been a guest lecturer at UCLA's Anderson School of Business, a graduate school at Harvard University, and several other universities in various programs.

2 thoughts on “Those who violate the 10 commandments of business continuity planning fail”

  1. Open Letter to Intermedia.net’s management

    I see from your site that you have a lot of certifications, but they do not mean anything. You had a single point of failure that was not adequately planned for, and your staff responded accordingly. We at least had secondary email addresses and, after one hour, changed the appropriate records to route our email to another provider.

    The fact that you did not have a functional Business Continuity process in place is appalling. You sell your service on the notion that you have 99.99% availability. The purpose of a good plan is to deal with the 0.01%. The issue is how QUICKLY you come back up and that the problem resolution process does not become part of the problem. We were down with no email for almost 12 hours, and that is totally unacceptable.
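
    To put that availability claim in perspective, here is the simple arithmetic (my own back-of-the-envelope figures, not your published SLA terms): 99.99% availability budgets roughly 53 minutes of downtime per year, and this single outage burned roughly 720 minutes.

        # Downtime budget implied by a 99.99% availability claim.
        minutes_per_year = 365.25 * 24 * 60
        budget = (1 - 0.9999) * minutes_per_year  # about 53 minutes per year
        print(f"99.99% availability allows about {budget:.0f} minutes of downtime per year")
        print(f"A 12 hour outage is {12 * 60} minutes, about {12 * 60 / budget:.0f}x that annual budget")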

    From what I can see the following are issues:

    1. Your first fix failed and it took too long for your staff to realize it – who was “center posting” the customer service level and RTO?

    2. From what I can see, your organization could NEVER have tested the conditions that occurred. What if you had a hardware problem on your core router – do you have a spare in place? Could you bring it up as a secondary path while you tried to figure out what had happened?

    3. Your phone support was NON-existent and your phone went directly to a busy signal. Why do you not have at least some other lines people could use (that is a simple router change), or did you have all of your eggs in one basket?

    4. You could have POSTED a LARGE banner on your site's home page pointing to your Twitter or Facebook feed for status updates. During most of the outage your home page did display.

    There are lots of things that could have been done; however, they were not, and it cost us dearly. You should FIRE whoever created your business continuity plan and get professionals in there who know what they are doing.

    FYI – I, with a consulting team, created the disaster recovery business continuity plan that Merrill Lynch had to implement on 9/11. As a result of proper planning and redundancy, ML lost only 52 seconds of transactions and would have been able to run all transactions if the exchanges had not shut down.

  2. We were down for over 12 hours today. Intermedia told us they had a network-down issue and pointed us to their status blog.

    We went to the blog and it was DOWN. What a 3rd-class operation they are. We do not know if the issue was hardware, software, human error, or a router. There is no assurance that this will not happen again, and it is very clear they do not have a functional business continuity process in place.

    Time to look for another provider of Exchange and VoIP — this time it will be two separate vendors.
