A.17 Resolver Disaster Recovery Plan

1. Introduction

Resolver’s Hosted Platform Disaster Recovery Plan (DRP) encompasses the applications, application environment, network, and data communications infrastructure involved in the product. Any event that has a negative impact on a company’s business continuity or finances could be termed a disaster. This includes hardware or software failure, a network outage, a power outage, physical damage to a building such as fire or flooding, human error, or some other significant event. To mitigate the risk of a disaster, whether natural, man-made, or an act of God, the company has developed a detailed Hosted DRP. The plan includes strategies and efforts that the company’s technical and management personnel will need to perform before, during, and after a disruption.

2. Purpose and scope

The DRP provides guidelines for declaring a disaster and defines the roles and responsibilities assigned to each team involved.

The purpose of this policy is to protect the confidentiality, integrity, and availability of Resolver’s and its customers’ information by controlling remote access to Resolver’s IT systems.

The Hosted DRP describes the step-by-step process for recovering from the loss of Resolver’s Platform. It includes guidelines on work priorities to ensure that applications and systems are recovered in a timely fashion.

The purpose of the DRP is to define precisely how Resolver will recover its IT infrastructure, IT services, and all data (including personal data) within set deadlines in the case of a disaster or other disruptive incident. The objective of this Plan is to complete the recovery of IT infrastructure, IT services, and data within the set recovery time objective (RTO).

This Plan includes all resources and processes necessary for the recovery and covers all the information security aspects of business continuity management.

Users of this document are members of top management and the employees needed to carry out the recovery.

This Plan applies to:

  • All AWS Hosted Production Environments.

3. Recovery Time Objective and Recovery Point Objective definition

This Plan uses two common industry terms for disaster planning:

Recovery time objective (RTO): The time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.

Recovery point objective (RPO): The acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).

A company typically decides on an acceptable RTO and RPO based on the financial impact to the business when systems are unavailable. The company determines financial impact by considering many factors, such as the loss of business and damage to its reputation due to downtime and the lack of systems availability.

IT organizations then plan solutions to provide cost-effective system recovery based on the RPO within the timeline and the service level established by the RTO.
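
The two worked examples in the definitions above can be reproduced with simple date arithmetic. The following is a minimal sketch, not part of Resolver's tooling; the function name `recovery_targets` and the calendar date are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def recovery_targets(disaster_time: datetime, rto: timedelta, rpo: timedelta):
    """Return (restore_deadline, oldest_acceptable_recovery_point)."""
    restore_deadline = disaster_time + rto   # service must be back by this time
    recovery_point = disaster_time - rpo     # data before this time must survive
    return restore_deadline, recovery_point


# Disaster at noon, RTO of eight hours, RPO of one hour (the date is arbitrary).
noon = datetime(2020, 7, 1, 12, 0)
deadline, recovery_point = recovery_targets(
    noon, rto=timedelta(hours=8), rpo=timedelta(hours=1)
)
print(deadline.strftime("%H:%M"))        # 20:00
print(recovery_point.strftime("%H:%M"))  # 11:00
```

The interval between the recovery point and the disaster time is the window of data that may be lost.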

4. Reference documents

  • ISO/IEC 27001:2013 standard, control A.17

5. Background

  • A reliable and regularly tested backup strategy is the cornerstone of Resolver’s disaster recovery strategy.

5.1. Database Audit and Data Collection

The DRP team verifies that database data is backed up daily. The backup and restoration of all databases are regularly tested.

5.2. Network Audit

All of Resolver’s Cloud Platform applications are hosted in Amazon Web Services (AWS) data centers.

5.3. Final Review

A Resolver DRP was developed to map each team’s tasks and the interdependencies of those tasks.

The DRP is reviewed and tested on an annual basis to keep the plan in sync with current business and Resolver’s Platform environment needs.

6. Guidelines for Declaring a Disaster

A disaster will be declared if Resolver’s Platform is inaccessible for four consecutive hours or if Resolver management believes Resolver’s Platform will be unavailable for a twenty-four (24) hour period. Declaring a Disaster Recovery Event is a serious step, and a conservative approach will be taken when a decision is required.

If Resolver’s Platform production facility is destroyed, a disaster will be declared immediately.

Declaring a disaster is the responsibility of the Development Operations (DevOps) team and the COO.
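
The declaration criteria in this section can be written down as a simple rule. The sketch below is illustrative only: the function and parameter names are assumptions, and the actual decision remains with the DevOps team and the COO rather than with any automated check.

```python
OUTAGE_THRESHOLD_HOURS = 4       # Platform inaccessible for four consecutive hours
EXPECTED_THRESHOLD_HOURS = 24    # management expects a 24-hour unavailability


def should_declare_disaster(
    hours_inaccessible: float,
    expected_unavailable_hours: float,
    facility_destroyed: bool,
) -> bool:
    """Return True when any of the guideline criteria is met."""
    if facility_destroyed:
        return True  # declared immediately
    if hours_inaccessible >= OUTAGE_THRESHOLD_HOURS:
        return True
    return expected_unavailable_hours >= EXPECTED_THRESHOLD_HOURS
```

For example, an outage of three and a half hours with an expected total of twelve hours does not meet the criteria, while a destroyed production facility does regardless of the other inputs.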

7. Roles and Responsibilities

7.1. Disaster Recovery Team and Executive

The COO is the responsible Disaster Recovery Executive. In the COO’s absence, the VP of Engineering and/or the CEO are authorized to make decisions on behalf of Resolver. The DevOps team member will advise the Disaster Recovery Executive about the potential Disaster Recovery situation. The Disaster Recovery Executive will determine whether an incident should be classified as a Disaster Recovery Event, which will put the Hosted Disaster Recovery Plan into effect. Furthermore, the Disaster Recovery Executive will notify the Director of Customer Service and the EVP of Operations that a Disaster Recovery Event has been declared. The Director of Customer Service will determine the information that will be communicated to customers.

Key Personnel Contact Info (Call tree)

| Name and Title | Contact details | Contact Numbers |
| --- | --- | --- |
| BCP Coordinator | Mobile phone | |
| | Work phone | |
| | Email Address | |
| | Alternative email | |
| BCP Coordinator | Mobile phone | |
| | Work phone | |
| | Email Address | |
| VP Customer Success, Communications Team Lead | Mobile phone | |
| | Work phone | |
| | Email Address | |
| | Alternative email | |
| CTO | Mobile phone | |
| | Work phone | |
| | Email Address | |
| DevOps Director | Mobile phone | |
| | Work phone | |
| | Email Address | |
| Information Security Analyst | Mobile phone | |
| | Work phone | |
| | Email Address | |

7.2. Manager of DevOps

The Manager of DevOps has overall responsibility to ensure that the Hosted Disaster Recovery Plan and Disaster Recovery environment are properly maintained and tested. This person is also the leader of the Disaster Recovery Team and will lead the Disaster Recovery Team in implementing the Hosted Disaster Recovery Plan.

7.3. DevOps team

If a potential Disaster Recovery Event occurs after hours, the On-Call DevOps team member is responsible for identifying a possible incident and following the escalation process.

7.4. Hosting Personnel

The hosting personnel located at the Production Site(s) will assist in assessing the impact of the incident and reporting this information to Resolver.

7.5. Communicating During a Disaster

The Communications Team is responsible for ensuring that all stakeholders are updated regularly, every 4 hours.

7.5.1. Communicating with Employees

The Communications Team is responsible for ensuring that the entire company has been notified of the disaster. The best and/or most practical means of contacting all employees will be used, in the following order of preference:

  • Slack
  • SMS to employee cell phone
  • Corporate email
  • Telephone call to employee cell phone
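
One way to read the ordered list above is as a fallback chain: try each channel in turn and stop at the first one that reaches the employee. The sketch below assumes that interpretation. The `Sender` callables are hypothetical stand-ins for whatever Slack, SMS, email, and telephony integrations are actually in use; none of them is defined by this Plan.

```python
from typing import Callable, Iterable, Optional, Tuple

# A sender takes (employee, message) and returns True if delivery succeeded.
Sender = Callable[[str, str], bool]


def notify_employee(
    employee: str,
    message: str,
    channels: Iterable[Tuple[str, Sender]],
) -> Optional[str]:
    """Try channels in preference order; return the name of the first that works."""
    for name, send in channels:
        try:
            if send(employee, message):
                return name
        except Exception:
            # A failing channel must not block the remaining ones.
            continue
    return None  # nobody reached; escalate manually
```

The caller would pass the channels in the order listed above, for example `[("Slack", send_slack), ("SMS", send_sms), ("Email", send_email), ("Telephone", place_call)]`, where each function name is a placeholder.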

7.5.2. Communicating with Customers

The Director of Customer Service is responsible for notifying customers of the declaration of a Disaster Recovery Event. The Director of Customer Service will inform customers of the nature of the disaster and the estimated time to recovery. Customers will receive regular communications (every 4-6 hours) on status and progress for the duration of the Disaster Recovery Event.

7.5.3. Communicating with Media

The Public Relations Representative (PR Rep) is a member of the executive staff or their assigned representative. This person is the only Resolver representative authorized to give any statement to the media.

Important note: All Disaster Recovery team members will refer members of the media to the CEO and CRO.

8. Disaster Recovery Process Prerequisites

The disaster recovery process consists of defining rules, processes, and disciplines to ensure that critical business processes continue to function if one or more of the information processing or telecommunications resources on which they depend fail. The following are key elements of a disaster recovery plan:

  • Create a list of key assets for each system
  • Perform risk assessment and audits of the assets
  • Establish priorities for applications and networks
  • Develop recovery strategies
  • Prepare inventory and documentation of the plan
  • Develop verification criteria and procedures
  • Implement the plan

Key people from each business unit should be members of the team and included in all disaster recovery planning activities. The disaster recovery planning group needs to understand the business processes, technology, networks, and systems in order to create a DRP. A risk and business impact analysis should be prepared by the disaster recovery planning group that includes at least the top ten potential disasters. After analyzing the potential risks, priority levels should be assigned to each business process and application/system.

It is important to keep the inventory up to date and to maintain a complete list of equipment (physical and virtual), third-party services on which our products depend, locations, and points of contact.

The goal is to provide viable, effective, and economical recovery across all technology domains.

Each product group should create a list of assets and can use the following chart to classify them:

| Classification | Description |
| --- | --- |
| 1 Mission Critical | Mission Critical to accomplishing the mission of the organization. It can be performed only by computers. No alternative manual processing capability exists. It must be restored within 36 hours. |
| 2 Critical | Critical in accomplishing the work of the organization. Primarily performed by computers. It can be performed manually for a limited time period. Must be restored starting at 36 hours and within 5 days. |
| 3 Essential | Essential in completing the work of the organization. Performed by computers. It can be performed manually for an extended time period. It can be restored as early as 5 days, however, it can take longer. |
| 4 Non-Critical | Non-Critical to accomplishing the mission of the organization. It can be delayed until the damaged site is restored and/or a new computer system is purchased. It can be performed manually. |
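
A product group could record the classification chart in code and use it to order its asset list for recovery. The following is a minimal sketch under that assumption; the `Asset` type and the sample inventory entries are hypothetical and do not describe Resolver's actual systems.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Classification(IntEnum):
    """Asset classes from the chart; a lower value is restored first."""
    MISSION_CRITICAL = 1
    CRITICAL = 2
    ESSENTIAL = 3
    NON_CRITICAL = 4


# Restoration windows as stated in the chart.
RESTORE_WINDOW = {
    Classification.MISSION_CRITICAL: "within 36 hours",
    Classification.CRITICAL: "starting at 36 hours and within 5 days",
    Classification.ESSENTIAL: "as early as 5 days, possibly longer",
    Classification.NON_CRITICAL: "after the damaged site is restored",
}


@dataclass(frozen=True)
class Asset:
    name: str
    classification: Classification


def recovery_order(assets: Iterable[Asset]) -> List[Asset]:
    """Sort assets so the most critical class comes first (stable sort)."""
    return sorted(assets, key=lambda asset: asset.classification)


# Hypothetical inventory entries, for illustration only.
inventory = [
    Asset("internal wiki", Classification.NON_CRITICAL),
    Asset("customer database", Classification.MISSION_CRITICAL),
    Asset("reporting service", Classification.ESSENTIAL),
]
for asset in recovery_order(inventory):
    print(asset.name, "-", RESTORE_WINDOW[asset.classification])
```

Using an integer enum keeps the ordering tied to the chart's own numbering, so adding a new asset only requires assigning it a class.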

9. Disaster Recovery Site

  • All components of Resolver’s shared production environments are redundant.

10. Disaster Recovery Process

10.1. Notification Phase

This phase includes the activities to notify the Disaster Recovery Executive of a possible disaster, direct the Disaster Recovery Team to assess the damage to Resolver’s Platform, and begin the Disaster Recovery process if necessary.

10.1.1. Disaster Recovery Procedure

  1. The DevOps Team member will determine, if possible, the nature and impact of the incident:
    1. Open the AWS Service Health Dashboard to check the status of the key AWS services on which Resolver’s production environments depend:
      1. Amazon Elastic Compute Cloud (EC2)
      2. Amazon Elastic Block Store (EBS)
      3. Amazon Elastic Container Service (ECS)
      4. Amazon Elastic Container Registry (ECR)
      5. Amazon Simple Storage Service (S3)
      6. Amazon Virtual Private Cloud (VPC)
      7. AWS VPN Gateway
      8. AWS Regions
      9. AWS Availability Zones (AZ)
      10. AWS Key Management Service (KMS)
      11. AWS Identity and Access Management (IAM)
      12. AWS Elastic Load Balancing (ELB) (layer 4)
      13. AWS Application Load Balancer (ALB) (layer 7)
      14. AWS Web Application Firewall (WAF)
      15. Amazon Route 53 (DNS)
      16. Amazon CloudFront
      17. Amazon RDS for PostgreSQL
      18. Amazon Elasticsearch Service
      19. Amazon ElastiCache
      20. Amazon Simple Email Service (SES)
      21. Amazon CloudWatch
      22. Amazon VPC Flow Logs
      23. AWS Config
      24. Amazon GuardDuty
      25. AWS Certificate Manager (ACM)
      26. AWS CloudTrail
      27. AWS Shield
      28. AWS Lambda

  2. The DevOps Team member will notify key personnel of the incident. At a minimum, the following personnel must be notified:

  • CEO
  • COO
  • VP Customer Success

The DevOps Team member will include the following notification information if applicable:

  • Nature of the emergency
  • Loss of life or injuries
  • Known damage estimates

  3. The Disaster Recovery Executive will determine how to proceed. The following actions may be taken:

    a) Require the Disaster Recovery Team to conduct a further damage assessment. Information on the following items will be reported back to the Disaster Recovery Executive hourly:

  • Cause of the emergency or disruption
  • Potential for additional disruptions
  • Physical Infrastructure status
  • Items to be replaced
  • Estimated time to restore to normal services

    b) Determine the extent of the incident and whether it qualifies as a Disaster Recovery Event as defined by the company’s Disaster Recovery Guidelines. If so, the Disaster Recovery Executive will instruct the Disaster Recovery Team to activate the Hosted Disaster Recovery Plan.
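
The status check in step 1 of this procedure could be partly automated by filtering health events down to the services Resolver depends on. The sketch below is an assumption-laden illustration: it expects dictionaries shaped like the events returned by the AWS Health API `describe_events` operation (`service`, `eventTypeCategory`, `statusCode`), the service codes in `WATCHED_SERVICES` are an illustrative subset rather than a verified list, and it does not replace checking the dashboard by hand.

```python
from typing import Dict, Iterable, List, Set

# Illustrative subset of service codes; the real list would mirror step 1.
WATCHED_SERVICES: Set[str] = {"EC2", "S3", "RDS"}


def impacted_services(
    events: Iterable[Dict[str, str]],
    watched: Set[str] = WATCHED_SERVICES,
) -> List[str]:
    """Return watched services that currently have an open issue."""
    return sorted(
        {
            event["service"]
            for event in events
            if event.get("eventTypeCategory") == "issue"
            and event.get("statusCode") == "open"
            and event.get("service") in watched
        }
    )


# Hand-written sample events, for illustration only.
sample_events = [
    {"service": "EC2", "eventTypeCategory": "issue", "statusCode": "open"},
    {"service": "S3", "eventTypeCategory": "issue", "statusCode": "closed"},
    {"service": "RDS", "eventTypeCategory": "scheduledChange", "statusCode": "open"},
    {"service": "LAMBDA", "eventTypeCategory": "issue", "statusCode": "open"},
]
print(impacted_services(sample_events))  # ['EC2']
```

The result gives the DevOps Team member a short list to include when reporting the nature of the emergency to key personnel.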

10.2. Activation Phase

This phase includes the activities that initiate the Disaster Recovery Event. Team members are notified, assembled, and updated on the present situation.

10.2.1. Procedure

The DevOps Team member will contact the COO. If the COO is not reachable, the DevOps Team member will contact the CTO. If the CTO is not reachable, the DevOps Team member will contact the CEO.

  1. The Disaster Recovery Executive will contact the Director of Customer Service to brief him/her on the present situation.
  2. The Director of Customer Service will follow departmental procedures, contact the impacted customers, and provide them with information on the estimated time to recovery.
  3. The Disaster Recovery Team will make an assessment of the damage at the Production Site, estimated time to recovery, and priority of work.
  4. If required, DevOps will make contact with our service providers/vendors to assist and help assess the situation.
  5. The Disaster Recovery Executive will make a determination to declare the situation a Disaster Recovery Event.
  6. The Director of Customer Success will notify all customers about the Disaster Recovery Event.
  7. The Disaster Recovery Team begins the cutover to the Disaster Recovery site.

10.3. Recovery Phase

The recovery phase covers the steps taken to restore Resolver’s Platform at the Disaster Recovery Site.

The initial focus in bringing up the Disaster Recovery environment as Production is to ensure that the data is as current as possible and to determine how much data was lost between the most recent Production-to-Disaster Recovery data synchronization and the time Production was lost.
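
The data loss described above is the interval between the last successful synchronization and the moment Production was lost, and it can be compared against the one-hour RPO stated in section 12. The following is a minimal sketch; the function names and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=1)  # Resolver's stated RPO (section 12)


def data_loss_window(last_sync: datetime, production_lost: datetime) -> timedelta:
    """Time span of data written after the last sync and before the loss."""
    if production_lost < last_sync:
        raise ValueError("last_sync must not be later than production_lost")
    return production_lost - last_sync


def within_rpo(last_sync: datetime, production_lost: datetime,
               rpo: timedelta = RPO) -> bool:
    """True when the lost-data window does not exceed the RPO."""
    return data_loss_window(last_sync, production_lost) <= rpo


# Illustrative timestamps: last sync at 11:20, Production lost at 12:00.
lost_at = datetime(2020, 7, 1, 12, 0)
print(data_loss_window(datetime(2020, 7, 1, 11, 20), lost_at))  # 0:40:00
print(within_rpo(datetime(2020, 7, 1, 11, 20), lost_at))        # True
```

A window larger than the RPO is a finding worth recording in the post-incident review, since it indicates the synchronization schedule did not meet the objective.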

In parallel with the data recovery effort, DevOps will work to re-point customers and start pushing the new Domain Name entries across the Internet. Lastly, when the process is under control, DevOps will work to find a new site for Production if that is necessary.
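
Because Route 53 is listed among the services in use, re-pointing a customer-facing name could take the form of a record upsert. The sketch below only builds the change batch in the shape the Route 53 `ChangeResourceRecordSets` API expects; the record type, the TTL, and both hostnames are assumptions, and the actual cutover mechanism is not specified by this Plan.

```python
from typing import Any, Dict


def build_repoint_change(record_name: str, dr_target: str, ttl: int = 60) -> Dict[str, Any]:
    """Build a Route 53 change batch that points record_name at dr_target."""
    return {
        "Comment": "Re-point to Disaster Recovery site",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite it
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,  # a short TTL speeds up propagation
                    "ResourceRecords": [{"Value": dr_target}],
                },
            }
        ],
    }


# Hypothetical hostnames, for illustration only.
batch = build_repoint_change("app.example.com", "dr-lb.example.net")
```

With boto3, the batch could then be submitted through `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the hosted zone ID is environment-specific and deliberately left out here.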

It is important to have the engineering leads available throughout the entire Disaster Recovery process. Work to get Resolver’s Platform live will be performed around the clock.

10.3.1. Disaster Recovery Step by Step Procedure

  1. The Director of Customer Service will contact customers and inform them that Resolver is in the process of moving Resolver’s Platform to the Disaster Recovery Site.
  2. The DevOps Team member of the Disaster Recovery Team will apply all data updates from Resolver’s Platform to the Disaster Recovery environment.
  3. The DevOps Team member will bring the databases on-line in the Disaster Recovery environment.
  4. The DevOps Team member will verify that all servers are secure and up-to-date.
  5. The DevOps Team member will ensure the standard outage pages are in place.
  6. The Disaster Recovery Team will work with QA to conduct functionality tests to ensure the Disaster Recovery Site is operational.
  7. Production is cut over to the Disaster Recovery Site.
  8. The Disaster Recovery Executive will contact the Director of Customer Service and inform him/her that the Production Site has been successfully failed over to the Disaster Recovery Site.
  9. The Director of Customer Service will contact customers and inform them that Resolver has successfully moved Production to the Disaster Recovery Site.
  10. The Disaster Recovery Executive declares the disaster is over once Resolver’s Platform’s Production Site (original or alternate) is running at 100% operational capacity.
  11. If necessary, the Director of Customer Success will locate a replacement Production Hosted Data Center.
  12. The Risk Management Group will work with Resolver’s insurance vendor to recoup the funds to assist with paying for replacement equipment (if necessary).

11. Recovery time objective (RTO)

The RTO for Resolver’s AWS Hosted Production Environments should be no more than 15 minutes within the scope of the AWS Region.

12. Recovery point objective (RPO)

The RPO for Resolver’s AWS Hosted Production Environments is one (1) hour: potential data loss will not exceed one hour.

13. Disaster Recovery Annual Test Plan

The Disaster Recovery test recurs annually in January and incorporates a full mock failover up to, but not including, the point of redirecting customers.

14. Hosted Disaster Recovery Plan Maintenance Process

The Hosted Disaster Recovery Plan and Disaster Recovery environment will be modified in response to changes in Resolver’s Platform environment. Such changes might include personnel changes, critical application changes, and network, hardware, or software changes. Resolver’s Platform Hosted Disaster Recovery Plan is tested annually to ensure that Resolver has the appropriate environment to support Disaster Recovery at 100% capacity.

The Disaster Recovery test is designed to ensure that Disaster Recovery data is in sync with Production data and that the Disaster Recovery applications function the same as Production applications.

15. Validity and document management

This document is valid as of July 2020.

The owner of this document is the Information Security Analyst, who must check and, if necessary, update the document at least once a year.

When evaluating the effectiveness and adequacy of this document, the following criteria need to be considered:

  • The number of incidents related to taking a portable device outside the organization’s premises without authorization.
  • The number of incidents related to unauthorized access to the portable device outside the organization’s premises.

EFFECTIVE ON: September 2020
REVIEW CYCLE: At least annually and as needed
REVIEW, APPROVAL & CHANGE HISTORY: Last reviewed and approved in August 2020 by Resolver’s Information Technology Security team.