CivicActions Common Contingency Plan

Applicability

Note: This Contingency Plan applies only to systems for which CivicActions has negotiated and defined Incident Response/Contingency Plan (IRCP) operations. Each IRCP-managed system will have a tailored version of this Contingency Plan or, in some cases, a completely separate Contingency Plan developed for it. All CivicActions employees are aware of the procedures outlined herein.

Overview

This Contingency Plan provides baseline guidance for the CivicActions Team when managing the disruption, compromise, or failure of any component of a CivicActions IRCP-managed system, product, or service ("system"). As a general guideline, we consider "disruption" to mean unexpected downtime or significantly reduced service lasting longer than:

  • 30 minutes between 0900 and 2100 Eastern Time, Monday through Friday (standard U.S. business hours)
  • 90 minutes at other times
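
The two thresholds above can be expressed as a small check. This is a minimal sketch, not part of the plan itself: the function names are hypothetical, the outage start is assumed to be given as Eastern Time wall-clock time, the threshold is assumed to be chosen by when the outage starts, and the window is assumed to include 0900 and exclude 2100.

```python
from datetime import datetime, time, timedelta

# Standard U.S. business hours as defined in this plan (Eastern Time).
BUSINESS_START = time(9, 0)
BUSINESS_END = time(21, 0)


def disruption_threshold(outage_start_eastern: datetime) -> timedelta:
    """Return how long downtime may last before it counts as a disruption.

    Assumption: the caller passes the outage start as Eastern Time
    wall-clock time, and the threshold depends on the start time only.
    """
    is_weekday = outage_start_eastern.weekday() < 5  # Monday=0 .. Friday=4
    in_window = BUSINESS_START <= outage_start_eastern.time() < BUSINESS_END
    if is_weekday and in_window:
        return timedelta(minutes=30)
    return timedelta(minutes=90)


def is_disruption(outage_start_eastern: datetime, downtime: timedelta) -> bool:
    """True when downtime lasts longer than the applicable threshold."""
    return downtime > disruption_threshold(outage_start_eastern)
```

For example, `is_disruption(datetime(2024, 1, 10, 10, 0), timedelta(minutes=31))` evaluates to `True` (a Wednesday morning), while the same 31 minutes starting on a Saturday stays below the 90-minute threshold.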

Scenarios where that could happen include unexpected downtime of key services, system data loss, or improper privilege escalation. In the case of a security incident, the team uses the Security Incident Response Plan as well.

Some clients will create and maintain a Contingency Plan defining procedures specific to their system. In such a case, the client-specific Contingency Plan takes precedence.

Recovery objective

Short-term disruptions lasting less than 30 minutes are outside the scope of this plan.

More than 3 hours of any system being offline during standard U.S. business hours (0900 - 2100 Eastern Time) is considered unacceptable. Our objective is to recover from any significant problem (disruption, compromise, or failure) within that span of time.

Incident Response Team information

Contact information

Team contact information is available in Google Drive.

Contingency plan outline

Activation and notification

The first Incident Response Team member who notices or reports a potential contingency-plan-level problem becomes the Incident Commander (IC) until recovery efforts are complete or the Incident Commander role is explicitly reassigned.

If the problem is identified as part of a security incident response situation (or becomes a security incident response situation), the same Incident Commander (IC) should handle the overall situation, since these response processes must be coordinated.

The IC first notifies and coordinates with the people who are authorized to decide that the system is in a contingency plan situation:

  • From CivicActions:
    • Incident Commander
    • Project Manager
    • CivicActions Incident Response Team
  • From the customer:
    • Product Owner
    • Users, when applicable

The IC keeps a log of the situation in the #general Slack channel or within a client-specific Slack channel, JIRA ticket, or GitHub issue. If this is also a security incident, the IC also follows the security incident communications process. The IC should delegate assistant ICs for aspects of the situation as necessary.

Recovery

The Incident Response Team assesses the situation and works to recover the system. See the list of external dependencies for procedures for recovery from problems with external services.

If this is also a security incident, the IC also follows the security incident assessment and remediation processes.

If the IC assesses that the overall response process is likely to last longer than 3 hours, the IC should organize shifts so that each responder works on the response for no longer than 3 hours at a time. This includes the IC, who hands off the role to a new IC after 3 hours.
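
One way to lay out such shifts is sketched below. It is illustrative only: `plan_shifts` and the simple round-robin rotation are assumptions rather than a CivicActions procedure, and the responder names in the usage note are placeholders.

```python
from datetime import datetime, timedelta
from itertools import cycle

# Maximum time a responder works on the response at a stretch, per this plan.
MAX_SHIFT = timedelta(hours=3)


def plan_shifts(start: datetime, expected_duration: timedelta, responders: list):
    """Split the expected response window into shifts of at most 3 hours.

    Assumption: responders rotate in the order given (round robin).
    Returns a list of (responder, shift_start, shift_end) tuples.
    """
    if not responders:
        raise ValueError("at least one responder is required")
    if expected_duration > MAX_SHIFT and len(responders) < 2:
        # A single responder would end up working back-to-back shifts.
        raise ValueError("responses longer than 3 hours need two or more responders")

    end = start + expected_duration
    rotation = cycle(responders)
    shifts = []
    shift_start = start
    while shift_start < end:
        shift_end = min(shift_start + MAX_SHIFT, end)
        shifts.append((next(rotation), shift_start, shift_end))
        shift_start = shift_end
    return shifts
```

Calling `plan_shifts(datetime(2024, 1, 10, 9, 0), timedelta(hours=8), ["responder-a", "responder-b"])` yields three shifts: 09:00 to 12:00 for responder-a, 12:00 to 15:00 for responder-b, and 15:00 to 17:00 for responder-a again. The same rotation can be applied to the IC role.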

Reconstitution

The Incident Response Team tests and validates the system as operational.

The Incident Commander declares that recovery efforts are complete and notifies all relevant people. The last step is to schedule a postmortem to discuss the event. This is the same as the security incident retrospective process.

External dependencies

CivicActions managed systems often depend on several external services. In the event one or more of these services has a long-term disruption, the team will mitigate impact by following this plan. Zero or more of the following services may be involved:

GitHub

If GitHub becomes unavailable, DKAN/HDG will continue to operate in its current state. The disruption would only impact the team's ability to update code on the instances.

GitLab

If GitLab becomes unavailable, GlobalNET will continue to operate in its current state. The disruption would only impact the team's ability to update code on the instances.

StatusCake

If there is a disruption in the StatusCake service, the Incident Response Team will be notified by email.

OpsGenie

If there is a disruption in the OpsGenie service, all alerts are automatically delivered to the team via email.

JIRA

There is no direct impact to the platform if a disruption occurs. Primary incident communications will move to the #globalnet Slack channel.

Slack

There is no direct impact to the platform if a disruption occurs. Primary incident communications will move to one of (try in order):

CPM

The Cloud Protection Manager (CPM) provides backup and restore services. There is no direct impact to the platform if a disruption occurs.

AWS

If needed, you can manage and create new servers.

In case of a significant disruption, after receiving approval from our Authorizing Official, the CivicActions team will deploy a new instance of the entire system to a different region.

Acquia Cloud Enterprise (ACE) Platform as a Service (PaaS)

DKAN/HDG is hosted on the Acquia Cloud Enterprise (ACE) PaaS (https://cloud.acquia.com/app/develop), which is layered on top of the Amazon Web Services (AWS) FedRAMP-certified cloud in the us-east region. See ACE Status and AWS status.

Acquia Cloud takes hourly snapshots of EBS volumes and saves them to Amazon S3, which provides storage across geographically distributed data centers.

In case of a significant disruption, after receiving approval from our Authorizing Official, the CivicActions and Acquia teams will deploy a new instance of the entire system to a different region.

How this document works

This plan is most effective if all CivicActions team members know about it, remember that it exists, have the ongoing opportunity to give input based on their expertise, and keep it up to date.

  • The CivicActions team is responsible for maintaining this document and updating it as needed. Any change to it must be peer reviewed and approved by at least one other member of the team.
    • All changes to the plan should be communicated to the rest of the team.
    • At least once a year, and after major changes to our systems, we review and update the plan.
  • How we protect this plan from unauthorized modification:
    • This plan is stored in the CivicActions Handbook GitHub repository (https://github.com/CivicActions/handbook/tree/master/docs/09-security), with authorization to modify it limited to the Incident Response Team by GitHub access controls. CivicActions policy is that changes are proposed by making a pull request and asking another team member to review and merge it.