Guidance for Atlas Disaster Recovery
It is critical for enterprises to plan for disaster recovery. We strongly recommend that you prepare a comprehensive disaster recovery (DR) plan that includes elements such as:
Your designated recovery point objective (RPO)
Your designated recovery time objective (RTO)
Automated processes that facilitate alignment with these objectives
Use the recommendations on this page to prepare for and respond to disasters.
To learn more about proactive high availability configurations that can help with disaster recovery, see Recommended Configurations for High Availability and Recovery.
Features for Atlas Disaster Recovery
To learn about Atlas features that support disaster recovery, see the related pages in the Atlas Architecture Center.
Recommendations for Atlas Disaster Recovery
Use the following disaster recovery recommendations to create a DR plan for your organization. These recommendations provide information on steps to take in the event of a disaster.
It is imperative that you test the plans in this section regularly (ideally quarterly, but at least semi-annually). Frequent testing prepares the Enterprise Database Management (EDM) team to respond to disasters and keeps the instructions up to date.
Some disaster recovery testing might require actions that EDM users cannot perform. In these cases, open a support case to request artificial outages at least a week before you plan to run a test exercise.
This section covers the following disaster recovery procedures:
Single Node Outage
If a single node in your replica set fails due to a regional outage, your deployment should still be available, assuming you have followed best practices. If you read from secondaries, you might experience degraded performance because you have one fewer node to read from.
You can test a primary node outage in Atlas using the Atlas UI's Test Primary Failover feature or the Test Failover Atlas Administration API endpoint.
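If you want to script this test, the following is a minimal sketch of calling the Test Failover endpoint with curl. It assumes an Atlas Administration API key pair with sufficient project access; the project ID, cluster name, and versioned Accept header are placeholders or assumptions that you should verify against the current Test Failover API reference.

```shell
# Trigger a primary failover test on one cluster (hedged sketch).
# {PUBLIC-KEY}, {PRIVATE-KEY}, {PROJECT-ID}, and myCluster are placeholders.
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Accept: application/vnd.atlas.2023-02-01+json" \
  --request POST \
  "https://cloud.mongodb.com/api/atlas/v2/groups/{PROJECT-ID}/clusters/myCluster/restartPrimaries"
```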
Regional Outage
If a single-region or multi-region outage degrades the state of your cluster, follow these steps:
Determine which regions are unlikely to be affected by the current outage
Depending on the cause of the outage, other regions might also experience unscheduled outages in the near future. For example, if the outage was caused by a natural disaster on the east coast of the United States, avoid other east coast regions in case there are additional issues.
Add nodes to the regions you identified
Add the required number of nodes for a normal state across regions that are unlikely to be affected by the cause of the outage.
To reconfigure a replica set during an outage by adding regions or nodes, see Reconfigure a Replica Set During a Regional Outage.
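If you automate this reconfiguration, one option is the Update One Cluster endpoint of the Atlas Administration API. The sketch below is illustrative only: the replicationSpecs array must describe the complete desired topology, and the provider, region names, node counts, and instance size shown here are assumptions that you should replace with your own values and verify against the API reference.

```shell
# Add electable nodes in unaffected regions by updating the cluster
# configuration (hedged sketch; replicationSpecs describes the full topology).
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Accept: application/vnd.atlas.2023-02-01+json" \
  --header "Content-Type: application/json" \
  --request PATCH \
  "https://cloud.mongodb.com/api/atlas/v2/groups/{PROJECT-ID}/clusters/myCluster" \
  --data '{
    "replicationSpecs": [{
      "regionConfigs": [
        {
          "providerName": "AWS",
          "regionName": "US_WEST_2",
          "priority": 7,
          "electableSpecs": { "instanceSize": "M30", "nodeCount": 2 }
        },
        {
          "providerName": "AWS",
          "regionName": "US_EAST_2",
          "priority": 6,
          "electableSpecs": { "instanceSize": "M30", "nodeCount": 1 }
        }
      ]
    }]
  }'
```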
You can test a region outage in Atlas using the Atlas UI's Simulate Outage feature or the Start an Outage Simulation Atlas Administration API endpoint.
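You can also script this test. The following is a hedged sketch of starting an outage simulation through the Atlas Administration API; the cloud provider and region in the outage filter are assumptions, and you should verify the request body against the Start an Outage Simulation reference. When the test completes, end the simulation from the Atlas UI or the corresponding End an Outage Simulation endpoint.

```shell
# Simulate a regional outage for one cluster (hedged sketch).
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Accept: application/vnd.atlas.2023-02-01+json" \
  --header "Content-Type: application/json" \
  --request POST \
  "https://cloud.mongodb.com/api/atlas/v2/groups/{PROJECT-ID}/clusters/myCluster/outageSimulation" \
  --data '{
    "outageFilters": [
      { "cloudProvider": "AWS", "regionName": "US_EAST_1", "type": "REGION" }
    ]
  }'
```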
Cloud Provider Outage
In the highly unlikely event that an entire cloud provider is unavailable, follow these steps to bring your deployment back online:
Identify the alternative cloud provider you would like to deploy your new cluster on
For a list of cloud providers and related information, see Cloud Providers.
Find the most recent available snapshot taken of the cluster before the outage began
To learn how to view your backup snapshots, see View M10+ Backup Snapshots.
Restore the most recent snapshot from the previous step into the new cluster
To learn how to restore your snapshot, see Restore Your Cluster. A hedged API sketch of this and the previous step appears after these steps.
Switch any applications that connect to the old cluster to the newly-created cluster
To find the new connection string, see Connect via Drivers. Review your application stack, because you will likely need to redeploy it onto the new cloud provider.
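If you script the snapshot and restore steps above, the sketch below shows one way to list recent snapshots and then restore the chosen snapshot into a cluster you have already created on the alternative cloud provider. The cluster names, project IDs, and snapshot ID are placeholders; verify the endpoints and request body against the Cloud Backups API reference.

```shell
# 1. Find the most recent snapshot taken before the outage (hedged sketch).
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Accept: application/vnd.atlas.2023-02-01+json" \
  "https://cloud.mongodb.com/api/atlas/v2/groups/{PROJECT-ID}/clusters/oldCluster/backup/snapshots"

# 2. Restore that snapshot into the new cluster on the alternative provider.
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Accept: application/vnd.atlas.2023-02-01+json" \
  --header "Content-Type: application/json" \
  --request POST \
  "https://cloud.mongodb.com/api/atlas/v2/groups/{PROJECT-ID}/clusters/oldCluster/backup/restoreJobs" \
  --data '{
    "deliveryType": "automated",
    "snapshotId": "{SNAPSHOT-ID}",
    "targetClusterName": "newCluster",
    "targetGroupId": "{TARGET-PROJECT-ID}"
  }'
```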
Atlas Outage
In the highly unlikely event that the Atlas Control Plane and the Atlas UI are unavailable, your cluster is still available and accessible. To learn more, see Platform Reliability. Open a high-priority support ticket to investigate this further.
Resource Capacity Issues
Computational resource (such as disk space, RAM, or CPU) capacity issues can result from poor planning or unexpected database traffic. These issues are not necessarily the result of a disaster.
If a computational resource reaches the maximum allocated amount and causes a disaster, follow these steps:
Identify which computational resource is maxed out by using the Real Time Performance Panel or Atlas metrics
To view your resource utilization in the Atlas UI, see Monitor Real-Time Performance.
To view metrics with the Atlas Administration API, see Monitoring and Logs. A hedged sketch of checking metrics and scaling the cluster appears after these steps.
Allocate the necessary resources
Note that Atlas will perform this change in a rolling manner, so it should not have any major impact on your applications.
To learn how to allocate more resources, see Edit a Cluster.
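As a sketch of how you might script these two steps, the commands below first pull recent CPU measurements for one process through the Atlas Administration API and then scale the cluster tier with the Atlas CLI. The process hostname and port, measurement name, tier, and disk size are assumptions; confirm the available measurements and CLI flags (for example with atlas clusters update --help) before relying on them.

```shell
# 1. Check recent CPU utilization for one mongod process (hedged sketch).
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Accept: application/vnd.atlas.2023-02-01+json" \
  "https://cloud.mongodb.com/api/atlas/v2/groups/{PROJECT-ID}/processes/{HOSTNAME}:{PORT}/measurements?granularity=PT1M&period=PT1H&m=SYSTEM_NORMALIZED_CPU_USER"

# 2. Allocate more resources by moving to a larger tier and disk.
#    Atlas applies this change as a rolling operation.
atlas clusters update myCluster --projectId {PROJECT-ID} --tier M50 --diskSizeGB 200
```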
Resource Failure
Important
This is a temporary solution intended to shorten overall system downtime. Once the underlying issue resolves, merge the data from the newly-created cluster into the original cluster and point all applications back to the original cluster.
If a computational resource fails and causes your cluster to become unavailable, follow these steps:
Open a high-priority support ticket
Restore the most recent backup into a newly created cluster
To learn how to restore your snapshot, see Restore Your Cluster.
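The note and steps above assume that a replacement cluster exists. As a hedged sketch, you could create one with the Atlas CLI and then restore the latest backup into it using the same restore-job request shown in the Cloud Provider Outage section; the cluster name, provider, region, tier, and other flag values below are placeholders, and you should confirm the flags with atlas clusters create --help.

```shell
# Create a temporary replacement cluster (hedged sketch; values are placeholders).
atlas clusters create recoveryCluster \
  --projectId {PROJECT-ID} \
  --provider AWS \
  --region US_EAST_1 \
  --tier M30 \
  --members 3 \
  --mdbVersion 7.0

# Then restore the most recent snapshot into recoveryCluster, for example with an
# "automated" restore job as sketched in the Cloud Provider Outage section.
```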
Deletion of Production Data
Production data might be accidentally deleted due to human error or a bug in the application built on top of the database. If the cluster itself was accidentally deleted, Atlas might retain the volume temporarily.
If the contents of a collection or database have been deleted, follow these steps to restore your data:
Create a copy of the current state of the collection or database, if it contains any data
You can use mongoexport to create a copy, as shown in the sketch after these steps.
Restore your data
If the deletion occurred within the last 72 hours, and you configured continuous backup, use Point in Time (PIT) restore to restore from the point in time right before the deletion occurred.
If the deletion did not occur in the past 72 hours, restore the most recent backup from before the deletion occurred into the cluster.
To learn more, see Restore Your Cluster.
If you created a copy of your data, import the copy you exported
You can use mongoimport with upsert mode to import your data and ensure that any data that was modified or added is reflected properly in the collection or database.
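The following sketch shows the export and import steps above with mongoexport and mongoimport. The connection string, database, collection, and file names are placeholders; --mode=upsert makes mongoimport update existing documents that share an _id and insert the rest.

```shell
# 1. Before restoring, export the current state of the affected collection.
mongoexport \
  --uri="mongodb+srv://user:password@mycluster.example.mongodb.net/mydb" \
  --collection=orders \
  --out=orders-current.json

# 2. After the restore completes, merge the exported documents back in.
#    --mode=upsert replaces documents with matching _id values and inserts new ones.
mongoimport \
  --uri="mongodb+srv://user:password@mycluster.example.mongodb.net/mydb" \
  --collection=orders \
  --file=orders-current.json \
  --mode=upsert
```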
Driver Failure
If a driver fails, follow these steps:
Data Corruption
Important
This is a temporary solution intended to shorten overall system downtime. Once the underlying issue resolves, merge the data from the newly-created cluster into the original cluster and point all applications back to the original cluster.
If your underlying data becomes corrupted, follow these steps:
Open a high-priority support ticket
Restore the most recent backup into a newly created cluster
To learn how to restore your snapshot, see Restore Your Cluster.