Simulate Regional Outage
Note
This feature is not available for any of the following deployments:
Serverless instances
M0 clusters
M2/M5 clusters
Flex clusters
To learn more, see Limits.
You can use the Atlas UI and API to simulate an outage on your Atlas multi-region cluster and observe how your application handles an outage in one or more regions. You can also run multiple simulations. When you run multiple simulations, we recommend a five-minute interval between simulations.
Required Access
To start an outage simulation, you must have Organization Owner or Project Owner access to the project.
Simulate Regional Outage Process
When you submit a request to test an outage using the Atlas UI or API, Atlas simulates an outage event. During a simulated outage, Atlas:
Removes network connectivity to nodes in the selected regions.
Does not trigger a monitoring alert for "Replica set has no primary".
If your application takes more than 15 minutes to notice connection loss to some nodes, we recommend that you reduce your TCP retransmission timeout values. To learn more, see modify tcp_retries2 value.
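The 15-minute figure lines up with the Linux default of net.ipv4.tcp_retries2 = 15. As a rough sketch, the following estimates how long a connection to an unreachable node can linger for a given setting. It assumes the simplified model in the kernel documentation: the retransmission timeout starts at the 200 ms minimum and doubles up to a 120 s cap. Real connections derive the timeout from measured round-trip time, so actual values differ. The function and constant names are illustrative.

```python
# Rough estimate of how long Linux keeps retransmitting on an unresponsive
# TCP connection for a given net.ipv4.tcp_retries2 value.
# Assumption: the retransmission timeout (RTO) starts at the 200 ms minimum
# and doubles after every attempt until it reaches the 120 s maximum.

RTO_MIN_MS = 200
RTO_MAX_MS = 120_000


def estimated_timeout_ms(tcp_retries2: int) -> int:
    """Return the hypothetical time, in milliseconds, before the kernel gives up."""
    if tcp_retries2 < 0:
        raise ValueError("tcp_retries2 must be non-negative")
    total = 0
    # The original transmission plus each retransmission waits one RTO period.
    for attempt in range(tcp_retries2 + 1):
        total += min(RTO_MIN_MS * 2 ** attempt, RTO_MAX_MS)
    return total


if __name__ == "__main__":
    for retries in (15, 8, 5):
        seconds = estimated_timeout_ms(retries) / 1000
        print(f"tcp_retries2={retries}: about {seconds:.1f} s")
```

Under this model, the default of 15 gives about 924.6 seconds (roughly 15 minutes), while a value of 5 gives about 12.6 seconds.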
Simulate Regional Outage Using the Atlas UI
To simulate a regional outage in the Atlas UI:
In Atlas, go to the Clusters page for your project.
If it's not already displayed, select the organization that contains your desired project from the Organizations menu in the navigation bar.
If it's not already displayed, select your desired project from the Projects menu in the navigation bar.
If it's not already displayed, click Clusters in the sidebar.
The Clusters page displays.
Go to the Test Resilience modal.
For the cluster on which you want to perform outage testing, click the ... button.
Click Test Resilience.
Select Regional Outage. Atlas displays a Test Resilience modal with the steps Atlas takes to simulate an outage event. To learn more, see Simulate Regional Outage Process.
Begin the simulation.
Click Select Regions.
Select the tab corresponding to the type of outage you want to simulate:
For a minority outage, select fewer than half of your electable nodes.
For a majority outage, select a majority (more than half) of your electable nodes and keep at least one electable node remaining.
After you select a majority of your electable nodes, your replica set won't have a primary node. This means that your replica set can't perform write operations, or read operations that aren't configured with a suitable readPreference.
Select Simulate Regional Outage to begin the test.
Atlas notifies you when the outage occurs.
End the simulation.
Select the tab corresponding to the type of outage you are performing:
For a minority outage, when you finish testing the outage, click End Simulation.
For a majority outage, when you finish testing the regional outage, you can do one of the following:
Restore your multi-region cluster by adding electable nodes to your replica set. To learn more, see Reconfigure a Replica Set During a Regional Outage.
End the regional outage simulation by clicking End Simulation.
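The distinction between the two outage types in the region-selection step comes down to counting electable nodes. The sketch below shows one way to check, before starting a simulation, which type a given region selection produces. The function name, return values, and region names are hypothetical, and the per-region node counts are supplied by you rather than read from Atlas.

```python
def classify_outage(electable_by_region: dict[str, int],
                    selected_regions: set[str]) -> str:
    """Classify a region selection by the share of electable nodes it removes."""
    unknown = selected_regions - set(electable_by_region)
    if unknown:
        raise ValueError(f"unknown regions: {sorted(unknown)}")

    total = sum(electable_by_region.values())
    lost = sum(electable_by_region[region] for region in selected_regions)

    if lost == 0:
        return "none"
    if lost == total:
        # At least one electable node must remain outside the outage.
        return "invalid"
    if 2 * lost < total:
        return "minority"   # a majority survives, so a primary can be elected
    if 2 * lost > total:
        return "majority"   # the survivors cannot elect a primary
    # Exactly half: this page does not name the case; the surviving half
    # is not a majority, so it cannot elect a primary either.
    return "half"


if __name__ == "__main__":
    # Hypothetical cluster with 7 electable nodes across three regions.
    nodes = {"US_EAST_1": 3, "US_WEST_2": 2, "EU_WEST_1": 2}
    print(classify_outage(nodes, {"US_WEST_2"}))               # minority
    print(classify_outage(nodes, {"US_EAST_1", "US_WEST_2"}))  # majority
```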
Simulate Regional Outage Using the API
You can use the Test Outage API endpoint to simulate an outage event. To learn more about the outage process, see Simulate Regional Outage Process.
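As a hedged illustration, the sketch below builds, but does not send, requests to start and end a simulation. The URL path, the outageFilters body shape, and the versioned Accept header reflect our understanding of the Atlas Administration API v2 and are assumptions to confirm against the API reference. The project ID, cluster name, and helper names are placeholders, and authentication is left out.

```python
import json
import urllib.request

# Assumed base URL and versioned media type; confirm in the API reference.
ATLAS_BASE_URL = "https://cloud.mongodb.com/api/atlas/v2"
ACCEPT_HEADER = "application/vnd.atlas.2023-01-01+json"


def outage_url(group_id: str, cluster_name: str) -> str:
    """Build the (assumed) outage simulation URL for one cluster."""
    return (f"{ATLAS_BASE_URL}/groups/{group_id}"
            f"/clusters/{cluster_name}/outageSimulation")


def build_start_request(group_id: str, cluster_name: str,
                        regions: list[tuple[str, str]]) -> urllib.request.Request:
    """Prepare a POST that asks Atlas to simulate an outage in the given regions.

    `regions` holds (cloud provider, region name) pairs.
    """
    body = {
        "outageFilters": [
            {"cloudProvider": provider, "regionName": region, "type": "REGION"}
            for provider, region in regions
        ]
    }
    return urllib.request.Request(
        outage_url(group_id, cluster_name),
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Accept": ACCEPT_HEADER, "Content-Type": "application/json"},
    )


def build_end_request(group_id: str, cluster_name: str) -> urllib.request.Request:
    """Prepare a DELETE that ends the running simulation."""
    return urllib.request.Request(
        outage_url(group_id, cluster_name),
        method="DELETE",
        headers={"Accept": ACCEPT_HEADER},
    )


if __name__ == "__main__":
    # Placeholder project ID and cluster name.
    req = build_start_request("0123456789abcdef01234567", "Cluster0",
                              [("AWS", "US_EAST_1")])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To send a request, add whatever authentication your organization uses for the Atlas Administration API and pass the request to an HTTP client.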
Verify the Outage
To verify that the outage is successful, monitor your application and ensure your read and write operations are working as expected.
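One way to monitor operations during a simulation is a small probe that repeats an operation and records whether each attempt succeeds and how long it takes. The sketch below is driver-agnostic: operation is a placeholder for whatever read or write call your application makes, and the helper name and result layout are illustrative.

```python
import time
from typing import Callable


def probe(operation: Callable[[], object],
          attempts: int = 5,
          delay_s: float = 1.0) -> list[tuple[bool, float, str]]:
    """Run `operation` repeatedly and record (succeeded, seconds, error name)."""
    results = []
    for attempt in range(attempts):
        started = time.monotonic()
        try:
            operation()
            ok, error = True, ""
        except Exception as exc:  # record the failure instead of aborting the probe
            ok, error = False, type(exc).__name__
        results.append((ok, time.monotonic() - started, error))
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return results


if __name__ == "__main__":
    # Stand-in operation that always fails, in place of a real driver call.
    def fake_write():
        raise TimeoutError("no primary available")

    for ok, seconds, error in probe(fake_write, attempts=3, delay_s=0.1):
        print("ok" if ok else f"failed ({error})", f"{seconds:.3f}s")
```

Running one probe for reads and another for writes makes it easier to see, for example, that writes fail during a majority outage while reads with a suitable read preference continue.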
Troubleshoot Outage
A regional outage or regional outage simulation that affects the highest priority regions in a sharded cluster could cause the cluster to become inoperable for read operations. To restore the config servers, do the following:
Configure a read preference that allows read operations from secondary nodes.
Reconfigure the cluster to regain electable nodes.
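For the read preference step, one option is to set the readPreference connection string option. The sketch below adds or replaces that option in a connection string using only the standard library. The helper name and the example host are hypothetical, and secondaryPreferred is only one of the modes that permit reads from secondaries.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

VALID_MODES = {"primary", "primaryPreferred", "secondary",
               "secondaryPreferred", "nearest"}


def with_read_preference(uri: str, mode: str = "secondaryPreferred") -> str:
    """Return `uri` with its readPreference option set to `mode`."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported read preference mode: {mode}")
    parts = urlsplit(uri)
    # Drop any existing readPreference so the option appears only once.
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key.lower() != "readpreference"]
    params.append(("readPreference", mode))
    # Connection strings expect a "/" between the hosts and the options.
    path = parts.path or "/"
    return urlunsplit(parts._replace(path=path, query=urlencode(params)))


if __name__ == "__main__":
    # Hypothetical connection string.
    uri = "mongodb+srv://user:pw@cluster0.example.mongodb.net/test?retryWrites=true"
    print(with_read_preference(uri))
```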