Replica Set Elections
Replica sets use elections to determine which set member will become primary. Replica sets can trigger an election in response to a variety of events, such as:
adding a new node to the replica set,
performing replica set maintenance using methods such as rs.stepDown() or rs.reconfig(), and
the secondary members losing connectivity to the primary for more than the configured timeout (10 seconds by default).
For example, when the primary node is unavailable for longer than the configured timeout, the automatic failover process is triggered. One of the remaining secondaries calls for an election to select a new primary, and the replica set automatically resumes normal operations.
The replica set cannot process write operations until the election completes successfully. The replica set can continue to serve read queries if such queries are configured to run on secondaries.
The median time before a cluster elects a new primary should not typically exceed 12 seconds, assuming default replica configuration settings. This includes the time required to mark the primary as unavailable and to call and complete an election.
You can tune this time period by modifying the settings.electionTimeoutMillis replication configuration option. Factors such as network latency may extend the time required for replica set elections to complete, which in turn affects the amount of time your cluster may operate without a primary. These factors depend on your particular cluster architecture.
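As a minimal sketch of tuning this option, the helper below returns a copy of a replica set configuration document with a different election timeout. The function name withElectionTimeout and the value 5000 are illustrative assumptions; rs.conf() and rs.reconfig() are the mongosh helpers you would pair it with, shown here only in a comment.

```javascript
// Sketch: return a copy of a replica set configuration document with a new
// settings.electionTimeoutMillis value. withElectionTimeout is a hypothetical
// helper, not part of mongosh.
function withElectionTimeout(config, timeoutMillis) {
  if (!Number.isInteger(timeoutMillis) || timeoutMillis <= 0) {
    throw new RangeError("timeoutMillis must be a positive integer");
  }
  return {
    ...config,
    // Preserve any other settings already present in the configuration.
    settings: { ...(config.settings || {}), electionTimeoutMillis: timeoutMillis },
  };
}

// Possible usage in mongosh, connected to the primary (5000 is an example value):
//   rs.reconfig(withElectionTimeout(rs.conf(), 5000))
```

A lower timeout detects a failed primary sooner but may also trigger elections during brief network hiccups, so the right value depends on your deployment.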
Your application connection logic should include tolerance for automatic failovers and the subsequent elections. MongoDB drivers can detect the loss of the primary and automatically retry certain write operations a single time, providing additional built-in handling of automatic failovers and elections:
Compatible drivers enable retryable writes by default.
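One way to make this tolerance explicit is in the connection string. The sketch below assembles a replica set URI that lists several seed hosts and sets retryWrites explicitly. The helper buildReplicaSetUri and the host names are hypothetical; replicaSet, retryWrites, and serverSelectionTimeoutMS are standard connection string options, and the values shown are examples rather than recommendations.

```javascript
// Sketch: assemble a MongoDB connection string for a replica set.
// buildReplicaSetUri and the example.net host names are hypothetical.
function buildReplicaSetUri(hosts, replicaSetName, options = {}) {
  const params = new URLSearchParams({
    replicaSet: replicaSetName,
    retryWrites: "true", // stated explicitly, even where the driver defaults to it
    ...options,
  });
  // Listing several members lets the driver find the set even if one is down.
  return `mongodb://${hosts.join(",")}/?${params.toString()}`;
}

const uri = buildReplicaSetUri(
  ["db0.example.net:27017", "db1.example.net:27017", "db2.example.net:27017"],
  "rs0",
  // How long the driver waits for a suitable server, for example during an election.
  { serverSelectionTimeoutMS: "30000" }
);
```

Because only certain write operations are retried, and only once, application code may still need its own error handling for operations that fail while no primary is available.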
Factors and Conditions that Affect Elections
Replication Election Protocol
Replication protocolVersion: 1 reduces replica set failover time and accelerates the detection of multiple simultaneous primaries.
You can use catchUpTimeoutMillis to prioritize between faster failovers and preservation of w:1 writes.
For more information on pv1, see Self-Managed Replica Set Protocol Version.
Heartbeats
Replica set members send heartbeats (pings) to each other every two seconds. If a heartbeat does not return within 10 seconds, the other members mark the delinquent member as inaccessible.
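The timeout rule can be illustrated with a simplified model. The sketch below is not MongoDB's implementation; it only shows the bookkeeping implied above: given the time each member last answered a heartbeat, report the members whose last response is at least 10 seconds old. The function and variable names are assumptions made for illustration.

```javascript
// Simplified model of the heartbeat timeout described above.
const HEARTBEAT_TIMEOUT_MS = 10 * 1000; // 10 seconds, per the text

// lastResponseAt maps a member's host to the timestamp (in ms) of its most
// recent heartbeat response; now is the current timestamp in ms.
function inaccessibleMembers(lastResponseAt, now) {
  return Object.entries(lastResponseAt)
    .filter(([, respondedAt]) => now - respondedAt >= HEARTBEAT_TIMEOUT_MS)
    .map(([host]) => host);
}
```

For example, with `{ "db1:27017": 0, "db2:27017": 9000 }` and `now = 12000`, only db1:27017 has been silent long enough to be marked inaccessible.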
Member Priority
After a replica set has a stable primary, the election algorithm will make a "best-effort" attempt to have the secondary with the highest priority available call an election.
Member priority affects both the timing and the outcome of elections: secondaries with higher priority call elections relatively sooner than secondaries with lower priority, and are also more likely to win. However, a lower priority instance can be elected as primary for brief periods, even if a higher priority secondary is available. Replica set members continue to call elections until the highest priority member available becomes primary.
Members with a priority value of 0 cannot become primary and do not seek election. For details, see Priority 0 Replica Set Members.
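Priority is set per member in the replica set configuration. As a hedged sketch, the helper below returns a copy of a configuration document with one member's priority changed. The name withPriority and the host names are hypothetical, and the 0 to 1000 range check reflects the documented limits for members[n].priority as best understood here; the mongosh call is shown only as a comment.

```javascript
// Sketch: return a copy of a replica set configuration with a new priority
// for the member identified by host. withPriority is a hypothetical helper.
function withPriority(config, host, priority) {
  if (typeof priority !== "number" || priority < 0 || priority > 1000) {
    throw new RangeError("priority must be a number between 0 and 1000");
  }
  if (!config.members.some((member) => member.host === host)) {
    throw new Error(`no member with host ${host}`);
  }
  return {
    ...config,
    members: config.members.map((member) =>
      member.host === host ? { ...member, priority } : member
    ),
  };
}

// Possible usage in mongosh, connected to the primary:
//   rs.reconfig(withPriority(rs.conf(), "db1.example.net:27017", 2))
```

Setting the priority to 0 with the same helper would make that member ineligible to become primary.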
Mirrored Reads
MongoDB provides mirrored reads to pre-warm electable secondary members' cache with the most recently accessed data. With mirrored reads, the primary can mirror a subset of operations that it receives and send them to a subset of electable secondaries. Pre-warming the cache of a secondary can help restore performance more quickly after an election.
For details, see Mirrored Reads.
Loss of a Data Center
With a distributed replica set, the loss of a data center may affect the ability of the remaining members in the other data center or data centers to elect a primary.
If possible, distribute the replica set members across data centers to maximize the likelihood that even with a loss of a data center, one of the remaining replica set members can become the new primary.
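To check a planned distribution, you can count whether the members left after losing each data center still form a majority of the voting members. The sketch below is a back-of-the-envelope calculation under the assumption that every counted member holds one vote; the function names and data center labels are illustrative.

```javascript
// Number of votes needed to elect a primary among totalVotes voting members.
function majorityOf(totalVotes) {
  return Math.floor(totalVotes / 2) + 1;
}

// votesByDataCenter maps a data center label to its number of voting members.
// Returns, per data center, whether the rest of the set could still elect a
// primary if that data center were lost.
function survivesLossOf(votesByDataCenter) {
  const total = Object.values(votesByDataCenter).reduce((sum, n) => sum + n, 0);
  const needed = majorityOf(total);
  return Object.fromEntries(
    Object.entries(votesByDataCenter).map(([dc, votes]) => [dc, total - votes >= needed])
  );
}
```

With five voting members split `{ east: 3, west: 2 }`, losing east leaves two votes out of a required three, so no primary can be elected. A split of `{ east: 2, west: 2, central: 1 }` tolerates the loss of any single data center.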
Network Partition
A network partition may segregate a primary into a partition with a minority of nodes. When the primary detects that it can only see a minority of nodes in the replica set, the primary steps down as primary and becomes a secondary. Independently, a member in the partition that can communicate with a majority of the nodes (including itself) holds an election to become the new primary.
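The behavior described above can be modeled in a few lines. This sketch is a simplified illustration rather than MongoDB's election code: it assumes each listed member has one vote, and the function name and host labels are hypothetical.

```javascript
// partitions is an array of arrays of member names, one array per side of the
// network partition; currentPrimary is the name of the primary before the split.
function partitionOutcome(partitions, currentPrimary) {
  const total = partitions.reduce((count, side) => count + side.length, 0);
  // A side holds a majority only if it contains more than half of all members.
  const majoritySide = partitions.find((side) => side.length > total / 2) || null;
  const primarySide = partitions.find((side) => side.includes(currentPrimary));
  return {
    // The primary steps down unless it sits on the majority side.
    primaryStepsDown: primarySide !== majoritySide,
    // Members on this side can elect a primary; null if no side has a majority.
    majoritySide,
  };
}
```

For a five-member set split into `["a", "b"]` and `["c", "d", "e"]` with primary `a`, the primary steps down and the three-member side can hold an election. An even split of a four-member set leaves no side with a majority, so no primary can be elected until connectivity returns.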
Voting Members
The replica set member configuration setting members[n].votes and member state determine whether a member votes in an election.
All replica set members that have their members[n].votes setting equal to 1 vote in elections. To exclude a member from voting in an election, change the value of the member's members[n].votes configuration to 0.
Only voting members in the following states are eligible to vote:
Non-Voting Members
Although non-voting members do not vote in elections, these members hold copies of the replica set's data and can accept read operations from client applications.
Because a replica set can have up to 50 members, but only 7 voting members, non-voting members allow a replica set to have more than seven members.
Non-voting members (that is, members whose votes is 0) must have a priority of 0.
For instance, a nine-member replica set can have seven voting members and two non-voting members.
A non-voting member has both votes and priority equal to 0:
{
   "_id" : <num>,
   "host" : <hostname:port>,
   "arbiterOnly" : false,
   "buildIndexes" : true,
   "hidden" : false,
   "priority" : 0,
   "tags" : { },
   "secondaryDelaySecs" : NumberLong(0),
   "votes" : 0
}
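The rules in this section can be checked mechanically before a reconfiguration. The sketch below is illustrative only: asNonVoting and votingProblems are hypothetical helpers, and they assume that a member without explicit votes or priority fields uses the default value of 1 for each.

```javascript
const MAX_MEMBERS = 50;       // limit stated above
const MAX_VOTING_MEMBERS = 7; // limit stated above

// Return a copy of a member document configured as non-voting.
function asNonVoting(member) {
  return { ...member, votes: 0, priority: 0 };
}

// Return a list of human-readable problems found in a configuration document.
function votingProblems(config) {
  const votesOf = (m) => (m.votes === undefined ? 1 : m.votes);
  const priorityOf = (m) => (m.priority === undefined ? 1 : m.priority);
  const problems = [];
  const votingCount = config.members.filter((m) => votesOf(m) !== 0).length;
  if (config.members.length > MAX_MEMBERS) {
    problems.push(`more than ${MAX_MEMBERS} members`);
  }
  if (votingCount > MAX_VOTING_MEMBERS) {
    problems.push(`more than ${MAX_VOTING_MEMBERS} voting members`);
  }
  for (const m of config.members) {
    if (votesOf(m) === 0 && priorityOf(m) !== 0) {
      problems.push(`${m.host}: non-voting member must have priority 0`);
    }
  }
  return problems;
}
```

In mongosh, one possible workflow is to apply asNonVoting to the relevant entries of rs.conf().members, confirm that votingProblems returns an empty list, and only then pass the result to rs.reconfig().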
Important
Do not alter the number of votes to control which members will become primary. Instead, modify the members[n].priority option. Only alter the number of votes in exceptional cases, for example, to permit more than seven members.
To configure a non-voting member, see Configure a Non-Voting Self-Managed Replica Set Member.