The concept of a distributed database was introduced to handle large volumes of data across machines. Vertical scaling, i.e., adding more capacity to existing resources, was limited because the capacity of a single machine can only be increased so far. Meanwhile, data was growing at a faster pace than traditional systems could store and retrieve efficiently.
Thus, sharding was born!
Sharding involves distributing data across numerous machines. MongoDB employs sharding to handle deployments involving extensive data volumes and high-performance operations.
Horizontal scaling, also known as scale-out, refers to adding machines to share the data set and load. Horizontal scaling allows for near-limitless scaling to handle big data and intense workloads.
In MongoDB, sharding is useful when no single server can handle large, modern-day workloads. It allows you to scale the entire database horizontally. With digitization, there is an ever-growing demand to capture data, which puts increased load on the database. Earlier systems increased their capabilities by adding more capacity to existing resources, i.e., vertical scaling. However, all resources (memory, disk, CPU) have a saturation point, after which they cannot take any more requests; response times increase, impacting system stability and performance. Adding further capacity also adds cost and increases the time needed to create and restore backups.
Sharding follows a shared-nothing architecture, a type of distributed computing architecture in which none of the nodes share resources with each other. A sharded cluster splits data into multiple shards (or partitions), with each shard being a replica set, i.e., having a primary and one or more secondary nodes for data redundancy and availability. Partitions can be added or removed depending on the workload, and each shard is responsible for a subset of the overall workload.
In MongoDB, a sharded cluster consists of shards, routers (mongos instances), and config servers:
A shard is a replica set that contains a subset of the cluster's data. Once sharding is enabled, which typically happens when a single MongoDB instance can no longer handle the dataset, MongoDB splits the desired collections across multiple shards to achieve horizontal scaling. Each shard has a primary and one or more secondary nodes; the primary handles the writes and replicates them to the secondaries.
The mongos instance acts as a query router for client applications, handling both read and write operations. It dispatches client requests to the relevant shards and aggregates the results from the shards into a consistent client response. Clients connect to a mongos, not to individual shards.
A config server replica set consists of multiple MongoDB replica set members. They are the authoritative source of sharding metadata. The sharding metadata reflects the state and organization of the sharded data. The metadata contains the list of sharded collections, routing information, etc.
A sharded cluster will look like this:
MongoDB sharding allows horizontal scaling of your database to handle increased loads to a nearly unlimited degree by increasing read/write throughput and storage capacity. Let's look at each of those in a little more detail.
You can take advantage of parallelism by distributing the data set across multiple shards. Let's say one shard can process 1,000 operations per second. For each additional shard, you would gain an additional 1,000 operations per second in throughput.
Sharding requires the use of replica sets. A replica set enhances availability by utilizing multiple servers. For example, a dataset D is divided into three subsets (shards) distributed across servers S1, S2, and S3. To ensure high data availability and redundancy, each shard has one or more replicas: R1, R2, and R3.
Similarly, by increasing the number of shards, you can also increase overall total storage capacity. Let's say one shard can hold 4TB of data. Each additional shard would increase your total storage by 4TB. This allows near-infinite storage capacity.
Zone sharding allows you to easily create distributed databases to support geographically distributed apps, with policies enforcing data residency within specific regions. Each zone can have one or more shards.
MongoDB shards at the collection level. You choose which collection(s) you want to shard. MongoDB uses the shard key to distribute a collection's documents across shards. MongoDB splits the data into chunks by dividing the span of shard key values into non-overlapping ranges. By default, the maximum size of each chunk is 128MB. MongoDB then attempts to distribute those chunks evenly among the shards in the cluster.
Here, x is the shard key field whose value determines how the documents are split into ranges: for example, values of x from the minimum key value (e.g., -100) up to -75 form Chunk 1, values from -75 up to 25 form Chunk 2, and so on.
Shard keys are based on fields inside each document. The values in those fields decide which shard a document will reside on, according to the shard key ranges of the chunks. This mapping is stored in the config server replica set.
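If you want to see this in practice, you can connect to a mongos with mongosh and run the sh.status() helper; it reports the shards, the sharded collections, their shard keys, and the chunk distribution (the exact output varies by MongoDB version):
// From mongosh connected to a mongos: show shards, shard keys, and chunk distribution
sh.status()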
The shard key has a direct impact on the cluster's performance and should be chosen carefully. A poorly chosen shard key can lead to performance or scaling issues due to uneven chunk distribution. You can always change your data distribution strategy by changing your shard key. Use the documentation to choose the best shard key for you.
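As a hedged illustration, if you are running MongoDB 5.0 or later, a shard key change can be performed with the reshardCollection helper. The database, collection, and field names below are hypothetical placeholders:
// Hypothetical example: reshard mydb.orders onto a new compound shard key (MongoDB 5.0+)
sh.reshardCollection("mydb.orders", { customerId: 1, orderDate: 1 })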
A background process known as the balancer automatically migrates chunks across the shards to keep the amount of data on each shard even. The balancer runs on the primary node of the config server replica set. If the data in a particular shard reaches a specified threshold, the balancer migrates chunks between the shards. The balancer also addresses the imbalance caused when a shard is added to or removed from the cluster.
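You normally don't need to manage the balancer yourself, but mongosh provides helpers to check and control it from a mongos; a minimal sketch:
sh.getBalancerState()    // true if the balancer is enabled
sh.isBalancerRunning()   // true if a balancing round is currently in progress
sh.stopBalancer()        // disable balancing, e.g., ahead of a maintenance window
sh.startBalancer()       // re-enable balancing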
A sharding strategy is the approach a distributed database like MongoDB uses to partition data. The strategy to use can depend on query patterns, application use cases, and data distribution patterns. For example, zone sharding is better suited to cases where data needs to be divided based on the geographical location of customers.
MongoDB supports the following sharding strategies for distributing data across sharded clusters:
Ranged sharding divides data into ranges based on the shard key values. Each chunk is then assigned a range based on the shard key values.
Shard key values that are “close” to each other are more likely to reside in the same chunk. This allows for targeted operations, as a mongos can route the operations to only the shards that contain the required data.
The diagram in the shard keys section above is an example of ranged sharding.
Hashed sharding involves computing a hash of the shard key field's value. Each chunk is then assigned a range based on the hashed shard key values.
While shard key values may be “close,” their hashed values are unlikely to fall in the same chunk. Distributing data based on hashed values facilitates more even data distribution, especially in data sets where the shard key changes monotonically. However, hashed sharding does not provide efficient range-based operations.
Zone sharding organizes data into different zones, depending on the application requirements. For example, you may want to store the data of all European users together, so you can create a zone containing the data of all European customers. Each zone can be associated with one or more shards, and each shard can hold data for one or more zones.
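As a rough sketch of how this looks in mongosh (the shard name, zone name, database, collection, and key bounds below are hypothetical, and assume the collection is sharded on a country field):
// Tag a shard with a hypothetical "EU" zone
sh.addShardToZone("shard-eu-rs", "EU")
// Route shard key values in [ "DE", "FR" ) to the "EU" zone (lower bound inclusive, upper bound exclusive)
sh.updateZoneKeyRange("mydb.users", { country: "DE" }, { country: "FR" }, "EU")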
The easiest, most convenient, and most cost-effective way to deploy and manage a sharded cluster is via a database-as-a-service like MongoDB Atlas, which simplifies sharded cluster implementation.
To enable sharding, you need to turn it on from your MongoDB Atlas cluster:
If you want to do it yourself, follow these instructions. Make sure to replace anything between < and > with the relevant values for your setup. We will set up a config server replica set and one or more shard replica sets, each with three nodes, as well as a mongos server.
Let's get started.
The config server replica set can have up to 50 members (mongod processes), with two exceptions: no arbiters and no zero-priority members. Each member must be started with the --configsvr option. For example:
mongod --configsvr --replSet <configReplSetName> --dbpath <path> --port 27019 --bind_ip localhost,<hostname(s)|ip address(es)>
From there, connect to just one of the replica set members:
mongosh --host <hostname> --port 27019
And run rs.initiate() on just one of the replica set members:
rs.initiate(
{
_id: "<configReplSetName>",
configsvr: true,
members: [
{ _id : 0, host : "<cfg1.example.net:27019>" },
{ _id : 1, host : "<cfg2.example.net:27019>" },
{ _id : 2, host : "<cfg3.example.net:27019>" }
]
}
)
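Before moving on, you can optionally confirm that the replica set initialized correctly:
rs.status()   // one member should report itself as PRIMARY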
With the config server replica set up and running, we can create the shards. Visit the official MongoDB documentation for more details on replica set configuration.
As mentioned before, each shard is a replica set in and of itself. This process will be similar to the config servers but using the --shardsvr option. Make sure to use a different replica set name for each shard.
mongod --shardsvr --replSet <shardReplicaSetName> --dbpath <path> --port 27018 --bind_ip <cluster hostname(s)|ip address(es)>
From there, connect to just one of the replica set members:
mongosh --host <hostname> --port 27018
And run rs.initiate() on just one of the replica set members. Make sure you leave out the --configsvr option:
rs.initiate(
{
_id: "<shardReplicaSetNamereplSetnamereplSetName>",
members: [
{ _id : 0, host : "<shard-host1.example.net:2701827019>" },
{ _id : 1, host : "<shard-host2.example.net:2701827019>" },
{ _id : 2, host : "<shard-host3.example.net:2701827019>" }
]
}
)
Finally, set up the mongos and point it at your config servers replica set:
mongos --configdb <configReplSetName>/<cfg1.example.net:27019>,<cfg2.example.net:27019>,<cfg3.example.net:27019> --bind_ip localhost,<cluster hostname(s)|ip address(es)>
In a production environment, deploy more than one mongos to avoid a single point of failure or a bottleneck; starting with at least three mongos instances is a good practice.
Connect to your mongos:
mongosh --host <hostname> --port <port>
And add your shards to the cluster. Do this once for each shard:
sh.addShard( "<shardReplicaSetName>/<shard-host1.example.net:27018>,<shard-host2.example.net:27018>,<shard-host3.example.net:27018>")
Enable sharding on your database:
sh.enableSharding("<database>")
Finally, shard your collection using the sh.shardCollection() method. You can do this either via hashed sharding, which will spread your data evenly across shards…
sh.shardCollection("<database>.<collection>", { <shard key field> : "hashed" , ... } )
...or via range-based sharding, which allows you to optimize distribution across shards based on the shard key values. For certain sets of data, this will make queries across ranges of data more efficient. The command is as follows:
sh.shardCollection("<database>.<collection>", { <shard key field> : 1, ... } )
And that's it! You've now set up your first sharded cluster. Any application interaction from now on should be done via the routers (mongos instances) only.
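To verify that data is being distributed as expected, you can inspect the cluster from a mongos; for example (the database and collection names are hypothetical):
sh.status()   // lists shards, sharded collections, and chunk distribution
db.getSiblingDB("mydb").orders.getShardDistribution()   // per-shard data size and document counts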
Check out the sharding manual or sharding methods in the MongoDB documentation.
As noted above, shard keys are based on fields inside each document. The shard key must be backed by an index; if the collection is empty and the required index is missing, sh.shardCollection() builds it automatically as part of the sharding command. The shard key has a direct impact on cluster performance, and a poor choice can create bottlenecks for your application's access patterns. Use the documentation to choose the best shard key for you.
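If the collection already contains data, create the supporting index yourself before sharding; a minimal sketch with hypothetical names:
// Hypothetical example: build the index backing the shard key, then shard the collection
db.orders.createIndex({ customerId: 1 })
sh.shardCollection("mydb.orders", { customerId: 1 })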