Hello,
at my company we have a sharded MongoDB collection that has reached 120 million records. Sadly and painfully, the chosen shard key has turned out to be really bad: it suffers from the scatter/gather pattern and is monotonically increasing, so we experience a terrible hot shard.
I have been asked to try and fix this. After some reading and watching resources on MongoDB sharding, I came up with a possible solution, but I am not sure whether it would work.
Our data model is something like this:
{
  "containerID": <int>,
  "uniqueID": <string>,
  "creationTimestamp": <timestamp>,
  "freeText": <string>,
  "user": <string>,
  "dataset": <int>
}
Let me give as much information as possible about the current collection:
- it has 120 million records;
- size is 1 terabyte;
- each document weighs between 2 and 200 KB, but I estimate most of them are close to 10-20 KB;
- containerID and dataset are monotonically increasing external keys;
- uniqueID is a randomly generated id.
As for our query patterns, we have the following use cases:
- querying containerID with a list of values, e.g. [1, 2, 200]. This field is a relational key into another database that groups our MongoDB documents together (essentially a foreign key);
- querying a single containerID value, sorting by creationTimestamp and paginating. In my experience this is the most frequent query;
- querying by dataset (it has a one-to-many relationship with containerID: one dataset has multiple associated containerIDs), sorting by creationTimestamp and paginating.
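To make the patterns above concrete, here is roughly what they look like as filter/sort specifications (plain Python dicts in pymongo style; the collection contents, values, and page size are made up for illustration):

```python
# 1) list of containerIDs (this is a scatter/gather query today)
filter_by_container_list = {"containerID": {"$in": [1, 2, 200]}}

# 2) single containerID, sorted and paginated (our most frequent query)
filter_single_container = {"containerID": 42}
sort_spec = [("creationTimestamp", -1)]
page = {"skip": 0, "limit": 50}  # illustrative page size

# 3) all documents of a dataset, sorted and paginated
filter_by_dataset = {"dataset": 7}
```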
I investigated the containerID field and discovered that each key has between 1 and 10,000 documents. The average is around 200-300, I think, but the distribution is not completely even.
My proposed solution would be to add a field to the data model, hashedContainerID, populated by my application code with a hash of containerID, and to use {hashedContainerID, creationTimestamp} as the shard key.
This way I should avoid the monotonically increasing key problem (because containerID is hashed), and by including creationTimestamp in the shard key I should also avoid jumbo chunks, because MongoDB can still split the documents of a single containerID further.
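Here is a rough sketch of the application-side hashing I have in mind (the function name and the choice of MD5 are just illustrative assumptions on my part; any deterministic hash would do):

```python
import hashlib

def hashed_container_id(container_id: int) -> int:
    """Deterministic hash of containerID (illustrative; any stable hash works)."""
    digest = hashlib.md5(str(container_id).encode("utf-8")).digest()
    # Take the first 8 bytes as a signed 64-bit integer, similar in spirit
    # to the values a hashed index produces.
    return int.from_bytes(digest[:8], "big", signed=True)

# Every document would get the extra field before insert, e.g.:
doc = {
    "containerID": 42,
    "uniqueID": "abc-123",
    "hashedContainerID": hashed_container_id(42),
}
```

The important property is only that the same containerID always maps to the same hashedContainerID, so the single-containerID queries can still be routed to one shard.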
- Is there any pitfall in this that I am not seeing?
- Can this scale well, giving good write performance while also avoiding the scatter/gather pattern, especially for the queries using containerID?
Thank you for reading this far!