Hello,
at my company we have a sharded MongoDB collection that has reached 120 million records. Sadly and painfully, the chosen shard key has turned out to be really bad: it suffers from the scatter/gather pattern and is monotonically increasing, so we experience a terrible hot shard.
I have been asked to try and fix this. After some reading and watching resources on MongoDB sharding, I came up with a possible solution, but I am not sure whether it would work.
Our data model is something like this:
{
  "containerID": <int>,
  "uniqueID": <string>,
  "creationTimestamp": <timestamp>,
  "freeText": <string>,
  "user": <string>,
  "dataset": <int>
}
Let me give as much information as possible about the current collection:
- it has 120 million records;
- size is 1 terabyte;
- each document weighs between 2 and 200 KB, but I estimate most of them are close to 10-20 KB;
- containerID and dataset are monotonically increasing external keys;
- uniqueID is a randomly generated id.
As for our query patterns, we have the following use cases:
- querying containerID with a list of values, e.g. [1, 2, 200]. This field is a relational key into another database that groups our MongoDB documents together (essentially a foreign key);
- querying a single containerID value, sorting by creationTimestamp and paginating. In my experience this is the most frequent query;
- querying by dataset (it has a one-to-many relationship with containerID: one dataset has multiple associated containerIDs), sorting by creationTimestamp and paginating.
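To make the patterns above concrete, here is roughly what they look like as filter/sort specifications (plain Python dicts in pymongo style; the collection contents, values, and page size are made up for illustration):

```python
# 1) list of containerIDs (this is a scatter/gather query today)
filter_by_container_list = {"containerID": {"$in": [1, 2, 200]}}

# 2) single containerID, sorted and paginated (our most frequent query)
filter_single_container = {"containerID": 42}
sort_spec = [("creationTimestamp", -1)]
page = {"skip": 0, "limit": 50}  # illustrative page size

# 3) all documents of a dataset, sorted and paginated
filter_by_dataset = {"dataset": 7}
```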
I investigated the containerID field and discovered that each key has between 1 and 10,000 documents. The average is around 200-300, I think, but the distribution is not completely even.
My proposed solution would be to add a field to the data model, hashedContainerID, populated by my application code with a hash of containerID, and to use {hashedContainerID, creationTimestamp} as the shard key.
This way I should avoid the monotonically increasing key problem (because containerID is hashed), and by including creationTimestamp in the shard key I should also avoid jumbo chunks, because MongoDB can still split the documents of a single containerID further.
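Here is a rough sketch of the application-side hashing I have in mind (the function name and the choice of MD5 are just illustrative assumptions on my part; any deterministic hash would do):

```python
import hashlib

def hashed_container_id(container_id: int) -> int:
    """Deterministic hash of containerID (illustrative; any stable hash works)."""
    digest = hashlib.md5(str(container_id).encode("utf-8")).digest()
    # Take the first 8 bytes as a signed 64-bit integer, similar in spirit
    # to the values a hashed index produces.
    return int.from_bytes(digest[:8], "big", signed=True)

# Every document would get the extra field before insert, e.g.:
doc = {
    "containerID": 42,
    "uniqueID": "abc-123",
    "hashedContainerID": hashed_container_id(42),
}
```

The important property is only that the same containerID always maps to the same hashedContainerID, so the single-containerID queries can still be routed to one shard.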
- Is there any pitfall in this that I am not seeing?
- Can this scale well, giving good write performance while also avoiding the scatter/gather pattern, especially for the queries using containerID?
Thank you for reading this far!