I have collections of roughly 11 to 70 million documents that are read from secondaries. The aggregation pipeline's $match filters on the current date and an "hour" field, backed by a compound index covering all of the date fields in that $match, to retrieve the previous hour's set of data. However, when I look at how the query actually arrives in mongo, there is a predefined $match inserted before everything else:
{
    "$match": {
        "_id": {
            "$lt": "(ObjectId of the previous hour)"
        }
    }
},
My aggregation pipeline only runs after this $match, which produces slow, long-running queries because the server has to read every document written before the previous hour, which can be up to 70 million documents at times.
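For illustration, the effective pipeline the server ends up running looks roughly like the sketch below, written as a Python list of stages ("date" and "hour" are placeholder names standing in for my actual indexed fields):

effective_pipeline = [
    # Stage injected by the connector: partition bounds on the default field (_id)
    {"$match": {"_id": {"$lt": "(ObjectId of the previous hour)"}}},
    # My own $match only applies after the partition filter above
    {"$match": {"date": 20231001, "hour": 13}},
]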
Is this due to the Spark connector's default partitioning field being _id? And if I were to change the partition field, would it have to be a unique field, or could I use a field that is already in use and indexed, such as a date field with values like 202310?
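For reference, here is a minimal sketch of how I understand the partition field could be overridden, assuming the v10.x connector's option names (the URI, database, collection, and "date" field are placeholders; the 3.x connector uses different keys such as spark.mongodb.input.partitioner and partitionKey):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-read").getOrCreate()

df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://host:27017/mydb")
    .option("database", "mydb")
    .option("collection", "events")
    # Default partitioner: builds partitions by sampling the partition field
    .option(
        "partitioner",
        "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner",
    )
    # Override the default partition field (_id) with a placeholder field name;
    # whether this field must be unique is exactly what I am asking about
    .option("partitioner.options.partition.field", "date")
    .load()
)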