I have collections of roughly 11 to 70 million documents that are read from secondaries. The aggregation pipeline's $match filters on the current date and an "hour" field, backed by a compound index covering all of the date fields in that $match, to retrieve the previous hour's set of data. However, when I look at how the query actually arrives in mongo, there is a predefined $match inserted before everything else:
{
    "$match": {
        "_id": {
            "$lt": "(ObjectId of the previous hour)"
        }
    }
},
My aggregation pipeline only runs after this $match, which produces slow, long-running queries because the server has to read every document written before the previous hour, which can be up to 70 million documents at times.
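For illustration, the effective pipeline the server ends up running looks roughly like the sketch below, written as a Python list of stages ("date" and "hour" are placeholder names standing in for my actual indexed fields):

effective_pipeline = [
    # Stage injected by the connector: partition bounds on the default field (_id)
    {"$match": {"_id": {"$lt": "(ObjectId of the previous hour)"}}},
    # My own $match only applies after the partition filter above
    {"$match": {"date": 20231001, "hour": 13}},
]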
Is this due to the Spark connector's default partitioning field being _id? And if I were to change the partition field, would it have to be a unique field, or could I use a field that is already in use and indexed, such as a date field with values like 202310?
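For reference, here is a minimal sketch of how I understand the partition field could be overridden, assuming the v10.x connector's option names (the URI, database, collection, and "date" field are placeholders; the 3.x connector uses different keys such as spark.mongodb.input.partitioner and partitionKey):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-read").getOrCreate()

df = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb://host:27017/mydb")
    .option("database", "mydb")
    .option("collection", "events")
    # Default partitioner: builds partitions by sampling the partition field
    .option(
        "partitioner",
        "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner",
    )
    # Override the default partition field (_id) with a placeholder field name;
    # whether this field must be unique is exactly what I am asking about
    .option("partitioner.options.partition.field", "date")
    .load()
)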