I have a collection that is expected to grow to 50M records. Right now I'm doing a PoC with 3M records in which I aggregate using just a $group, and it takes 20 seconds. How can I make it faster? Note: I have an index, but MongoDB ends up doing a COLLSCAN.
You may try to $project source:1 and _id:0 before the $group.
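For illustration, a minimal sketch of that suggestion, assuming a collection named `events` and a simple count per `source` (the actual pipeline is not shown in the thread):

```js
// Keep only the field we group on before $group, so less data flows through the
// pipeline and a covered plan on an index over { source: 1 } becomes possible in principle.
db.events.aggregate([
  { $project: { _id: 0, source: 1 } },                 // drop _id, keep only source
  { $group: { _id: "$source", count: { $sum: 1 } } }   // count documents per source
])
```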
Please do not post the same or almost the same question twice. It slows us all down, as we have to read duplicate or almost-duplicate posts. An almost-duplicate is even worse, because we have to analyze the difference and try to determine which one is the correct one.
Regarding duplication, I actually found “working with data” to be a better category, so I duplicated the post there. I wanted to delete the original post but couldn’t find any option to do so.
I have tried a few approaches. Let me know if I missed any other methods to solve the problem.
Approach 1:
I tried using the $project stage, but it didn’t really bring down the response time. The pipeline below takes 2.30 minutes for 7 million documents.
Note: The explain plan shows the winning plan as { stage: 'COLLSCAN', direction: 'forward' }. I was expecting the stage to be PROJECTION_COVERED.
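For reference, this is roughly how the winning plan can be inspected (collection name and pipeline are assumptions, matching the sketch above); stage names such as COLLSCAN or PROJECTION_COVERED show up under `queryPlanner.winningPlan`:

```js
// Assumed pipeline; the real one is not shown in the thread.
const pipeline = [
  { $project: { _id: 0, source: 1 } },
  { $group: { _id: "$source", count: { $sum: 1 } } }
];

// Inspect which plan the optimizer chose for the aggregation.
db.events.explain("queryPlanner").aggregate(pipeline);
```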
Approach 2:
I then tried using OLAP cubes with on-demand materialized views, which gave great results, but they have their own limitations regarding dimensionality and filtering.
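As an illustration of the on-demand materialized view idea (collection and output names are assumptions): pre-aggregate into a small summary collection with $merge, then query the summary instead of the raw data.

```js
// Refresh step, run periodically or on demand: aggregate the raw collection
// and upsert the results into a small summary collection.
db.events.aggregate([
  { $group: { _id: "$source", count: { $sum: 1 } } },
  { $merge: {
      into: "events_by_source",
      on: "_id",
      whenMatched: "replace",
      whenNotMatched: "insert"
  } }
]);

// Read step: the "cube" is tiny, so reads stay fast regardless of raw collection size.
db.events_by_source.find();
```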
Approach 3:
I considered columnar indexes, but they don’t seem to be fully available yet, so I couldn’t try them out.
Add a $match stage that filters on source in a way that still selects all documents. That might force the engine to use the index.
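One way such a predicate might look (this is a sketch, not a guaranteed fix; whether the planner actually switches to the { source: 1 } index depends on the server version and the data, so verify with explain()):

```js
db.events.aggregate([
  // Intended to be true for every document, including null source values;
  // double-check edge cases such as documents missing the source field.
  { $match: { source: { $gte: MinKey } } },
  { $project: { _id: 0, source: 1 } },
  { $group: { _id: "$source", count: { $sum: 1 } } }
]);
```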
Do the $group work in a $lookup
One thing to know about $group is that it is a blocking stage. In some cases, I got better results by doing nothing in the $group except the _id part (which is effectively a distinct). Then I $lookup into the same collection with localField: _id, foreignField: source, and an aggregation pipeline with a $count stage.
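A minimal sketch of that pattern, assuming a collection named `events` (the localField/foreignField plus pipeline form of $lookup requires MongoDB 5.0+):

```js
db.events.aggregate([
  // Step 1: a blocking but cheap $group that only produces the distinct source values.
  { $group: { _id: "$source" } },
  // Step 2: for each distinct value, look back into the same collection and just
  // count the matching documents, instead of accumulating inside the $group itself.
  { $lookup: {
      from: "events",
      localField: "_id",
      foreignField: "source",
      pipeline: [ { $count: "count" } ],   // count matches per distinct value
      as: "counts"
  } },
  // Flatten the single-element counts array into a plain field.
  { $set: { count: { $first: "$counts.count" } } },
  { $unset: "counts" }
]);
```

The hope here is that the equality match inside the $lookup can be served by the index on source, so the expensive counting work is no longer done in a single blocking $group over the whole collection.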