Aggregating 50M records

I have a collection that is expected to grow to 50M records. Right now I'm doing a PoC with 3M records in which I aggregate using just $group, and it takes 20 seconds. How can I make it faster? Note: I have an index, but MongoDB ends up doing a COLLSCAN.

Here is the query:

[
  {
    $group: {
      _id: "$source",
      sum: {
        $sum: 1
      }
    }
  }
]

You may try adding a $project stage with source: 1 and _id: 0 before the $group.

Please do not post the same or almost the same question twice. It slows us all down, as we have to read duplicate or near-duplicate posts. A near duplicate is even worse, as we have to analyze the differences and try to determine which version is the correct one.

Thanks for the reply, Steeve.

Regarding duplication, I actually found “working with data” to be a better category, so I duplicated the post there. I wanted to delete the original post but couldn’t find any option to do so.

I have tried a few approaches. Let me know if I missed any other methods to solve the problem.

Approach 1:

I tried using the $project stage, but it didn't really bring down the response time. The pipeline below takes 2 minutes 30 seconds for 7 million documents.

Note: The explain plan shows the winning plan as { stage: 'COLLSCAN', direction: 'forward' }. I was expecting the stage to be PROJECTION_COVERED.

[
  {
    $project:
      {
        source: 1,
        _id: 0
      }
  },
  {
    $group:
      {
        _id: "$source",
        count: {
          $sum: 1
        }
      }
  }
]
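To see which plan the optimizer chooses, the pipeline can be run through explain. A minimal mongosh sketch, assuming the collection is named events and an index { source: 1 } exists (both names are assumptions, not from the original post):

```javascript
// Create the single-field index first (if not already present):
// db.events.createIndex({ source: 1 })

db.events.explain("executionStats").aggregate([
  { $project: { source: 1, _id: 0 } },
  { $group: { _id: "$source", count: { $sum: 1 } } }
])
// Inspect queryPlanner.winningPlan: a covered plan shows
// PROJECTION_COVERED over an IXSCAN instead of COLLSCAN.
```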

Approach 2:

I then tried using OLAP cubes with on-demand materialized views, which gave great results, but it has its own limitations regarding dimensionality and filtering.
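For reference, an on-demand materialized view can be built with a $merge stage at the end of the aggregation. A minimal sketch, assuming hypothetical collection names events (source data) and events_by_source (the rollup):

```javascript
// Re-run periodically (e.g. from a scheduled job) to refresh the rollup.
db.events.aggregate([
  { $group: { _id: "$source", count: { $sum: 1 } } },
  {
    $merge: {
      into: "events_by_source",
      on: "_id",
      whenMatched: "replace",
      whenNotMatched: "insert"
    }
  }
])
// Queries then read the small pre-aggregated collection
// instead of scanning millions of documents:
// db.events_by_source.find()
```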

Approach 3:

I considered columnar indexes, but they don’t seem to be fully available yet, so I couldn’t try them out.

For now, I’m going with Approach 2.

Next thing to try is to provide an index hint.
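A hint is passed as an option to aggregate(). A sketch, assuming the hypothetical collection events and an index on { source: 1 }:

```javascript
db.events.aggregate(
  [
    { $group: { _id: "$source", count: { $sum: 1 } } }
  ],
  { hint: { source: 1 } } // force the { source: 1 } index
)
```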

Other hacks to try:

Artificial $match

Add a $match stage that filters on source in a way that still selects all documents. That might force the engine to use the index.
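A sketch of such an always-true $match, again assuming the hypothetical collection events:

```javascript
db.events.aggregate([
  // Mentions the indexed field but matches every document,
  // which may nudge the planner toward the { source: 1 } index.
  { $match: { source: { $gt: MinKey } } },
  { $group: { _id: "$source", count: { $sum: 1 } } }
])
```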

Do the $group work in a $lookup

One thing to know about $group is that it is blocking. In some cases, I got better results by doing nothing in the $group except the _id part (which is effectively a distinct). Then I $lookup into the same collection with localField: _id, foreignField: source, and an aggregation pipeline with a $count stage.
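A sketch of that idea, assuming the hypothetical collection events. Combining localField/foreignField with a pipeline in $lookup requires MongoDB 5.0+:

```javascript
db.events.aggregate([
  // 1. Distinct-like $group: _id only, no accumulators.
  { $group: { _id: "$source" } },
  // 2. Count matching documents per source via a self-$lookup.
  {
    $lookup: {
      from: "events",
      localField: "_id",
      foreignField: "source",
      pipeline: [{ $count: "n" }],
      as: "counts"
    }
  },
  { $project: { count: { $first: "$counts.n" } } }
])
```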

I forced the engine to use the index with a hint, but still wasn't able to optimize the group by.

The second hack isn't working either, at least not in my case.

Anyway, my use case is satisfied by Approach 2.
Thanks for the ideas, Steeve.
