Improving Query Performance for Conditional Summation in groupby

Kartik_Paul · August 8, 2024, 5:41am

I am running the following query on a dataset with one billion records. The query becomes significantly slower when I add a sort by userId. The execution time increases drastically. Can anyone guide me on how to optimize this query for better performance?

[  
    {
        "$group": {
            "_id": "$userId",
            "totalMonthly": { $sum: { $cond: [{ "$eq": ["$subscriptionType", "month"] }, 1, 0] } },
            "subscriptionAmt": { $sum: "$subscriptionAmt" },
            "subscriptionId": {
                "$first": "$subscriptionId"
            }
        }
    },
    {
        "$sort": { "totalMonthly": -1 }
    },
    {
        "$limit": 5
    }
]

steevej · August 9, 2024, 2:00pm

Why do you $sort then?

If you must sort make sure you have the appropriate index.

Kartik_Paul · August 9, 2024, 2:35pm

We have set the index and used sorting, but it has not improved the grouping performance, and it’s still taking a long time. Do you have any suggestions for making this pipeline faster?

steevej · August 13, 2024, 6:35pm

What works sometimes is to $group with _id only and do any computation in a $lookup pipeline.

Something like:

The theory, is that the first empty $group uses a lot less memory. While the computations are the same in the inner $group it is performed on a smaller set of documents and can be sent to the next stage faster.

But sorting on a computed value like totalMonthly will always be kind of slow since no index can be used. Sometimes the computed pattern is your best friend.

steevej · August 20, 2024, 12:38pm

Hi @Kartik_Paul, any followup on this last proposal?

Kartik_Paul · August 20, 2024, 2:14pm

We reviewed your last solution, but it takes longer than the existing query.

steevej · August 20, 2024, 2:18pm

Thanks for the feedback.