Slow query on 10 million records when applying a grouping aggregation

I have a MongoDB collection where I want to apply an aggregation that includes matching, grouping, and sorting.
I have around 10 million records, and I am testing on my local environment using MongoDB Compass.

My issue is that my query takes around 7 seconds to execute, when I need it to be under 2 seconds.

This is an example object from my collection



{
  "_id": { "$oid": "6627c4314c11d4072fef0c37" },
  "manager_id": 4,
  "client_id": 1058,
  "agent_id": 1924,
  "source": "home_page",
  "property_id": 292498,
  "property_reference": "054",
  "property_personal_reference": "054",
  "property_category_id": 2,
  "property_type_id": 1,
  "property_price_type_id": 4,
  "property_price": 1500,
  "property_community_id": 969,
  "property_district_id": 28,
  "property_province_id": 2,
  "property_bedroom_id": 3,
  "property_bathroom_id": 4,
  "property_area": 160,
  "property_boosted": false,
  "property_listing_level": "standard",
  "created_at": { "$date": "2023-09-11T15:45:56.000+00:00" }
}

And this is my aggregation

[
  {
    $match: {
      manager_id: 4,
      created_at: {
        $gte: ISODate("2022-08-11"),
        $lte: ISODate("2023-08-25")
      }
    }
  },
  {
    $group: {
      _id: {
        property_id: "$property_id",
        client_id: "$client_id",
        agent_id: "$agent_id"
      },
      impressions_count: {
        $sum: 1
      },
      property_category_id: {
        $first: "$property_category_id"
      },
      property_community_id: {
        $first: "$property_community_id"
      },
      property_district_id: {
        $first: "$property_district_id"
      },
      property_province_id: {
        $first: "$property_province_id"
      }
    }
  },
  {
    $project: {
      _id: 0,
      property_id: "$_id.property_id",
      client_id: "$_id.client_id",
      agent_id: "$_id.agent_id",
      impressions_count: 1,
      property_category_id: 1,
      property_community_id: 1,
      property_district_id: 1,
      property_province_id: 1
    }
  },
  {
    $sort: {
      impressions_count: 1,
      property_id: 1
    }
  },
  {
    $limit: 10
  }
]


I've created an index to cover this query, and it is actually covering the query. This is the explain result of the query:

(Screenshot: explain plan output)


I need help, please!

Sorry, this is the explain result screenshot.

$group is a blocking stage, as in it needs to wait for every document to pass through before it can output your groups. The more data you can filter out in the $match stage, the better your performance will be.

However, you can see in your explain plan that the IXSCAN (index scan) phase is taking the longest. It is also looking at 3.4M keys but only returning 10,000 documents. Make sure that you have an index which matches your query (the $match stage) and keep it as small as possible.
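
For illustration, a minimal index matching that $match stage could look like the following sketch (the collection name impressions is an assumption; use your actual collection name):

// Hypothetical sketch: compound index with the equality field first,
// then the range field, matching the $match on manager_id + created_at.
db.impressions.createIndex({ manager_id: 1, created_at: 1 })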

This is my updated aggregation:

[
  {
    $match: {
      manager_id: 4,
      day_start: {
        $gte: ISODate("2021-03-20"),
        $lte: ISODate("2025-12-25"),
      },
    },
  },
  {
    $group: {
      _id: {
        property_id: "$property_id",
        client_id: "$client_id",
        agent_id: "$agent_id",
      },
      impressions_count: {
        $sum: 1,
      },
      property_community_id: {
        $first: "$property_community_id",
      },
    },
  },
  {
    $sort: {
      impressions_count: 1,
    },
  },
  {
    $project: {
      _id: 0,
      property_id: "$_id.property_id",
      client_id: "$_id.client_id",
      agent_id: "$_id.agent_id",
      impressions_count: 1,
      property_community_id: 1,
    },
  },
]

And this is my improved explain result (including the index that I used). The query performance has improved a little, but we still have 3740444 index keys examined.
I think my issue is with the date index. How should I create my index in this case to cover only what I need?

Check the output of the following:

countDocuments({
  manager_id: 4,
  day_start: {
    $gte: ISODate("2021-03-20"),
    $lte: ISODate("2025-12-25"),
  },
})

If this is 3740444, then that is just how many documents match the query. Your index is just helping the database locate those documents more quickly.

To improve performance you would need to be more selective in your query.
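
For example (purely illustrative, not taken from the thread), adding another equality filter on a field your documents already have, or narrowing the date range, reduces how many index keys the scan has to examine:

// Hypothetical sketch of a more selective $match: an extra equality
// filter on source plus a narrower day_start window.
{
  $match: {
    manager_id: 4,
    source: "home_page",
    day_start: {
      $gte: ISODate("2023-01-01"),
      $lte: ISODate("2023-03-31"),
    },
  },
}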


Your advised method has returned 3740444, so my index is working fine.
I can't add anything else to my match query. Maybe restructuring my collection schema could help?

This is a document from my collection

{
  "_id": {
    "$oid": "662624769fdf3e902efdc912"
  },
  "manager_id": 4,
  "client_id": 826,
  "agent_id": 2388,
  "source": "home_page",
  "property_id": 306351,
  "property_reference": "elegant-2932883",
  "property_personal_reference": "elegant-2932883",
  "property_category_id": 1,
  "property_type_id": 1,
  "property_price_type_id": 1,
  "property_price": 850000,
  "property_community_id": 831,
  "property_district_id": 28,
  "property_province_id": 2,
  "property_bedroom_id": 5,
  "property_bathroom_id": 5,
  "property_area": 400,
  "property_boosted": false,
  "property_listing_level": "standard",
  "created_at": {
    "$date": "2022-10-10T00:00:00.000Z"
  },
  "day_start": {
    "$date": "2022-10-10T00:00:00.000Z"
  }
}

One way to help performance is to reduce the size of the working set.

In your case, improving the schema will somewhat reduce the working set. Your schema is verbose, and since field names are stored with the documents, reducing the length of the field names will help. Rather than having a list of fields all prefixed with property_, you could have a more elegant schema with a top-level object named property that groups together all your property_ fields, like the sketch below.
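
A minimal sketch of what that nested document could look like, reconstructed from your example document above (the shortened field names are illustrative):

{
  "_id": { "$oid": "662624769fdf3e902efdc912" },
  "manager_id": 4,
  "client_id": 826,
  "agent_id": 2388,
  "source": "home_page",
  "property": {
    "id": 306351,
    "reference": "elegant-2932883",
    "personal_reference": "elegant-2932883",
    "category_id": 1,
    "type_id": 1,
    "price_type_id": 1,
    "price": 850000,
    "community_id": 831,
    "district_id": 28,
    "province_id": 2,
    "bedroom_id": 5,
    "bathroom_id": 5,
    "area": 400,
    "boosted": false,
    "listing_level": "standard"
  },
  "created_at": { "$date": "2022-10-10T00:00:00.000Z" },
  "day_start": { "$date": "2022-10-10T00:00:00.000Z" }
}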

This schema reduces your document size by 120 bytes. See some discussions about other advantages you get with this.

As mentioned by Peter, $group is blocking. You may reduce the blocking time by doing some of the $group work later. In your case, you could delay $sum and $first by moving these operations into a later $lookup. The pipeline could look like:

let match = { /* same $match as you have */ }

let group = { $group: {
  _id: {
    property: "$property_id",
    client: "$client_id",
    agent: "$agent_id",
  },
} }
/* The memory taken by $group will be smaller and it will execute faster. */

const lookup_count = { $lookup: {
  from: "the_same_collection",   // the collection you are aggregating on
  let: { group_id: "$_id" },
  pipeline: [
    { $match: { /* use $$group_id against property_id, client_id, agent_id */ } },
    { $count: "count" },
  ],
  as: "count",
} }

let sort = { $sort: { /* like you have */ } }
/* The sort is done in memory because it is on a computed value; by not having
   property_community_id in the sorted documents, the working set is reduced. */

let lookup_property = { /* a simple $lookup just to get the property_community_id */ }

Another idea I got is based on the assumption (because you do $first on community_id) that each triplet has a unique community_id. You could start with a $group to get all the distinct community_id values, then do a $lookup to get the original $group triplets.
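
A rough sketch of that idea, under the assumption that the collection is named impressions and that the original $match filters are repeated inside the $lookup (all names here are illustrative):

[
  { $match: {
      manager_id: 4,
      day_start: { $gte: ISODate("2021-03-20"), $lte: ISODate("2025-12-25") },
  } },
  // First pass: only the distinct community ids survive this stage.
  { $group: { _id: "$property_community_id" } },
  // Second pass: for each community, regroup its (property, client, agent) triplets.
  { $lookup: {
      from: "impressions",
      let: { community: "$_id" },
      pipeline: [
        { $match: {
            manager_id: 4,
            day_start: { $gte: ISODate("2021-03-20"), $lte: ISODate("2025-12-25") },
            $expr: { $eq: ["$property_community_id", "$$community"] },
        } },
        { $group: {
            _id: { property_id: "$property_id", client_id: "$client_id", agent_id: "$agent_id" },
            impressions_count: { $sum: 1 },
        } },
      ],
      as: "triplets",
  } },
]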

I remember a discussion, in which @chris and @Asya_Kamsky were perhaps participating, where this idea of $group followed by $lookup was used to do the $group accumulator work. All this to reduce the working set of $group and to reduce the blocking time.

@Chady_Badour, it has been more than a week since I provided some things to think about for your issue. I would appreciate some follow-up. Even writing that you do not like my ideas and the reasons why is better than leaving the thread to die off.

Thanks

@steevej Sorry for that. I tried all of your proposed solutions, but nothing worked. I have started to implement and try a sharding solution where I split my data by date (specifically by year).
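
If that means MongoDB sharding (rather than manually splitting the data into per-year collections), the setup might look roughly like the sketch below; the database and collection names are assumptions:

// Hypothetical sketch: ranged sharding on the date field so documents
// are distributed by day_start (and therefore, roughly, by year).
sh.enableSharding("analytics")
sh.shardCollection("analytics.impressions", { day_start: 1 })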


Our previous implementation was done on MySQL, and due to the large amount of data being added it was suggested that we move to MongoDB. We have lost a lot of time with MongoDB, which I think is not a solution for this type or size of data; MySQL was honestly faster at this type of querying.