Order of $sort and $limit aggregation

Aniket_Jha · February 1, 2022, 12:33pm

Does the order of $skip and $limit matter in the pipeline?

I use mongoose aggregation which runs a pipeline of operations. With skip and limit appended at the end.

   projectModel.aggregate(pipeline)
  .sort({ updatedAt: -1 })
  .skip(skip)
  .limit(limit)

My pipeline simply looks like

$match(userId) > $lookup(html_collection) > $lookup(records_collection) > $sort(on updatedAt from mongoose) > $skip(mongoose) > $limit(mongoose)

What I observed during pagination is that the $limit is being respected but the $skip is happening only at end of the pipeline. For example:

Page1: skip = 0, limit = 10

The number of documents as per the .explain() clearing $match stage is 10.

Page 2: skip = 10, limit = 10

The number of documents clearing $match stage is 20 (skip + limit), and 20 documents made it to the next stage. $lookup is made on 20 documents. Slowing down our pipeline and it is at the last stage when $skip works to discard the first 10 documents. Here we wasted work on the first 10 documents.

This was causing an issue in our pipeline and it was getting slow for pagination.

SOLUTION: What we ended up doing was to move $limit followed by $skip after $match.
then $limit = skip + limit, $skip = skip. We thought that limiting documents as limit = skip + limit will fetch documents and $skip at the next stage would reject unnecessary documents, thereby giving $lookup stages only the expected documents.

Page 1: skip = 0, limit = 10
$limit = 0 + 10 = 10 followed by $skip = 0

Page 2: skip = 10, limit = 10
$limit = 10 + 10 followed by $skip = 10

our pipeline now looks like:

$match(userId) > $sort(updatedAt) > $limit(limit + skip) > $skip (skip) > $lookup(html_collection) > $lookup(records_collection)

Here is the sample collection scehema for your refernce:

    PROJECT COLLECTION     
    {
     id: ObjectId,
     records: [ObjectId],
     userId: ObjectId
    }
     
    RECORDS COLLECTION
     
    {
     id: ObjectId,
     text: string
    }
     
    HTML COLLECTION
     
    {
     id: ObjectId,
     html: string,
     projectId: ObjectId
    }

QUESTIONS:

Is this behavior intended or something is wrong here with the $skip and $limit?
If it is not a bug then why did $limit was respected even if it was present at end of stage and $skip was not respected?
Is the solution that we came up with is correct? as I think for last page there are too many documents clearing the $match stage, but that is also something that MongoDB internally did right…as seen on page 2 of in our case?

steevej · February 1, 2022, 1:42pm

Note that your sort, skip and limit in

are done after the whole pipeline is completed.

If you want us to comment on your pipeline, please post your pipeline in the standard pipeline language rather than your own notation like:

santimir · February 1, 2022, 1:48pm

Just replyig your first half of the post

Not 100% sure, but I believe they are doing it like that to get user input, which could be done different anyways, as a function myPipeline(skip, limit) could retrieve the full pipeline as an aggregation, etc.

But after all that is converted into a single query, and they are seeing the top-k / coalescence sorting, from the pipeline optimization.

Note that sort + limit are normally merged together, and if there is a skip, is between limit and sort, sort will done also using skip+limit.

{ $sort: { age : -1 } },
{ $skip: 10 },
{ $limit: 5 }

{
   "$sort" : {
      "sortKey" : {
         "age" : -1
      },
      "limit" : NumberLong(15)
   }
},
{
   "$skip" : NumberLong(10)
}

With all this said, you may be able to move skip to the first stages such that it occurs before the $lookup…

steevej · February 1, 2022, 1:52pm

Quite possible as they are using the mongoose abstraction (obstruction as I like to say) layer.

santimir · February 1, 2022, 2:00pm

Second @steevej that including something we can test will help, maybe straight in mongoplayground.

Note that the solution you found - which is correct imho - is irrelevant of where $limit is (as lon as after $sort), and it will coalesce into $sort as shown in my prev. comment.

But you need $skip right where you put it last, in an aggregation pipeline, before the $lookup.

So, as you wrote, this

$match(userId) > $sort(updatedAt) > $limit(limit + skip) > $skip (skip) > $lookup(html_collection) > $lookup(records_collection)

should be great.

Aniket_Jha · February 2, 2022, 12:29pm

Thanks @santimir and @steevej for your response. The link was helpful. I’ve put simple pipeline on mongoplayground it is not the exact pipeline. I think I have got answer to questions.
Thanks

santimir · February 2, 2022, 1:31pm

@Aniket_Jha thanks for including an example.

Hopefully I understand this right…I’ll give you my current opinion.

The pipeline (in the $skip:10 $limit:20 case) isn’t doing a $lookup for 30 docs, but for 10. The reason as detailed before is the pipeline optimizer (added links at the bottom of the post).

To put it briefly, MongoDB optimizes / ‘logically rearranges’ stages, and it will ( hopefully ) get the same output. Sort, skip and limit are moved to the first stage by the optimizer:

 "stages": [
    {
      "$cursor": {
        "queryPlanner": {
          "parsedQuery": {
            "userId": {
              "$eq": 1
            }
          },
          "rejectedPlans": [],
          "winningPlan": {
            "inputStage": {
              "inputStage": {
                "direction": "forward",
                "filter": {
                  "userId": {
                    "$eq": 1
                  }
                },
                "stage": "COLLSCAN"
              },
              "limitAmount": 30,
              "memLimit": 104857600,
              "sortPattern": {
                "updatedAt": -1
              },
              "stage": "SORT",
              "type": "simple"
            },
            "skipAmount": 10,
            "stage": "SKIP"
          }
        }
      }
    },
    {
      "$lookup": {
        "as": "content",
        "foreignField": "_id",
        "from": "html",
        "localField": "docId"
      }
    },
    {
      "$lookup": {
        "as": "resources",
        "foreignField": "_id",
        "from": "resources",
        "localField": "resources"
      }
    }
  ]

If you need this query to be blazing fast, the $match field should be indexed. Also, the look up fields in the other collections should – probably – be indexed.

I think that, often times, we want to avoid ‘COLLSCAN’ because that’s quite slow process.

If you’re getting a different query plan, it may be the way mongoose is building the request.

Edit 1

Useful Links

Edit 2

Maybe userId and updateAt could be indexed in one single index, so the query doesn’t need to access the original collection, and only uses index collection. This should be tested.

steevej · February 3, 2022, 3:34am

Try

db.documents.aggregate([
  {
    "$match": {
      "userId": 1
    }
  },
  {
    "$sort": {
      updatedAt: -1
    }
  },
  {
    "$skip": 0
  },
  {
    "$limit": 10
  }
  {
    "$lookup": {
      "from": "html",
      "localField": "docId",
      "foreignField": "_id",
      "as": "content"
    }
  },
  {
    "$lookup": {
      "from": "resources",
      "localField": "resources",
      "foreignField": "_id",
      "as": "resources"
    }
  }
])

with index

{ userId : 1 , updatedAt : -1 }