Optimizing $lookup Performance Using the Power of Indexing

Shani Roffe7 min read • Published Aug 30, 2024 • Updated Aug 30, 2024

MongoDB

Rate this tutorial

Have you ever had to join data from two collections and wanted to know how to improve the queries performance? Let's walk through how indexing can help power faster search results when using $lookup.

What is $lookup?

$lookup is a stage in an aggregation pipeline that performs a left outer join to a collection to filter in documents from the "joined" collection for processing. $lookup creates a new array in the pipeline, where each field is a returned document from the $lookup.

Indexes within a database are extremely powerful, as they can significantly reduce resource usage and query duration. Without them, the queries would have to scan every document in a collection in order to return the query result, which takes a lot of time and resources. It is therefore very important to have indexes in place when using $lookup to “join” collections.

For $lookup performance to be optimized, it is good practice to have indexes on both collections, the source collection and joining collection. That way, the main query on the source collection will benefit from an index as will the query on the joined collection.

Journey of performance improving through the use of indexes

To walk you through how we can optimize $lookup queries, we will use movie data as an example. Our query uses data from two collections, genres and movies. We want to find all the movies that fall under the genres of comedy and drama.

Documents from source and joining collections

Collection: genres

1 db.genres.insertMany([
2     { "_id": 0, "genre": "Comedy" },
3     { "_id": 1, "genre": "Drama" },
4     { "_id": 2, "genre": "Action" },
5     { "_id": 3, "genre": "Romance" },
6     { "_id": 4, "genre": "Adventure" },
7     { "_id": 5, "genre": "Family" },
8     { "_id": 6, "genre": "Fantasy" },
9     { "_id": 7, "genre": "Thriller" }
10 ])

Collection: movies

1     db.movies.insertMany([
2     { "_id": 0, "title": "Billy Madison", "genre_codes": [0], "year": 1995, "directors": ["Tamra Davis"] },
3     { "_id": 1, "title": "Happy Gilmore", "genre_codes": [0, 5], "year": 1996, "directors": ["Dennis Dugan"] },
4     { "_id": 2, "title": "The Waterboy", "genre_codes": [0, 5], "year": 1998, "directors": ["Frank Coraci"] },
5     { "_id": 3, "title": "Big Daddy", "genre_codes": [0, 5], "year": 1999, "directors": ["Dennis Dugan"] },
6     { "_id": 4, "title": "Little Nicky", "genre_codes": [0], "year": 2000, "directors": ["Steven Brill"] },
7     { "_id": 5, "title": "Mr. Deeds", "genre_codes": [0], "year": 2002, "directors": ["Dennis Dugan"] },
8     { "_id": 6, "title": "50 First Dates", "genre_codes": [0, 3], "year": 2004, "directors": ["Peter Segal"] },
9     { "_id": 7, "title": "The Longest Yard", "genre_codes": [0, 2], "year": 2005, "directors": ["Peter Segal"] },
10     { "_id": 8, "title": "Grown Ups", "genre_codes": [0, 5], "year": 2010, "directors": ["Dennis Dugan"] },
11     { "_id": 9, "title": "Just Go with It", "genre_codes": [0, 3], "year": 2011, "directors": ["Dennis Dugan"] },
12     { "_id": 10, "title": "Blended", "genre_codes": [0, 5], "year": 2014, "directors": ["Frank Coraci"] },
13     { "_id": 11, "title": "The Do-Over", "genre_codes": [0, 2], "year": 2016, "directors": ["Steven Brill"] },
14     { "_id": 12, "title": "Murder Mystery", "genre_codes": [0, 7], "year": 2019, "directors": ["Kyle Newacheck"] },
15     { "_id": 13, "title": "You Are So Not Invited to My Bat Mitzvah", "genre_codes": [0], "year": 2023, "directors": ["Sammi Cohen"] },
16     { "_id": 14, "title": "Punch-Drunk Love", "genre_codes": [1], "year": 2002, "directors": ["Paul Thomas Anderson"] },
17     { "_id": 15, "title": "Reign Over Me", "genre_codes": [1], "year": 2007, "directors": ["Mike Binder"] },
18     { "_id": 16, "title": "Funny People", "genre_codes": [1], "year": 2009, "directors": ["Judd Apatow"] },
19     { "_id": 17, "title": "Uncut Gems", "genre_codes": [1, 7], "year": 2019, "directors": ["Josh Safdie", "Benny Safdie"] },
20     { "_id": 18, "title": "Bedtime Stories", "genre_codes": [5, 6], "year": 2008, "directors": ["Adam Shankman"] },
21     { "_id": 19, "title": "Hotel Transylvania", "genre_codes": [5, 6], "year": 2012, "directors": ["Genndy Tartakovsky"] },
22     { "_id": 20, "title": "Pixels", "genre_codes": [0, 2], "year": 2015, "directors": ["Chris Columbus"] }
23 ])

Desired Output:

1 {
2     "_id": 1,
3     "genre": "Drama",
4     "movies": [
5       { "title": "Funny People", "year": 2009 },
6       { "title": "Punch-Drunk Love", "year": 2002 },
7       { "title": "Reign Over Me", "year": 2007 },
8       { "title": "Uncut Gems", "year": 2019 }
9     ]
10 }
11     
12 {
13     "_id": 0,
14     "genre": "Comedy",
15     "movies": [
16       { "title": "50 First Dates", "year": 2004 },
17       { "title": "Big Daddy", "year": 1999 },
18       { "title": "Billy Madison", "year": 1995 },
19       { "title": "Blended", "year": 2014 },
20       { "title": "Grown Ups", "year": 2010 },
21       { "title": "Happy Gilmore", "year": 1996 },
22       { "title": "Just Go with It", "year": 2011 },
23       { "title": "Little Nicky", "year": 2000 },
24       { "title": "Mr. Deeds", "year": 2002 },
25       { "title": "Murder Mystery", "year": 2019 },
26       { "title": "Pixels", "year": 2015 },
27       { "title": "The Do-Over", "year": 2016 },
28       { "title": "The Longest Yard", "year": 2005 },
29       { "title": "The Waterboy", "year": 1998 },
30       { "title": "You Are So Not Invited to My Bat Mitzvah", "year": 2023 }
31     ]
32 }

We can achieve our task of finding all the comedy and drama movies by creating an aggregation pipeline on the genres collection and joining data from the movies collection. It is possible to create the aggregation on the movies collection, but for illustration purposes, we are going to assume we are joining from genres to movies.

We want to ensure that we are only going to return the documents that are under the genre of comedy and drama by using the $match stage.
In the $lookup stage, we will perform a left join on the movies collection to retrieve the movie titles and years:

The from field is our joining collection. In this case, it's the movies collection.
In the let, we are assigning the “_id” field in the genres collection to the variable “genre_id”, which we will use in the pipeline.
In the pipeline, we will use $match to take the documents from the movies collection where the genre_id (from the genres collection) corresponds to the array “genre_codes” (from the movies collection). Then, we will use $project to only return the title and year of each movie.

Let's run the aggregation and see how it is performing:

1    db.genres.aggregate([
2     { "$match": { "genre": { "$in": [ "Comedy", "Drama" ] } } },
3     {
4         "$lookup": {
5             "from": "movies",
6             "let": { "genre_id": "$_id" },
7             "pipeline": [
8                 { "$match": { "$expr": { "$in": [ "$$genre_id", "$genre_codes" ] } } },
9                 { "$project": { _id: 0, "title": 1, "year":1 } }
10             ],
11             "as": "movies"
12         }
13     }
14 ]).explain("allPlansExecution")

Results:

1 executionStats: {
2           nReturned: 2,
3           totalKeysExamined: 0,
4           totalDocsExamined: 8,
5           executionStages: {
6             stage: 'COLLSCAN',
7             filter: { genre: { '$in': [ 'Comedy', 'Drama' ] } },
8             nReturned: 2,
9 	    ...
10 "totalDocsExamined": 42,
11 "totalKeysExamined": 0,
12 "collectionScans": 2,
13 "indexesUsed": [],
14 "nReturned": 2

Now, we know that using indexes improves the performance of the program, so let's create them on their respective collections:

1 db.genres.createIndex({ "genre": 1 })
2 db.movies.createIndex({ "genre_codes": 1 })

Let's run the aggregation and see how the performance improved by adding indexes:

1 db.genres.aggregate([
2     { "$match": { "genre": { "$in": [ "Comedy", "Drama" ] } } },
3     {
4         "$lookup": {
5             "from": "movies",
6             "let": { "genre_id": "$_id" },
7             "pipeline": [
8                 { "$match": { "$expr": { "$in": [ "$$genre_id", "$genre_codes" ] } } },
9                 { "$project": { _id: 0, "title": 1, "year":1 } }
10             ],
11             "as": "movies"
12         }
13     }
14 ]).explain("allPlansExecution")

Results:

1 "executionStats": {
2           "nReturned": 2,
3           "totalKeysExamined": 3,
4           "totalDocsExamined": 2,
5               "stage": "IXSCAN",
6               "nReturned": 2,
7               "indexName": "genre_1",
8               ...
9 "totalDocsExamined": 42,
10 "totalKeysExamined": 0,
11 "collectionScans": 2,
12 "indexesUsed": [],
13 "nReturned": 2,

Uh oh! We can see through the executionStats output that this aggregation did not use the index on the movies that we created. The totalDocsExamined is 42, meanwhile there are only eight documents in the genres collection and 21 in the movies collection! The reason the number of documents examined can get so high is because for each of the two genres, it scans the entire movies collection. This leads to two collection scans, which is suboptimal and can lead to significant performance issues. When a $lookup takes place, each document of the source collection will scan the entire joining collection, causing the amount of documents scanned to be significantly high compared to the amount of documents returned.

The reason that this method failed to use an index is that $expr cannot use an index when one of the operands is an array, like the $genre_codes field in this example.

Using an index on the joining collection

To avoid using an array operand with $expr to evaluate if the genre codes are present, we can use localField and foreignField instead, which will provide the same functionality:

localField is the field within genres, the source collection that we want to perform an equality match with foreignField, in this case _id.
foreignField is the field within movies, the joining collection where we want to perform an equality match with localField, in this case genre_codes.
Since we already established which variables we are going to be using in pipeline through the localField and foreignField, we can leave the let empty, and remove the $match stage from our pipeline.

Aggregate:

1 db.genres.aggregate([
2   { $match: { genre: { $in: ["Comedy", "Drama"] } } },
3   {
4     $lookup: {
5       from: "movies",
6       localField: "_id",
7       foreignField: "genre_codes",
8       let: {},
9       pipeline: [{ $project: { _id: 0, title: 1, year: 1 } }],
10       as: "movies"
11     }
12   }
13 ]).explain("allPlansExecution")

By utilizing localField and foreignField, we are able to perform an equality match with fields from different collections without using $expr and $in.

1 "executionStats": {
2           "nReturned": 2,
3           "totalKeysExamined": 3,
4           "totalDocsExamined": 2,
5           "stage": "IXSCAN",
6               "nReturned": 2,
7               "indexName": "genre_1",
8               ...
9 "totalDocsExamined": 19,
10 "totalKeysExamined": 19,
11 "collectionScans": 0,
12 "indexesUsed": ["genre_codes_1"],
13 "nReturned": 2

We can see from the .explain() output that there was a 65% improvement in performance. We have successfully used the index genre_codes_1 on the movies collection. This improves performance as we see only 19 documents were examined. However, we can improve performance even more by including the projection fields in the index.

Performing a covered query on the joining collection

Now that we see that implementing localField and foreignField will use the indexes we created, let's optimize the query by including the projection fields as part of the index to perform a covered query.

1 db.movies.createIndex({ "genre_codes":1, "title":1, "year":1})
2 
3 "totalDocsExamined": 0,
4 "totalKeysExamined": 19,
5 "collectionScans": 0,
6 "indexesUsed": [ "genre_codes_1_title_1_year_1" ],
7 "nReturned": 2,

Yay! Our index was used and it performs a covered query (a query that only uses the index to retrieve results)! :) The explain output shows that totalDocsExamined is 0, which means it did not have to fetch any documents, whether for evaluating or projecting. It was all in the index already, ready for use.

Hopefully, this has been a thorough illustration of how you can optimize the performance of $lookup when one of the operands is an array. If you would like to test this in your local environment, please find instructions on creating the movies and genres collections below.

Thanks for reading!

For more information on indexes and $lookup

Top Comments in Forums

There are no comments on this article yet.

Start the Conversation

Rate this tutorial

Tutorial

MongoDB Aggregation Pipeline Queries vs SQL Queries

May 10, 2022 | 7 min read

Article

Wordle Solving Using MongoDB Query API Operators

Feb 01, 2023 | 4 min read