Optimizing $lookup Performance Using the Power of Indexing
Rate this tutorial
Have you ever had to join data from two collections and wanted to know how to improve the queries performance? Let's walk through how indexing can help power faster search results when using $lookup.
$lookup is a stage in an aggregation pipeline that performs a left outer join to a collection to filter in documents from the "joined" collection for processing. $lookup creates a new array in the pipeline, where each field is a returned document from the $lookup.
Indexes within a database are extremely powerful, as they can significantly reduce resource usage and query duration. Without them, the queries would have to scan every document in a collection in order to return the query result, which takes a lot of time and resources. It is therefore very important to have indexes in place when using $lookup to “join” collections.
For $lookup performance to be optimized, it is good practice to have indexes on both collections, the source collection and joining collection. That way, the main query on the source collection will benefit from an index as will the query on the joined collection.
To walk you through how we can optimize $lookup queries, we will use movie data as an example. Our query uses data from two collections,
genres and movies. We want to find all the movies that fall under the genres of comedy and drama.Documents from source and joining collections
Collection: genres
1 db.genres.insertMany([ 2 { "_id": 0, "genre": "Comedy" }, 3 { "_id": 1, "genre": "Drama" }, 4 { "_id": 2, "genre": "Action" }, 5 { "_id": 3, "genre": "Romance" }, 6 { "_id": 4, "genre": "Adventure" }, 7 { "_id": 5, "genre": "Family" }, 8 { "_id": 6, "genre": "Fantasy" }, 9 { "_id": 7, "genre": "Thriller" } 10 ])
Collection: movies
1 db.movies.insertMany([ 2 { "_id": 0, "title": "Billy Madison", "genre_codes": [0], "year": 1995, "directors": ["Tamra Davis"] }, 3 { "_id": 1, "title": "Happy Gilmore", "genre_codes": [0, 5], "year": 1996, "directors": ["Dennis Dugan"] }, 4 { "_id": 2, "title": "The Waterboy", "genre_codes": [0, 5], "year": 1998, "directors": ["Frank Coraci"] }, 5 { "_id": 3, "title": "Big Daddy", "genre_codes": [0, 5], "year": 1999, "directors": ["Dennis Dugan"] }, 6 { "_id": 4, "title": "Little Nicky", "genre_codes": [0], "year": 2000, "directors": ["Steven Brill"] }, 7 { "_id": 5, "title": "Mr. Deeds", "genre_codes": [0], "year": 2002, "directors": ["Dennis Dugan"] }, 8 { "_id": 6, "title": "50 First Dates", "genre_codes": [0, 3], "year": 2004, "directors": ["Peter Segal"] }, 9 { "_id": 7, "title": "The Longest Yard", "genre_codes": [0, 2], "year": 2005, "directors": ["Peter Segal"] }, 10 { "_id": 8, "title": "Grown Ups", "genre_codes": [0, 5], "year": 2010, "directors": ["Dennis Dugan"] }, 11 { "_id": 9, "title": "Just Go with It", "genre_codes": [0, 3], "year": 2011, "directors": ["Dennis Dugan"] }, 12 { "_id": 10, "title": "Blended", "genre_codes": [0, 5], "year": 2014, "directors": ["Frank Coraci"] }, 13 { "_id": 11, "title": "The Do-Over", "genre_codes": [0, 2], "year": 2016, "directors": ["Steven Brill"] }, 14 { "_id": 12, "title": "Murder Mystery", "genre_codes": [0, 7], "year": 2019, "directors": ["Kyle Newacheck"] }, 15 { "_id": 13, "title": "You Are So Not Invited to My Bat Mitzvah", "genre_codes": [0], "year": 2023, "directors": ["Sammi Cohen"] }, 16 { "_id": 14, "title": "Punch-Drunk Love", "genre_codes": [1], "year": 2002, "directors": ["Paul Thomas Anderson"] }, 17 { "_id": 15, "title": "Reign Over Me", "genre_codes": [1], "year": 2007, "directors": ["Mike Binder"] }, 18 { "_id": 16, "title": "Funny People", "genre_codes": [1], "year": 2009, "directors": ["Judd Apatow"] }, 19 { "_id": 17, "title": "Uncut Gems", "genre_codes": [1, 7], "year": 2019, "directors": ["Josh Safdie", "Benny Safdie"] }, 20 { "_id": 18, "title": "Bedtime Stories", "genre_codes": [5, 6], "year": 2008, "directors": ["Adam Shankman"] }, 21 { "_id": 19, "title": "Hotel Transylvania", "genre_codes": [5, 6], "year": 2012, "directors": ["Genndy Tartakovsky"] }, 22 { "_id": 20, "title": "Pixels", "genre_codes": [0, 2], "year": 2015, "directors": ["Chris Columbus"] } 23 ])
Desired Output:
1 { 2 "_id": 1, 3 "genre": "Drama", 4 "movies": [ 5 { "title": "Funny People", "year": 2009 }, 6 { "title": "Punch-Drunk Love", "year": 2002 }, 7 { "title": "Reign Over Me", "year": 2007 }, 8 { "title": "Uncut Gems", "year": 2019 } 9 ] 10 } 11 12 { 13 "_id": 0, 14 "genre": "Comedy", 15 "movies": [ 16 { "title": "50 First Dates", "year": 2004 }, 17 { "title": "Big Daddy", "year": 1999 }, 18 { "title": "Billy Madison", "year": 1995 }, 19 { "title": "Blended", "year": 2014 }, 20 { "title": "Grown Ups", "year": 2010 }, 21 { "title": "Happy Gilmore", "year": 1996 }, 22 { "title": "Just Go with It", "year": 2011 }, 23 { "title": "Little Nicky", "year": 2000 }, 24 { "title": "Mr. Deeds", "year": 2002 }, 25 { "title": "Murder Mystery", "year": 2019 }, 26 { "title": "Pixels", "year": 2015 }, 27 { "title": "The Do-Over", "year": 2016 }, 28 { "title": "The Longest Yard", "year": 2005 }, 29 { "title": "The Waterboy", "year": 1998 }, 30 { "title": "You Are So Not Invited to My Bat Mitzvah", "year": 2023 } 31 ] 32 }
We can achieve our task of finding all the comedy and drama movies by creating an aggregation pipeline on the
genres collection and joining data from the movies collection. It is possible to create the aggregation on the movies collection, but for illustration purposes, we are going to assume we are joining from genres to movies.- We want to ensure that we are only going to return the documents that are under the genre of comedy and drama by using the $match stage.
- In the $lookup stage, we will perform a left join on the
moviescollection to retrieve the movie titles and years:
- The
fromfield is our joining collection. In this case, it's themoviescollection. - In the
let, we are assigning the “_id” field in thegenrescollection to the variable “genre_id”, which we will use in the pipeline. - In the
pipeline, we will use$matchto take the documents from themoviescollection where thegenre_id(from thegenrescollection) corresponds to the array “genre_codes” (from the movies collection). Then, we will use$projectto only return the title and year of each movie.
Let's run the aggregation and see how it is performing:
1 db.genres.aggregate([ 2 { "$match": { "genre": { "$in": [ "Comedy", "Drama" ] } } }, 3 { 4 "$lookup": { 5 "from": "movies", 6 "let": { "genre_id": "$_id" }, 7 "pipeline": [ 8 { "$match": { "$expr": { "$in": [ "$$genre_id", "$genre_codes" ] } } }, 9 { "$project": { _id: 0, "title": 1, "year":1 } } 10 ], 11 "as": "movies" 12 } 13 } 14 ]).explain("allPlansExecution")
Results:
1 executionStats: { 2 nReturned: 2, 3 totalKeysExamined: 0, 4 totalDocsExamined: 8, 5 executionStages: { 6 stage: 'COLLSCAN', 7 filter: { genre: { '$in': [ 'Comedy', 'Drama' ] } }, 8 nReturned: 2, 9 ... 10 "totalDocsExamined": 42, 11 "totalKeysExamined": 0, 12 "collectionScans": 2, 13 "indexesUsed": [], 14 "nReturned": 2
Now, we know that using indexes improves the performance of the program, so let's create them on their respective collections:
1 db.genres.createIndex({ "genre": 1 }) 2 db.movies.createIndex({ "genre_codes": 1 })
Let's run the aggregation and see how the performance improved by adding indexes:
1 db.genres.aggregate([ 2 { "$match": { "genre": { "$in": [ "Comedy", "Drama" ] } } }, 3 { 4 "$lookup": { 5 "from": "movies", 6 "let": { "genre_id": "$_id" }, 7 "pipeline": [ 8 { "$match": { "$expr": { "$in": [ "$$genre_id", "$genre_codes" ] } } }, 9 { "$project": { _id: 0, "title": 1, "year":1 } } 10 ], 11 "as": "movies" 12 } 13 } 14 ]).explain("allPlansExecution")
Results:
1 "executionStats": { 2 "nReturned": 2, 3 "totalKeysExamined": 3, 4 "totalDocsExamined": 2, 5 "stage": "IXSCAN", 6 "nReturned": 2, 7 "indexName": "genre_1", 8 ... 9 "totalDocsExamined": 42, 10 "totalKeysExamined": 0, 11 "collectionScans": 2, 12 "indexesUsed": [], 13 "nReturned": 2,
Uh oh! We can see through the
executionStats output that this aggregation did not use the index on the movies that we created. The totalDocsExamined is 42, meanwhile there are only eight documents in the genres collection and 21 in the movies collection! The reason the number of documents examined can get so high is because for each of the two genres, it scans the entire movies collection. This leads to two collection scans, which is suboptimal and can lead to significant performance issues. When a $lookup takes place, each document of the source collection will scan the entire joining collection, causing the amount of documents scanned to be significantly high compared to the amount of documents returned.The reason that this method failed to use an index is that
$expr cannot use an index when one of the operands is an array, like the $genre_codes field in this example.To avoid using an array operand with
$expr to evaluate if the genre codes are present, we can use localField and foreignField instead, which will provide the same functionality:localFieldis the field withingenres, the source collection that we want to perform an equality match withforeignField, in this case_id.foreignFieldis the field withinmovies, the joining collection where we want to perform an equality match withlocalField, in this casegenre_codes.- Since we already established which variables we are going to be using in
pipelinethrough thelocalFieldandforeignField, we can leave theletempty, and remove the$matchstage from ourpipeline.
Aggregate:
1 db.genres.aggregate([ 2 { $match: { genre: { $in: ["Comedy", "Drama"] } } }, 3 { 4 $lookup: { 5 from: "movies", 6 localField: "_id", 7 foreignField: "genre_codes", 8 let: {}, 9 pipeline: [{ $project: { _id: 0, title: 1, year: 1 } }], 10 as: "movies" 11 } 12 } 13 ]).explain("allPlansExecution")
By utilizing
localField and foreignField, we are able to perform an equality match with fields from different collections without using $expr and $in.1 "executionStats": { 2 "nReturned": 2, 3 "totalKeysExamined": 3, 4 "totalDocsExamined": 2, 5 "stage": "IXSCAN", 6 "nReturned": 2, 7 "indexName": "genre_1", 8 ... 9 "totalDocsExamined": 19, 10 "totalKeysExamined": 19, 11 "collectionScans": 0, 12 "indexesUsed": ["genre_codes_1"], 13 "nReturned": 2
We can see from the
.explain() output that there was a 65% improvement in performance. We have successfully used the index genre_codes_1 on the movies collection. This improves performance as we see only 19 documents were examined. However, we can improve performance even more by including the projection fields in the index.Now that we see that implementing
localField and foreignField will use the indexes we created, let's optimize the query by including the projection fields as part of the index to perform a covered query.1 db.movies.createIndex({ "genre_codes":1, "title":1, "year":1}) 2 3 "totalDocsExamined": 0, 4 "totalKeysExamined": 19, 5 "collectionScans": 0, 6 "indexesUsed": [ "genre_codes_1_title_1_year_1" ], 7 "nReturned": 2,
Yay! Our index was used and it performs a covered query (a query that only uses the index to retrieve results)! :) The explain output shows that totalDocsExamined is 0, which means it did not have to fetch any documents, whether for evaluating or projecting. It was all in the index already, ready for use.
Hopefully, this has been a thorough illustration of how you can optimize the performance of $lookup when one of the operands is an array. If you would like to test this in your local environment, please find instructions on creating the
movies and genres collections below.Thanks for reading!
Top Comments in Forums
There are no comments on this article yet.
Related
News & Announcements
Improved Error Messages for Schema Validation in MongoDB 5.0
Jun 14, 2023 | 10 min read
Tutorial
How to Maintain Multiple Versions of a Record in MongoDB (2024 Updates)
Aug 12, 2024 | 6 min read