Aggregation Pipeline Optimization
On this page
Aggregation pipeline operations have an optimization phase which attempts to reshape the pipeline for improved performance.
To see how the optimizer transforms a particular aggregation pipeline,
include the explain
option in the
db.collection.aggregate()
method.
Optimizations are subject to change between releases.
You can run aggregation pipelines in the UI for deployments hosted in MongoDB Atlas.
Projection Optimization
The aggregation pipeline can determine if it requires only a subset of the fields in the documents to obtain the results. If so, the pipeline will only use those required fields, reducing the amount of data passing through the pipeline.
Pipeline Sequence Optimization
($project
or $unset
or $addFields
or $set
) + $match
Sequence Optimization
For an aggregation pipeline that contains a projection stage
($addFields
, $project
, $set
, or
$unset
) followed by a $match
stage, MongoDB
moves any filters in the $match
stage that do not require values
computed in the projection stage to a new $match
stage before the
projection.
If an aggregation pipeline contains multiple projection or $match
stages, MongoDB performs this optimization for each $match
stage,
moving each $match
filter before all projection stages that the
filter does not depend on.
Consider a pipeline with the following stages:
{ $addFields: { maxTime: { $max: "$times" }, minTime: { $min: "$times" } } }, { $project: { _id: 1, name: 1, times: 1, maxTime: 1, minTime: 1, avgTime: { $avg: ["$maxTime", "$minTime"] } } }, { $match: { name: "Joe Schmoe", maxTime: { $lt: 20 }, minTime: { $gt: 5 }, avgTime: { $gt: 7 } } }
The optimizer breaks up the $match
stage into four individual
filters, one for each key in the $match
query document. The
optimizer then moves each filter before as many projection stages as
possible, creating new $match
stages as needed.
Given this example, the optimizer automatically produces the following optimized pipeline:
{ $match: { name: "Joe Schmoe" } }, { $addFields: { maxTime: { $max: "$times" }, minTime: { $min: "$times" } } }, { $match: { maxTime: { $lt: 20 }, minTime: { $gt: 5 } } }, { $project: { _id: 1, name: 1, times: 1, maxTime: 1, minTime: 1, avgTime: { $avg: ["$maxTime", "$minTime"] } } }, { $match: { avgTime: { $gt: 7 } } }
Note
The optimized pipeline is not intended to be run manually. The original and optimized pipelines return the same results.
You can see the optimized pipeline in the explain plan.
The $match
filter { avgTime: { $gt: 7 } }
depends on the
$project
stage to compute the avgTime
field. The
$project
stage is the last projection stage in this
pipeline, so the $match
filter on avgTime
could not be
moved.
The maxTime
and minTime
fields are computed in the
$addFields
stage but have no dependency on the
$project
stage. The optimizer created a new
$match
stage for the filters on these fields and placed it
before the $project
stage.
The $match
filter { name: "Joe Schmoe" }
does not
use any values computed in either the $project
or
$addFields
stages so it was moved to a new
$match
stage before both of the projection stages.
After optimization, the filter { name: "Joe Schmoe" }
is in a
$match
stage at the beginning of the pipeline. This has the
added benefit of allowing the aggregation to use an index on the
name
field when initially querying the collection.
$sort
+ $match
Sequence Optimization
When you have a sequence with $sort
followed by a
$match
, the $match
moves before the
$sort
to minimize the number of objects to sort. For
example, if the pipeline consists of the following stages:
{ $sort: { age : -1 } }, { $match: { status: 'A' } }
During the optimization phase, the optimizer transforms the sequence to the following:
{ $match: { status: 'A' } }, { $sort: { age : -1 } }
$redact
+ $match
Sequence Optimization
When possible, when the pipeline has the $redact
stage
immediately followed by the $match
stage, the aggregation
can sometimes add a portion of the $match
stage before the
$redact
stage. If the added $match
stage is at
the start of a pipeline, the aggregation can use an index as well as
query the collection to limit the number of documents that enter the
pipeline. See Pipeline Operators and Indexes for
more information.
For example, if the pipeline consists of the following stages:
{ $redact: { $cond: { if: { $eq: [ "$level", 5 ] }, then: "$$PRUNE", else: "$$DESCEND" } } }, { $match: { year: 2014, category: { $ne: "Z" } } }
The optimizer can add the same $match
stage before the
$redact
stage:
{ $match: { year: 2014 } }, { $redact: { $cond: { if: { $eq: [ "$level", 5 ] }, then: "$$PRUNE", else: "$$DESCEND" } } }, { $match: { year: 2014, category: { $ne: "Z" } } }
$project
/$unset
+ $skip
Sequence Optimization
New in version 3.2.
When you have a sequence with $project
or $unset
followed by
$skip
, the $skip
moves before $project
. For example, if
the pipeline consists of the following stages:
{ $sort: { age : -1 } }, { $project: { status: 1, name: 1 } }, { $skip: 5 }
During the optimization phase, the optimizer transforms the sequence to the following:
{ $sort: { age : -1 } }, { $skip: 5 }, { $project: { status: 1, name: 1 } }
Pipeline Coalescence Optimization
When possible, the optimization phase coalesces a pipeline stage into its predecessor. Generally, coalescence occurs after any sequence reordering optimization.
$sort
+ $limit
Coalescence
Changed in version 4.0.
When a $sort
precedes a $limit
, the optimizer
can coalesce the $limit
into the $sort
if no
intervening stages modify the number of documents
(e.g. $unwind
, $group
).
MongoDB will not coalesce the $limit
into the
$sort
if there are pipeline stages that change the number of
documents between the $sort
and $limit
stages..
For example, if the pipeline consists of the following stages:
{ $sort : { age : -1 } }, { $project : { age : 1, status : 1, name : 1 } }, { $limit: 5 }
During the optimization phase, the optimizer coalesces the sequence to the following:
{ "$sort" : { "sortKey" : { "age" : -1 }, "limit" : NumberLong(5) } }, { "$project" : { "age" : 1, "status" : 1, "name" : 1 } }
This allows the sort operation to only maintain the
top n
results as it progresses, where n
is the specified limit,
and MongoDB only needs to store n
items in memory
[1]. See $sort
Operator and Memory for more
information.
Note
Sequence Optimization with $skip
[1] | The optimization will still apply when
allowDiskUse is true and the n items exceed the
aggregation memory limit. |
$limit
+ $limit
Coalescence
When a $limit
immediately follows another
$limit
, the two stages can coalesce into a single
$limit
where the limit amount is the smaller of the two
initial limit amounts. For example, a pipeline contains the following
sequence:
{ $limit: 100 }, { $limit: 10 }
Then the second $limit
stage can coalesce into the first
$limit
stage and result in a single $limit
stage where the limit amount 10
is the minimum of the two initial
limits 100
and 10
.
{ $limit: 10 }
$skip
+ $skip
Coalescence
When a $skip
immediately follows another $skip
,
the two stages can coalesce into a single $skip
where the
skip amount is the sum of the two initial skip amounts. For example, a
pipeline contains the following sequence:
{ $skip: 5 }, { $skip: 2 }
Then the second $skip
stage can coalesce into the first
$skip
stage and result in a single $skip
stage where the skip amount 7
is the sum of the two initial
limits 5
and 2
.
{ $skip: 7 }
$match
+ $match
Coalescence
When a $match
immediately follows another
$match
, the two stages can coalesce into a single
$match
combining the conditions with an
$and
. For example, a pipeline contains the following
sequence:
{ $match: { year: 2014 } }, { $match: { status: "A" } }
Then the second $match
stage can coalesce into the first
$match
stage and result in a single $match
stage
{ $match: { $and: [ { "year" : 2014 }, { "status" : "A" } ] } }
$lookup
+ $unwind
Coalescence
New in version 3.2.
When a $unwind
immediately follows another
$lookup
, and the $unwind
operates on the as
field of the $lookup
, the optimizer can coalesce the
$unwind
into the $lookup
stage. This avoids
creating large intermediate documents.
For example, a pipeline contains the following sequence:
{ $lookup: { from: "otherCollection", as: "resultingArray", localField: "x", foreignField: "y" } }, { $unwind: "$resultingArray"}
The optimizer can coalesce the $unwind
stage into the
$lookup
stage. If you run the aggregation with explain
option, the explain
output shows the coalesced stage:
{ $lookup: { from: "otherCollection", as: "resultingArray", localField: "x", foreignField: "y", unwinding: { preserveNullAndEmptyArrays: false } } }
Indexes
An aggregation pipeline can use indexes from the input collection to improve performance. Using an index limits the amount of documents a stage processes. Ideally, an index can cover the stage query. A covered query has especiallly high performance, since the index returns all matching documents.
For example, a pipeline that consists of $match
,
$sort
, $group
can benefit from indexes at
every stage:
An index on the
$match
query field can efficiently identify the relevant dataAn index on the sorting field can return data in sorted order for the
$sort
stageAn index on the grouping field that matches the
$sort
order can return all of the field values needed to execute the$group
stage (a covered query)
To determine whether a pipeline uses indexes, review the query plan and
look for IXSCAN
or DISTINCT_SCAN
plans.
Note
In some cases, the query planner uses a DISTINCT_SCAN
index plan
that returns one document per index key value. DISTINCT_SCAN
executes faster than IXSCAN
if there are multiple documents per
key value. However, index scan parameters might affect the time
comparison of DISTINCT_SCAN
and IXSCAN
.
For early stages in your aggregation pipeline, consider indexing the query fields. Stages that can benefit from indexes are:
$match
stage$match
can use an index to filter documents if it is the first stage in the pipeline, after any optimizations from the query planner.$sort
stage$sort
can benefit from an index as long as it is not preceded by a$project
,$unwind
, or$group
stage.$group
stage$group
can use an index to find the first document in each group if it meets all of the following conditions:an index exists that matches the sort order on the grouped field
See $group Performance Optimizations for an example.
$geoNear
stage$geoNear
always uses an index, since it must be the first stage in a pipeline and requires a geospatial index.
Additionally, stages later in the pipeline that retrieve data from other, unmodified collections can use indexes on those collections for optimization. These stages include:
Indexes can cover queries in an aggregation pipeline. A covered query uses an index to return all of the documents and has high performance.
Example
$sort
+ $skip
+ $limit
Sequence
A pipeline contains a sequence of $sort
followed by a
$skip
followed by a $limit
:
{ $sort: { age : -1 } }, { $skip: 10 }, { $limit: 5 }
The optimizer performs $sort
+ $limit
Coalescence to
transforms the sequence to the following:
{ "$sort" : { "sortKey" : { "age" : -1 }, "limit" : NumberLong(15) } }, { "$skip" : NumberLong(10) }
MongoDB increases the $limit
amount with the reordering.