Introduction to the MongoDB Aggregation Framework
Rate this quickstart
One of the difficulties when storing any data is knowing how it will be accessed in the future. What reports need to be run on it? What information is "hidden" in there that will allow for meaningful insights for your business? After spending the time to design your data schema in an appropriate fashion for your application, one needs to be able to retrieve it. In MongoDB, there are two basic ways that data retrieval can be done: through queries with the find() command, and through analytics using the aggregation framework and the aggregate() command.
find()
allows for the querying of data based on a condition. One can filter results, do basic document transformations, sort the documents, limit the document result set, etc. The aggregate()
command opens the door to a whole new world with the aggregation framework. In this series of posts, I'll take a look at some of the reasons why using the aggregation framework is so powerful, and how to harness that power.A frequently asked question is why do aggregation inside MongoDB at all? From the MongoDB documentation:
Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
By using the built-in aggregation operators available in MongoDB, we are able to do analytics on a cluster of servers we're already using without having to move the data to another platform, like Apache Spark or Hadoop. While those, and similar, platforms are fast, the data transfer from MongoDB to them can be slow and potentially expensive. By using the aggregation framework the work is done inside MongoDB and then the final results can be sent to the application typically resulting in a smaller amount of data being moved around. It also allows for the querying of the LIVE version of the data and not an older copy of data from a batch.
Aggregation in MongoDB allows for the transforming of data and results in a more powerful fashion than from using the
find()
command. Through the use of multiple stages and expressions, you are able to build a "pipeline" of operations on your data to perform analytic operations. What do I mean by a "pipeline"? The aggregation framework is conceptually similar to the *nix
command line pipe, |
. In the *nix
command line pipeline, a pipe transfers the standard output to some other destination. The output of one command is sent to another command for further processing.In the aggregation framework, we think of stages instead of commands. And the stage "output" is documents. Documents go into a stage, some work is done, and documents come out. From there they can move onto another stage or provide output.
At the time of this writing, there are twenty-eight different aggregation stages available. These different stages provide the ability to do a wide variety of tasks. For example, we can build an aggregation pipeline that matches a set of documents based on a set of criteria, groups those documents together, sorts them, then returns that result set to us.
Or perhaps our pipeline is more complicated and the document flows through the
$match
, $unwind
, $group
, $sort
, $limit
, $project
, and finally a $skip
stage.This can be confusing and some of these concepts are worth repeating. Therefore, let's break this down a bit further:
- A pipeline starts with documents
- These documents come from a collection, a view, or a specially designed stage
- In each stage, documents enter, work is done, and documents exit
- The stages themselves are defined using the document syntax
Let's take a look at an example pipeline. Our documents are from the Sample Data that's available in MongoDB Atlas and the
routes
collection in the sample_training
database. Here's a sample document:1 { 2 "_id":{ 3 "$oid":"56e9b39b732b6122f877fa31" 4 }, 5 "airline":{ 6 "id":{ 7 "$numberInt":"410" 8 }, 9 "name":"Aerocondor" 10 ,"alias":"2B" 11 ,"iata":"ARD" 12 }, 13 "src_airport":"CEK", 14 "dst_airport":"KZN", 15 "Codeshare":"", 16 "stops":{ 17 "$numberInt":"0" 18 }, 19 "airplane":"CR2" 20 }
If you haven't yet set up your free cluster on MongoDB Atlas, now is a great time to do so. You have all the instructions in this blog post.
For this example query, let's find the top three airlines that offer the most direct flights out of the airport in Portland, Oregon, USA (PDX). To start with, we'll do a
$match
stage so that we can concentrate on doing work only on those documents that meet a base of conditions. In this case, we'll look for documents with a src_airport
, or source airport, of PDX and that are direct flights, i.e. that have zero stops.1 { 2 $match: { 3 "src_airport": "PDX", 4 "stops": 0 5 } 6 }
That reduces the number of documents in our pipeline down from 66,985 to 113. Next, we'll group by the airline name and count the number of flights:
1 { 2 $group: { 3 _id: { 4 "airline name": "$airline.name" 5 }, 6 count: { 7 $sum: 1 8 } 9 } 10 }
With the addition of the
$group
stage, we're down to 16 documents. Let's sort those with a $sort
stage and sort in descending order:1 { 2 $sort: { 3 count: -1 4 }
Then we can add a
$limit
stage to just have the top three airlines that are servicing Portland, Oregon:1 { 2 $limit: 3 3 }
After putting the documents in the
sample_training.routes
collection through this aggregation pipeline, our results show us that the top three airlines offering non-stop flights departing from PDX are Alaska, American, and United Airlines with 39, 17, and 13 flights, respectively.How does this look in code? It's fairly straightforward with using the
db.aggregate()
function. For example, in Python you would do something like:1 from pymongo import MongoClient 2 3 # Requires the PyMongo package. 4 # The dnspython package is also required to use a mongodb+src URI string 5 # https://api.mongodb.com/python/current 6 7 client = MongoClient('YOUR-ATLAS-CONNECTION-STRING') 8 result = client['sample_training']['routes'].aggregate([ 9 { 10 '$match': { 11 'src_airport': 'PDX', 12 'stops': 0 13 } 14 }, { 15 '$group': { 16 '_id': { 17 'airline name': '$airline.name' 18 }, 19 'count': { 20 '$sum': 1 21 } 22 } 23 }, { 24 '$sort': { 25 'count': -1 26 } 27 }, { 28 '$limit': 3 29 } 30 ])
The aggregation code is pretty similar in other languages as well.
The MongoDB aggregation framework is an extremely powerful set of tools. The processing is done on the server itself which results in less data being sent over the network. In the example used here, instead of pulling all of the documents into an application and processing them in the application, the aggregation framework allows for only the three documents we wanted from our query to be sent back to the application.
This was just a brief introduction to some of the operators available. Over the course of this series, I'll take a closer look at some of the most popular aggregation framework operators as well as some interesting, but less used ones. I'll also take a look at performance considerations of using the aggregation framework.