Docs Menu

Docs HomeDevelop ApplicationsPython DriversPyMongo

Group and Total

On this page

  • Introduction
  • Aggregation Task Summary
  • Before You Get Started
  • Tutorial
  • Add a match stage for orders in 2020
  • Add a sort stage to sort by order date
  • Add a group stage to group by email address
  • Add a sort stage to sort by first order date
  • Add a set stage to display the email address
  • Add an unset stage to remove unneeded fields
  • Run the aggregation pipeline
  • Interpret results

In this tutorial, you can learn how to use PyMongo to construct an aggregation pipeline, perform the aggregation on a collection, and print the results by completing and running a sample app. This aggregation performs the following operations:

  • Matches a subset of documents by a field value

  • Groups documents by common field values

  • Adds computed fields to each result document

This tutorial demonstrates how to group and analyze customer order data. The results show the list of customers who purchased items in 2020 and includes each customer's order history for 2020.

This example uses one collection, orders, which contains documents describing individual product orders. Since each order can correspond to only one customer, the order documents are grouped by the customer_id field, which contains customer email addresses.

Before you start this tutorial, complete the Aggregation Template App instructions to set up a working Python application.

After you set up the app, access the orders collection by adding the following code to the application:

orders_coll = agg_db["orders"]

Delete any existing data and insert sample data into the orders collection as shown in the following code:

orders_coll.delete_many({})
order_data = [
{
"customer_id": "elise_smith@myemail.com",
"orderdate": datetime(2020, 5, 30, 8, 35, 52),
"value": 231
},
{
"customer_id": "elise_smith@myemail.com",
"orderdate": datetime(2020, 1, 13, 9, 32, 7),
"value": 99
},
{
"customer_id": "oranieri@warmmail.com",
"orderdate": datetime(2020, 1, 1, 8, 25, 37),
"value": 63
},
{
"customer_id": "tj@wheresmyemail.com",
"orderdate": datetime(2019, 5, 28, 19, 13, 32),
"value": 2
},
{
"customer_id": "tj@wheresmyemail.com",
"orderdate": datetime(2020, 11, 23, 22, 56, 53),
"value": 187
},
{
"customer_id": "tj@wheresmyemail.com",
"orderdate": datetime(2020, 8, 18, 23, 4, 48),
"value": 4
},
{
"customer_id": "elise_smith@myemail.com",
"orderdate": datetime(2020, 12, 26, 8, 55, 46),
"value": 4
},
{
"customer_id": "tj@wheresmyemail.com",
"orderdate": datetime(2021, 2, 28, 7, 49, 32),
"value": 1024
},
{
"customer_id": "elise_smith@myemail.com",
"orderdate": datetime(2020, 10, 3, 13, 49, 44),
"value": 102
}
]
orders_coll.insert_many(order_data)
1

First, add a $match stage that matches orders placed in 2020:

pipeline.append({
"$match": {
"orderdate": {
"$gte": datetime(2020, 1, 1, 0, 0, 0),
"$lt": datetime(2021, 1, 1, 0, 0, 0)
}
}
})
2

Next, add a $sort stage to set an ascending sort on the orderdate field to surface the earliest 2020 purchase for each customer in the next stage:

pipeline.append({
"$sort": {
"orderdate": 1
}
})
3

Add a $group stage to group orders by the value of the customer_id field. In this stage, add aggregation operations that create the following fields in the result documents:

  • first_purchase_date: the date of the customer's first purchase

  • total_value: the total value of all the customer's purchases

  • total_orders: the total number of the customer's purchases

  • orders: the list of all the customer's purchases, including the date and value of each purchase

pipeline.append({
"$group": {
"_id": "$customer_id",
"first_purchase_date": {"$first": "$orderdate"},
"total_value": {"$sum": "$value"},
"total_orders": {"$sum": 1},
"orders": {"$push": {"orderdate": "$orderdate", "value": "$value"}}
}
})
4

Next, add another $sort stage to set an ascending sort on the first_purchase_date field:

pipeline.append({
"$sort": {
"first_purchase_date": 1
}
})
5

Add a $set stage to recreate the customer_id field from the values in the _id field that were set during the $group stage:

pipeline.append({
"$set": {
"customer_id": "$_id"
}
})
6

Finally, add an $unset stage. The $unset stage removes the _id field from the result documents:

pipeline.append({"$unset": ["_id"]})
7

Add the following code to the end of your application to perform the aggregation on the orders collection:

aggregation_result = orders_coll.aggregate(pipeline)

Finally, run the following command in your shell to start your application:

python3 agg_tutorial.py
8

The aggregation returns the following summary of customers' orders from 2020:

{
'first_purchase_date': datetime.datetime(2020, 1, 1, 8, 25, 37),
'total_value': 63,
'total_orders': 1,
'orders': [ { 'orderdate': datetime.datetime(2020, 1, 1, 8, 25, 37), 'value': 63 } ],
'customer_id': 'oranieri@warmmail.com'
}
{
'first_purchase_date': datetime.datetime(2020, 1, 13, 9, 32, 7),
'total_value': 436,
'total_orders': 4,
'orders': [
{ 'orderdate': datetime.datetime(2020, 1, 13, 9, 32, 7), 'value': 99 },
{ 'orderdate': datetime.datetime(2020, 5, 30, 8, 35, 52), 'value': 231 },
{ 'orderdate': datetime.datetime(2020, 10, 3, 13, 49, 44), 'value': 102 },
{ 'orderdate': datetime.datetime(2020, 12, 26, 8, 55, 46), 'value': 4 }
],
'customer_id': 'elise_smith@myemail.com'
}
{
'first_purchase_date': datetime.datetime(2020, 8, 18, 23, 4, 48),
'total_value': 191,
'total_orders': 2,
'orders': [
{ 'orderdate': datetime.datetime(2020, 8, 18, 23, 4, 48), 'value': 4 },
{ 'orderdate': datetime.datetime(2020, 11, 23, 22, 56, 53), 'value': 187 }
],
'customer_id': 'tj@wheresmyemail.com'
}

The result documents contain details from all the orders from a given customer, grouped by the customer's email address.

To view the complete code for this tutorial, see the Completed Group and Total App on GitHub.

← Filtered Subset