MongoDB Bytes #1: Aggregation Learning Activity

What is Aggregation?

Data aggregation is a process whereby data is gathered and expressed in a form that is easily comprehensible. Aggregation helps us in extracting insights and better analysis of our data.

What is an Aggregation Pipeline?

MongoDB’s Aggregation Framework helps us in processing multiple documents and returning computed results. Here are some use cases for aggregation operations:

  • Group values from multiple documents together.
  • Perform operations on the grouped data to return a single result.
  • Analyze data changes over time.
  • And many more!

Let’s explore our data model

For this activity we are going to use a very unique and interesting dataset that can give us a lot of space for creativity.

The dataset we are going to use contains information about deliveries(balls) that happened during IPL, an Annual Cricket League from 2008 to 2020.

I have downloaded a dataset from Kaggle, which I then imported into a database hosted on MongoDB Atlas. Our database has only one collection, named balls.

Each ball has the following keys and values describing all the information about the concerned delivery.

{
  _id: ObjectId("6297057ed8a6137af03a594a"),
  id: 335982,
  inning: 1,
  over: 11,
  ball: 1,
  batsman: "BB McCullum",
  non_striker: "RT Ponting",
  bowler: "SB Joshi",
  batsman_runs: 0,
  extra_runs: 0,
  total_runs: 0,
  non_boundary: 0,
  is_wicket: 0,
  dismissal_kind: "NA",
  player_dismissed: "NA",
  fielder: "NA",
  extras_type: "NA",
  batting_team: "Kolkata Knight Riders",
  bowling_team: "Royal Challengers Bangalore",
}

We will start by installing MongoDB Shell aka mongosh, which is a fully functional JavaScript and Node.js environment for interacting with MongoDB deployments. We will use the MongoDB Shell to test queries and operations directly with our database. Follow the steps mentioned in this guide to download, install and set up mongosh as per your operating system.

Once the download and setup are done, we will connect to a read-only cluster where I have loaded the IPL dataset:

mongosh “mongodb+srv://read_only:OxxrqgRYA1PFs0tf@cluster0.k52jp.mongodb.net/ipl”

Aggregation Pipelines in Action

How to group data based on the value of a field and calculate the sum?

As we just saw that every ball in the balls collection contains a field called extras_type and extra_runs, every time a bowler throws an illegal delivery like a Wide Ball, No Ball, Leg Byes, etc., the other team will be awarded some extra runs.

Whenever the delivery is legal the extra_runs field would contain 0 runs.

For this exercise, we want to find the distribution of all kinds of extra types by the total number of runs per extra type.

To achieve this, we have the following pipeline with these stages:

  • $match: filtering out the balls that are marked as extra, we are doing so by extracting all the balls where the extra_runs field is greater than 0
  • $group: totaling the runs by the extras_type
db.balls.aggregate([
  {
    $match: {
      extra_runs: { $gt: 0 },
    },
  },

  {
    $group: {
      _id: "$extras_type",

      totalRuns: {
        $sum: "$extra_runs",
      },
    },
  },
]);

How to sort and limit the number of documents after grouping?

Suppose we want to find the top 20 batsmen by the total number of runs scored by them.

Therefore, we can utilize the aggregation pipeline with the following stages to achieve the results:

  • $group: totaling(summing up) the runs scored by every single batsman
  • $sort: sorting the results in descending order on the basis of total runs scored by the batsman
  • $limit: limiting the number of results to 20.
db.balls.aggregate([
  {
    $group: {
      _id: "$batsman",

      totalRuns: {
        $sum: "$batsman_runs",
      },
    },
  },

  {
    $sort: {
      totalRuns: -1,
    },
  },

  {
    $limit: 20,
  },
]);

How to add a new calculated field to the documents?

Let’s take an example where we want to find the most efficient batsman who has played at least 300 balls.

Hint: Here, the most efficient batsman would be a batsman who has the best Strike-rate.

For example, a batsman who has scored 50 runs in 40 balls, their strike-rate would be: 125.

In the following pipeline,

  • $match: eliminating all the runs that were extra
  • $group: grouping all the balls played by the batsman and totaling the runs scored by them, along with summing up the number of balls played by them
  • $addFields: adding the strike-rate field by dividing the total runs by total balls and multiplying the result by 100
  • $sort: sorting the strike rate in descending order on the basis of total runs scored by the batsman
db.balls.aggregate([
  {
    $match: {
      extras_type: "NA",
    },
  },

  {
    $group: {
      _id: "$batsman",

      totalBalls: { $sum: 1 },

      totalRuns: { $sum: "$batsman_runs" },
    },
  },

  {
    $addFields: {
      strikeRate: {
        $multiply: [{ $divide: ["$totalRuns", "$totalBalls"] }, 100],
      },
    },
  },

  {
    $sort: {
      strikeRate: -1,
    },
  },
]);

11 Likes

Thanks for the information!