Aggregate counts by day, months, year on a collection with an indexed timestamp field

Hi,

Can someone help me with this query?

[
  {
     // Match all docs, from bigbang to doomsday
     $match: {
      timestamp: {
         $gte: 0,
         $lt: 32472147600000
      }
    }
  },
  {
    $facet: {
      daily: [
        {
          $group: {
            _id: {
              year: { $year: { $toDate: "$timestamp" } },
              month: { $month: { $toDate: "$timestamp" } },
              day: { $dayOfMonth: { $toDate: "$timestamp" } }
            },
            count: { $sum: 1 }
          },
        },
        {
          $sort: {
            "_id.year": 1,
            "_id.month": 1,
            "_id.day": 1
          },
        },
        {
          $project: {
            _id: 0,
            year: "$_id.year",
            month: "$_id.month",
            day: "$_id.day",
            count: 1
          }
        }
      ],
    },
  },
]

The documents inside the collection are like this one:

{
  "deviceKey": "18:A:B",
  "state": {
    "rotation_x": 1.006,
    "rotation_y": -2.404
  },
  "device": {
    "physicalId": "B",
    "id": 418,
    "thing": {
      "physicalId": "A",
      "id": 108,
      "enabled": true,
      "network": {
        "id": 18
      }
    },
    "enabled": true
  },
  "timestamp": {
    "$numberLong": "1635109200000"
  },
}

As a start, I’d just convert the field to a date once, there is no reason to do it three times, convert it before you use it in the facet. I assume you have three grouping facets, so extract the day, month year BEFORE you hit the facet so all stages can share the computed values.

You’re also not going to be able to make use of proper date functions efficiently using a tick to store the time, I’d look at storing it as a proper date, do you use this query a lot?

That fact you have an index is of no use in this scenario, your’re calculating a value and that is not indexed.

But why does it say that the projection takes 378ms and the IXSCAN takes 1.2s, BUT the execution time takes 31 seconds?

Thanks for the other suggestions, I’ll implement them.

I tried to implement your suggestions (I think) but now I get an additional FETCH in the explain that makes things a slower

[
  {
    // Match all docs, from bigbang to doomsday
    $match: {
      timestamp: {
        $gte: 0,
        $lt: 32472147600000
      }
    }
  },
  {
    // Convert timestamp to date once and extract day, month, and year
    $addFields: {
      date: { $toDate: "$timestamp" },
      year: { $year: { $toDate: "$timestamp" } },
      month: { $month: { $toDate: "$timestamp" } },
      day: { $dayOfMonth: { $toDate: "$timestamp" } }
    }
  },
  {
    $facet: {
      daily: [
        {
          $group: {
            _id: {
              year: "$year",
              month: "$month",
              day: "$day"
            },
            count: { $sum: 1 }
          },
        },
        {
          $sort: {
            "_id.year": 1,
            "_id.month": 1,
            "_id.day": 1
          },
        },
        {
          $project: {
            _id: 0,
            year: "$_id.year",
            month: "$_id.month",
            day: "$_id.day",
            count: 1
          }
        }
      ],
    },
  },
]

Ideally you want to hit the groups with sorted data, in this case with the data in that form I’m not sure what other improvements you can make to be honest.
I’m tied up at the moment with work so can’t repro locally but from the screenshot you’re running this over 10M docs approx?
Computing and grouping on that many records is going to be slow, if this is a common query then perhaps look at an alternative storage, bucket pattern perhaps and then you can easily count and group at a higher level?

1 Like

Try projecting the timestamp right after the match.

Since I had other fields in the collection: year, month, day, hour, minute in addition to the raw timestamp, I changed the query to use these new fields (numerical). But what bothers me is that I always see a big difference between the sum of the times reported by the Explain and the total execution time. Do you know why is that? What is that hidden difference in time (33 seconds - 1.5 seconds)?

[
  {
    // Match all docs, from the BigBang to the Doomsday
    $match: {
      year: {
        $gte: 0,
        $lt: 802701
      }
    }
    // year: {
    //   $eq: 2023
    // },
    // month: {
    //   $eq: 08
    // }
  },
  {
    $facet: {
      daily: [
        {
          $group: {
            _id: {
              year: "$year",
              month: "$month",
              day: "$day"
            },
            count: { $sum: 1 }
          }
        },
        {
          $sort: {
            "_id.year": 1,
            "_id.month": 1,
            "_id.day": 1
          }
        },
        {
          $project: {
            _id: 0,
            year: "$_id.year",
            month: "$_id.month",
            day: "$_id.day",
            count: 1
          }
        }
      ],
    }
  }
]

I tried it, but it gets a lot worse.

This is the modified query:

[
  {
    "$match": {
      "timestamp": {
        "$gte": 0,
        "$lt": 32472147600000
      }
    }
  },
  {
    "$project": {
      "timestamp": 1
    }
  },
  {
    "$addFields": {
      "date": { "$toDate": "$timestamp" },
      "year": { "$year": { "$toDate": "$timestamp" } },
      "month": { "$month": { "$toDate": "$timestamp" } },
      "day": { "$dayOfMonth": { "$toDate": "$timestamp" } }
    }
  },
  {
    "$facet": {
      "daily": [
        {
          "$group": {
            "_id": {
              "year": "$year",
              "month": "$month",
              "day": "$day"
            },
            "count": { "$sum": 1 }
          }
        },
        {
          "$sort": {
            "_id.year": 1,
            "_id.month": 1,
            "_id.day": 1
          }
        },
        {
          "$project": {
            "_id": 0,
            "year": "$_id.year",
            "month": "$_id.month",
            "day": "$_id.day",
            "count": 1
          }
        }
      ]
    }
  }
]

one last suggestion before I give up.

in your first $project add _id:0 to end up with

"$project": {
      "timestamp": 1 ,
      "_id" : 0
    }