Aggregation with indexed date field not working

Patrick_DEMONGUERES · May 20, 2024, 1:07pm

Hello,

having some collections with an index on a created_by.date field, and some metrics using aggregation looking for last hour documents, like this :

"command": {
"aggregate": "recipient",
"pipeline": [
{
"$match": {
"$expr": {
"$lt": [
"created_by.date",
{
"$subtract": [
"$$NOW",
3600000
]
}
]
}
}
},

But the index is not used, besides Mongo is with 7.0 version (>5.0).

I noticed that if I replace the “subtract” part with a hardcoded date, index is correctly used.

Any idea why?

steevej · May 22, 2024, 1:15am

I am surprised that the query works at all.

When inside an $expr, fields must have a $, so you should have $created_by.date rather than simply created_by.date. Because, created_by.date is simply a string rather an field path, I am not surprised that the index is not used. You are not querying a field. Here is some code that shows the differences:

What puzzle me however is when you write

It would be nice to see the exact pipeline you use and its explain plan in order to fully understand. One thing I could think is that since you do not call $substract you also removed the $expr and wrote the match as

Which will provide a valid result and be an IXSCAN.

steevej · May 28, 2024, 12:13pm

@Patrick_DEMONGUERES, could you please provide followup on your issue.

Patrick_DEMONGUERES · May 29, 2024, 12:13am

Hi @steevej , sorry for the delay for responding.

Here are more details with another example.

I got a collection with 1.6M documents, each document having a field creationDate
there is an index created on creationDate

Here is a simple pipeline for an aggregation, targetting documents created in the last 4 days :

[
  {
    $match: {
      $expr: {
        $gte: [
          "$creation_date",
          {
            $subtract: ["$$NOW", 345600000]
          }
        ]
      }
    }
  },
  {
    $group: {
      _id: {
        creation_date: {
          $dateToString: {
            format: "%Y-%m-%d",
            date: "$creation_date"
          }
        }
      },
      total: {
        $sum: 1
      }
    }
  },
  {
    $project: {
      _id: 0,
      creation_date: "$_id.creation_date",
      total: {
        $toInt: "$total"
      }
    }
  }
]

The index is not used :

But modifing the stage “$match” with an explicit date, like this

{
    $match: {
      $expr: {
        $gte: [
          "$creation_date",
          ISODate("2024-05-24T17:11:00.000+00:00")
        ]
      }
    }
  }

Index is well used

Patrick_DEMONGUERES · May 29, 2024, 12:36pm

Completing my previous post : it seems that using $$NOW is the reason why index on creation_date is not used.

Using $$NOW :

  {
    $match: {
      $expr: {
        $gte: [
          "$creation_date", {
            $subtract: ["$$NOW", 345600000]
          }
        ]
      }
    }
  }

Without $$NOW :

  {
    $match: {
      $expr: {
        $gte: [
          "$creation_date", {
            $subtract: [ISODate("2024-05-29T17:11:00.000+00:00"),345600000]
          }
        ]
      }
    }
  }

steevej · May 29, 2024, 1:06pm

That is a very insightful observation.

I cannot work on this for now. I will come back later.

Patrick_DEMONGUERES · May 29, 2024, 1:53pm

Here is a bug related to my issue : https://jira.mongodb.org/browse/SERVER-83524
Thanks to Hermann Baumgartl, who opened another post with similar issue.

steevej · May 29, 2024, 8:42pm

Based on the observation that $$NOW is the culprit, here is a workaround that should leverage the creation_date index.

Rather than aggregating on the main collection you start with:

I could not verify that the $match uses the index, mostly because I get a the error
Expected 'aggregate' to be string, but got int32 instead. when I do db.aggregate(…).explain. But by storing now_minus_delta in a temporary collection and aggregate on the temp. collection I got the following explain for the $lookup

I might be worth trying this workaround on your dataset until the jira is resolved.

Patrick_DEMONGUERES · May 31, 2024, 3:46pm

I’m trying to apply your suggestion : I created a specific collection to store documents like

//collection's name is "workaround"
{_id:"1h", ms:3600000}
{_id:"1d", ms:86400000}
{_id:"4d", ms:345600000}

Then I started a pipeline on my collection workaround like this …

[
  {
    $match:
      {
        _id: "4d"
      }
  },
  {
    $project:
      {
        _id: 0,
        ms: 1,
        threasholddate: {
          $subtract: ["$$NOW", "$ms"]
        }
      }
  }
]

but then I don’t know how to use $lookup to make my search on my targeted collection…

steevej · May 31, 2024, 6:30pm

That is one way to do it.

[
  {
    $match:
      {
        _id: "4d"
      }
  },
  {
    $project:
      {
        _id: 0,
        ms: 1,
        threasholddate: {
          $subtract: ["$$NOW", "$ms"]
        }
      }
  } ,
  { 
        $lookup : {
            from: 'main_connection',
            as: '_tmp',
            let: { threshold: '$threasholddate' },
            pipeline: [
                { '$match': {
                    '$expr': { '$gte': [ '$creation_date', '$$threshold' ] }
                } } ,
                { $group: {
                    _id: {
                       creation_date: {
                           $dateToString: {
                               format: "%Y-%m-%d",
                               date: "$creation_date"
                            }
                       }
                    },
                    total: {   $sum: 1    }
                } }
         ] 
]

Patrick_DEMONGUERES · June 1, 2024, 5:51am

Thanks @steevej , it works well, besides the explain does not show the use of the creation_date index. But the usage’s count of this index increases, and I got an instant result : so, clearly, this pipeline is a good trick!

Patrick_DEMONGUERES · June 1, 2024, 7:32am

Here is the steps to rewrite the pipeline :

create a collection named workaround
insert documents with the value you want to use :

{_id:"1h", ms:3600000}
{_id:"1d", ms:86400000}
{_id:"4d", ms:345600000}

rewrite your pipeline from the workaround collection, with this stages
** stage 1 : $match to target your duration (4d? 1h?)
** stage 2 : $project to calculate your threashold date from $subtract and $$NOW
** stage 3 : $lookup to build a _tmp array, from stage 2 and the main collection. Use your original pipeline as a sub-pipeline in this stage
** stage 4 : $unwind items in the _tmp array
** stage 5 : $project to print the attended result

Here is the rewrite for what I posted here : Aggregation with indexed date field not working - #4 by Patrick_DEMONGUERES

[
  {
    $match: {
      _id: "4d"
    }
  },
  {
    $project: {
      _id: 0,
      ms: 1,
      threasholddate: {
        $subtract: ["$$NOW", "$ms"]
      }
    }
  },
  {
    $lookup: {
      from: "main_collection",
      as: "_tmp",
      let: {
        threshold: "$threasholddate"
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $gte: [
                "$creation_date",
                "$$threshold"
              ]
            }
          }
        },
        {
          $group: {
            _id: {
              creation_date: {
                $dateToString: {
                  format: "%Y-%m-%d",
                  date: "$creation_date"
                }
              }
            },
            total: {
              $sum: 1
            }
          }
        }
      ]
    }
  },
  {
    $unwind: {
      path: "$_tmp"
    }
  },
  {
    $project: {
      _id: 0,
      creation_date: "$_tmp._id.creation_date",
      total: {
        $toInt: "$_tmp.total"
      }
    }
  }
]

system · June 6, 2024, 7:32am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.