Time series queries performance is poor

Mihai_Oprescu · November 21, 2024, 12:42pm

Hello,
I have a time-series use case question…

I have a Reviews collection which contains 370k records. the reviews collection contains data such as:

companyid: 'company-a',
reviewedDate: new Date(...),
comment: 'etc',
score: 3

To demonstrate my issue, I have a simple aggregation that calculates an average of all Review rows.

#1 - Have 1 big time-series collection for all companies (370k records), and use timeField: reviewedDate and metaField: companyid. When I query this collection, the average is around 885ms.

#2 - Have a separate, regular collection for each company. When i query this collection with the same aggregation, the average is around 230ms.

Why am I confused: Given that I am providing a metaField as companyid, I am expecting the time series collection to perform with the same performance as the individual company collections.

This is my aggregation query:

[
  {
    "$match": {
      "companyid": "c8388ea5-a33b-4c54-b19d-557c062aa31e",
      "reviewedDate": {
        "$gt": "2024-08-21T04:00:00.000Z"
      },
      "status": "Reviewed",
      "enabled": true
    }
  },
  {
    "$group": {
      "_id": null,
      "averageScore": {
        "$avg": "$score"
      },
      "count": {
        "$sum": 1
      },
      "reviewId": {
        "$addToSet": "$id"
      }
    }
  },
  {
    "$project": {
      "_id": 0,
      "dateRange": {
        "start": "1724212800"
      },
      "score": {
        "$round": [
          {
            "$multiply": [
              "$averageScore",
              10
            ]
          }
        ]
      },
      "reviews": "$count"
    }
  }
]

Here are the results:

Okay, this is the end of the first question.

The second part… I tried to compare the performance of a single regular collection vs a single time-series collection:

Single regular collection with the same aggregation query came in at an average of 885ms.
Time-series collection with the same aggregation query came in at an average of 780ms.
The performance difference between the two is dissappointing and I am wondering if I am doing something wrong?

Given these results, I am inclined to believe that it is better to separate collections per company, which makes it much more difficult to update and work with - I really thought that metaField would index these companies’ data in the background.

What have I done to see if I am doing something wrong:

When I bulk insert the data into the time-series collection, I sort in ascending order
The # of elements that are actually used come to 168 (out of 370k records), so 880ms seems high
I am using the latest Mongo version (8 on the Mongo Cloud, and dependency 6.0 on my NodeJS application)

This is my time-series single collection configuration:

What do you recommend I try differently?

Thank you very much,
Mihai

Jennysson_Junior · November 22, 2024, 4:12pm

Hi @Mihai_Oprescu welcome back!

First, it’s important to mention your observations:

Per-Company Collections: When you use separate collections for each company, you’re effectively isolating your dataset, minimizing the scope of the query, and potentially benefiting from better query planning and execution. This is why you see faster performance (~230ms) for these queries.
Single Time-Series Collection: While time-series collections offer storage and query optimizations for time-series workloads, the presence of metadata fields like companyid does not automatically partition or filter the data during query execution. The metadata field helps group related data logically and is used in bucket creation for storage optimization. However, for efficient query execution, you must explicitly define an index on the metadata field (companyid) combined with the time field (reviewedDate) or other fields you frequently filter on.
For example, a compound index like { companyid: 1, reviewedDate: 1 } ensures that MongoDB can quickly narrow down the relevant documents based on both the company and the time range, significantly improving query performance. Without this index, the query must scan more documents or buckets than necessary, leading to slower execution times.

Now, would be great if you share the Explain results from your aggregation query that you can get using Compass. This will give detailed insights into how MongoDB processes your pipeline, like:

The number of documents scanned and returned at each stage.
Execution time for each stage.
Whether indexes are being used effectively.

With this information, we can help you identify specific bottlenecks and suggest targeted improvements to optimize your query performance.