Avoid collection (index) scan for search index queries: database index query performance optimizations for search indexes

Is it possible to use the $search aggregation pipeline stage to efficiently filter out documents whose field does not contain, for example, a certain ObjectId, and then perform full-text search on other fields? By efficient I mean that the documents with the desired ObjectId in the given field can be found without scanning all indexed data, similar to how a database index on that field would avoid a collection scan.

To better explain my question, I’ll present a simplified scenario.

Consider a stores collection and a products collection. The document schemas for the two collections are as follows:

// Stores schema:
{
    _id: ObjectId,
    name: String
}

// Products schema:
{
    _id: ObjectId,
    name: String,
    store: ObjectId
}

Every product has a name and belongs to a store.

Consider an application where the user is able to choose a store, and then full-text search for products in that store by name.

To achieve this, I’d create the following search index:

{
    collectionName: 'products',
    name: 'productName_index',
    mappings: {
        dynamic: false,
        fields: {
            store: {
                type: "objectId",
            },
            name: [
                { type: "string" },
                { type: "autocomplete" }
            ]
        }
    }
}
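
For completeness, here is how that index could be created from mongosh. This is a sketch assuming a MongoDB version where db.collection.createSearchIndex() is available; on older versions the index would be created through the Atlas UI or API instead:

// Sketch: creating the search index from mongosh
db.products.createSearchIndex(
    "productName_index",
    {
        mappings: {
            dynamic: false,
            fields: {
                store: { type: "objectId" },
                name: [
                    { type: "string" },
                    { type: "autocomplete" }
                ]
            }
        }
    }
)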

And use the following aggregation pipeline to query:

const { ObjectId } = require("mongodb")

// Known store _id (a placeholder here; a real application would use
// the _id of an existing store)
const storeId = new ObjectId()

const searchQuery = "someProductName"

// $search must be the first stage, so the pipeline is an array
// containing the $search stage
const pipeline = [{
    $search: {
        index: "productName_index",
        compound: {
            filter: [
                { equals: {
                    path: "store",
                    query: storeId
                }}
            ],
            should: [
                { text: {
                    path: "name",
                    query: searchQuery
                }},
                { autocomplete: {
                    path: "name",
                    query: searchQuery
                }}
            ],
            minimumShouldMatch: 1
        }
    }
}]
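
For completeness, executing it with the Node.js driver might look like this (a sketch; db is assumed to be an already-connected Db instance):

// Sketch: running the pipeline against the products collection
const results = await db.collection("products")
    .aggregate(pipeline)
    .toArray()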

I think that for this query, all indexed data for the productName_index is scanned.

If instead I were to use a compound database index, { store: 1, name: 1 }, I could use an aggregation pipeline with a $match stage to filter out products that do not belong to the chosen store, without performing a collection scan (sketched below). But then, I would no longer be able to full-text search.
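
For example, something like this (a sketch; the anchored prefix regex is only a stand-in, since a classic index supports nothing comparable to full-text search):

db.products.createIndex({ store: 1, name: 1 })

// The { store: 1, name: 1 } compound index serves this $match
// (equality on store plus an anchored prefix match on name),
// so no collection scan is needed...
db.products.aggregate([
    { $match: { store: storeId, name: /^someProduct/ } }
])
// ...but an anchored prefix match is as close to text search as this gets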

So then, how does it work with search indexes? Would the above query have to check every indexed store field? If so, I’m curious whether it would ever be possible to build a search index that supports this kind of query more efficiently.

Hi @Octavio_Araiza,

Apologies, as I’m a bit confused by the title, where you have stated “collection(index) scan”. Can you clarify what you are referring to here? A collection scan indicates that the mongod had to scan the entire collection, document by document, to identify the results.

In terms of the $search pipeline, the use of the filter clause helps reduce the number of documents that the mongod needs to fetch. As per the Atlas Search Query Performance documentation:

Using a $match aggregation pipeline stage after a $search stage can drastically slow down query results. If possible, design your $search query so that all necessary filtering occurs in the $search stage to remove the need for a $match stage. The Atlas Search compound operator is helpful for queries that require multiple filtering operations.
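
To illustrate, here is a sketch of the two shapes using the fields from your example, the discouraged $match-after-$search form versus filtering inside $search:

// Discouraged: $search must produce candidates across all stores
// before $match discards the ones you don't want
db.products.aggregate([
    { $search: { index: "productName_index",
        text: { path: "name", query: searchQuery } } },
    { $match: { store: storeId } }
])

// Recommended: push the store filter into the $search stage itself
db.products.aggregate([
    { $search: { index: "productName_index",
        compound: {
            filter: [{ equals: { path: "store", query: storeId } }],
            must: [{ text: { path: "name", query: searchQuery } }]
        } } }
])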

Although I understand you’ve noted this is a simplified scenario, I’m wondering, just for context on this topic: is the pipeline you’ve provided executing in an abnormal or unexpected time frame? If so, please provide some further details regarding this if possible.

Regards,
Jason

Regarding “collection(index) scan” in the title: I’m not sure how search index queries work, but I’m assuming that when the filter clause is used (e.g., to only retrieve documents where some field’s value equals some ObjectId), all the indexed data has to be ‘scanned’, which I think would be, in practical terms, similar to a collection scan for a database index.

I understand that the filter clause is more performant than using $match after $search, but I’d like to know whether, for the presented case, all indexed data is ‘scanned’.

The pipeline is not executing in an abnormal or unexpected time frame. There’s no problem with the pipeline itself. I’m just curious about the filter clause :sweat_smile:.


It could possibly be my misunderstanding or interpretation of your statements above, but I do not believe the filter clause works in the same manner as a collection scan. To my knowledge, the _ids of the matching documents returned from the search process are passed to the mongod process for the documents to be retrieved, so a collection scan is not performed here.
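
If you’d like to dig into what a given $search query is doing, one option is to run explain on the pipeline from mongosh; the output includes details of the underlying query executed by the search process (a sketch, and note the exact shape of the explain output can vary by version):

// Sketch: inspecting the $search pipeline's execution
db.products.explain("executionStats").aggregate(pipeline)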

Regards,
Jason