Standard analyzers are slow but keyword analyzers don't serve the purpose

Hello folks, I’m new to this topic, so sorry if this is a stupid question.
I have a field “address” that has lucene.standard as both the index analyzer and the search analyzer.
The problem is that when the search query contains many whitespace-separated terms, it takes a long time.
For the search query

{
  "$search": {
    "index": "address_book",
    "text": {
      "path": "address",
      "query": "John Lake Shore Drive 4237, Texas, USA"
    }
  }
}

Since “USA” and “Texas” are more common terms, it takes a long time to respond.

However, if I try lucene.keyword for both the index analyzer and the search analyzer, I don’t get results if my query is “John Lake”, “John Lake Shore Drive”, or “Shore Drive”.

  1. What can I do in this case? Can I use lucene.keyword as the index analyzer and lucene.standard as the search analyzer to get the results?
  2. Will it be faster and generate results for the above-mentioned cases?
  3. Is using different analyzers not recommended per the documentation? What sorts of problems would I be looking at?

Could you share more about your environment? How many documents are in the collection? What is your index configuration?

Analyzers are chosen to match your use case, and in this case it looks like you’re after word-level matching, and thus would not want to use lucene.keyword but rather lucene.standard and/or lucene.english.

Speed shouldn’t be an issue with any analyzer choice - so let’s drill down on that situation with your index configuration and number of documents. Can you share a sample document or two as well?

Finally, where are you seeing that using different analyzers is not recommended? Most definitely, multiple analyzers (via the multi feature) are quite useful and recommended when different types of matching and relevancy tuning are warranted.
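
As a quick illustration of multi, an index definition can analyze the same field in more than one way - here the sub-field name exact is hypothetical:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "address": {
        "type": "string",
        "analyzer": "lucene.standard",
        "multi": {
          "exact": {
            "type": "string",
            "analyzer": "lucene.keyword"
          }
        }
      }
    }
  }
}

A query can then target the standard-analyzed field with "path": "address", or the keyword-analyzed sub-field with "path": { "value": "address", "multi": "exact" }.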

I have a total of 104,115,949 documents in the collection currently.
Here’s one sample document:

{
  "_id": "664f439c55b5d02b06fe7adb",
  "content": {
    "valid": true,
    "orig_value": "51.6",
    "value": "51.6",
    "page": 0,
    "position": [23, 45, 78, 21]
  },
  "doc_id": "f55b9630d1a2400dbb4492f04a95d7a7",
  "ignore": false,
  "label": "Average",
  "address": "John Lake Shore Drive 4237, Texas, USA",
  "low_confidence": false,
  "f_id": 1233,
  "p_id": 1234001,
  "time_spent": 0,
  "type": "number",
  "f_title": "contact_info",
  "f_type": "line_item",
  "p_title": "Owner",
  "p_type": "Line Items",
  "rows": 3,
  "grid_id": "9f15f4e55cbf4b1cacadcd6fc9573f22"
}

Here’s my index configuration:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "address": {
        "type": "string"
      },
      "label": {
        "analyzer": "lucene.keyword",
        "searchAnalyzer": "lucene.keyword",
        "type": "string"
      }
    }
  },
  "storedSource": {
    "include": [
      "doc_id"
    ]
  }
}

Also, a little bit of side info: I sometimes use filters on the label field in my search query as well. But my query is slow when searching on address alone, let alone with the filter.

On the last question about mixing and matching analyzers not being recommended: I didn’t see many articles or answers in this forum or on Stack Overflow that mix and match different analyzer and searchAnalyzer values. Then I posted my query to ChatGPT (which, I now realize, might have been a mistake).

Some suggestions:

  • Use the token field type for your “label” field - it supports the equals operator and is a cleaner way to do exact filtering than using the keyword analyzer (see the sketch after this list).
  • See Relevant As-You-Type Suggestions Search Solution | MongoDB and the associated GitHub repo (link at the bottom of the article) for some examples of how to build a rich compound.should query that leverages different types of analysis and query types.
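
To illustrate the token suggestion, here is a minimal sketch of the mapping, keeping the address mapping from your configuration:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "address": {
        "type": "string"
      },
      "label": {
        "type": "token"
      }
    }
  }
}

An exact filter can then use the equals operator, e.g. { "equals": { "path": "label", "value": "Average" } } inside compound.filter.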

The slowness you report (can you share some actual numbers?) could be attributed to the text operator and the long query, which turns into a boolean OR of a bunch of term queries underneath. Leveraging the phrase operator in conjunction with text, and perhaps leveraging shingles, would likely help with both speed and relevancy.
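
As a rough sketch of combining the two operators on your address field (the slop value is illustrative and worth tuning):

{
  "$search": {
    "index": "address_book",
    "compound": {
      "should": [
        {
          "phrase": {
            "query": "John Lake Shore Drive 4237, Texas, USA",
            "path": "address",
            "slop": 2
          }
        },
        {
          "text": {
            "query": "John Lake Shore Drive 4237, Texas, USA",
            "path": "address"
          }
        }
      ]
    }
  }
}

The phrase clause boosts documents where the terms appear near each other and in order, while the text clause preserves recall for partial matches.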

My pipeline looks like this:

# org_id and max_aggregation_limit come from the surrounding application code;
# DESCENDING is pymongo.DESCENDING
pipeline = [
    {
        "$search": {
            "index": "address_book",
            "returnStoredSource": True,
            "concurrent": True,
            "text": {
                "query": "Matson Navigation Company INC",
                "path": "address",
                "fuzzy": {"maxEdits": 1},
            },
        }
    },
    {
        "$lookup": {
            "from": "doc_metadata",
            "localField": "doc_id",
            "foreignField": "doc_id",
            "as": "doc",
            "pipeline": [
                {"$match": {"$expr": {"$eq": ["$org_id", org_id]}}},
                {"$project": {"_id": 0, "doc_metadata_id": "$_id", "folder_id": 1}},
            ],
        }
    },
    {"$match": {"doc": {"$ne": []}}},
    {"$sort": {"doc_metadata_id": DESCENDING}},
    {"$limit": max_aggregation_limit},
    {"$group": {"_id": "$doc_id", "doc": {"$first": "$doc"}}},
    {"$project": {"_id": 0, "doc_id": "$_id", "doc.folder_id": 1}},
]

OR

pipeline = [
    {
        "$search": {
            "index": "address_book",
            "returnStoredSource": True,
            "concurrent": True,
            "compound": {
                "must": [
                    {
                        "text": {
                            "query": "Matson Navigation Company INC",
                            "path": "address",
                            "fuzzy": {"maxEdits": 1},
                        }
                    }
                ],
                "filter": [{"text": {"query": "Average", "path": "label"}}],
            },
        }
    },
    {
        "$lookup": {
            "from": "doc_metadata",
            "localField": "doc_id",
            "foreignField": "doc_id",
            "as": "doc",
            "pipeline": [
                {"$match": {"$expr": {"$eq": ["$org_id", org_id]}}},
                {"$project": {"_id": 0, "doc_metadata_id": "$_id", "folder_id": 1}},
            ],
        }
    },
    {"$match": {"doc": {"$ne": []}}},
    {"$sort": {"doc_metadata_id": DESCENDING}},
    {"$limit": max_aggregation_limit},
    {"$group": {"_id": "$doc_id", "doc": {"$first": "$doc"}}},
    {"$project": {"_id": 0, "doc_id": "$_id", "doc.folder_id": 1}},
]

And the first one took 127s to generate results when my limit was 1000.

Appreciate the suggestions, will try those out as well.

Ah, the situation is clear with your full pipeline. Every stage after $search consumes the full set of matching documents. My first recommendation here is to not use a $match after $search - move that matching condition into $search (ideally as a compound.filter). Also, leverage the sort option within $search rather than the $sort stage. And consider using the facet option of $search instead of a $group, if that provides the information you need. Though I see that you’re pulling information from $lookup for sorting - and that’s going to be a performance hit if your result set is large.

The short answer is - do as much in $search as possible, and consider using a $limit immediately after $search so that successive stages are not consuming a large number of documents.
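
For instance, here is a rough sketch of the shape I mean, reusing your second pipeline’s clauses - the limit of 1000 mirrors the number you mentioned, and the org_id condition still needs the $lookup unless the data is remodeled (more on that below):

pipeline = [
    {
        "$search": {
            "index": "address_book",
            "compound": {
                "must": [
                    {"text": {"query": "Matson Navigation Company INC", "path": "address"}}
                ],
                # filter clauses constrain matches without affecting the score
                "filter": [{"text": {"query": "Average", "path": "label"}}],
            },
        }
    },
    # cap the result set before $lookup fans out to the other collection
    {"$limit": 1000},
    # ... $lookup / $group / $project as before, now over at most 1000 docs
]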

Given the additional $lookup information - perhaps it is best to consider data modeling a flatter/pre-joined collection for searching so that a $lookup and $group aren’t needed.

We are basically using the $match stage to achieve the inner join. And the $sort stage after it is something we couldn’t move before the $lookup.

I will try using $facet instead of $group though (that was just to get unique results).

However, for us a query like Matson Navigation doesn’t take much time. But when the queries are longer, like Matson Navigation Company INC, the results are slower (this is why we moved our focus to improving our analyzers). We basically want results that match Matson Navigation Company INC, and not results that match only Company - which is basically what happens with the standard analyzer.
Also, using the keyword analyzer, we don’t get results for Matson Navigation Company because INC is missing.

To clarify, I mean the facet option within $search, not the aggregation stage ($facet) by the same name.

Understood about your desired matching requirements. The text operator ORs all of the terms in the query, so it matches documents containing any of the terms. I think you can get some inspiration from the link to the “as-you-type” solution and the GitHub repo and configuration there. Using the phrase operator is advisable, in conjunction with other types of analysis (via multi) that shingle the text so that you get 2- or 3-word shingle matching. It takes some fiddling to find the sweet spot of analysis and query generation. There isn’t one single operator that will serve your use case best, but rather a combination of several text and phrase operators nested under compound.should across various multi-analyzed fields. Tune there for good matching relevancy, and keep spurious matches from coming through by getting the best ones into the top 10 or so; a $limit can then really help prune when the top results are the best ones.
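
To make the shingling idea concrete, here is a rough sketch of an index definition with a custom 2-3 word shingle analyzer on a multi sub-field. The names address_shingles and shingled are made up for illustration, and the shingle sizes are a starting point to tune, not a recommendation:

{
  "analyzers": [
    {
      "name": "address_shingles",
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        { "type": "lowercase" },
        { "type": "shingle", "minShingleSize": 2, "maxShingleSize": 3 }
      ]
    }
  ],
  "mappings": {
    "dynamic": false,
    "fields": {
      "address": {
        "type": "string",
        "analyzer": "lucene.standard",
        "multi": {
          "shingled": {
            "type": "string",
            "analyzer": "address_shingles"
          }
        }
      }
    }
  }
}

A compound.should clause can then target the shingled sub-field with "path": { "value": "address", "multi": "shingled" }, so that multi-word sequences like Matson Navigation have to match as a unit to score highly.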

The $lookup “inner join” following $search is an interesting data modeling design - it works to a point, but doesn’t scale to larger result sets. Using MongoDB necessitates adapting the document and collection structures to suit the use cases. It is not uncommon for data to be duplicated in different structures, pre-aggregated, etc., to best optimize reads and writes at scale. Your situation seems to be one that could do well with bringing the inner-joined data into the main $search-ed collection (in addition to where it currently resides, to suit other use cases). Food for thought, and sometimes much easier said than done - but worth weighing the pros and cons of the various schema designs.
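
For example, a pre-joined document might look roughly like this - the copied-in fields are placeholders to illustrate the shape, not real values:

{
  "doc_id": "f55b9630d1a2400dbb4492f04a95d7a7",
  "address": "John Lake Shore Drive 4237, Texas, USA",
  "label": "Average",
  "org_id": "<copied from doc_metadata>",
  "doc_metadata_id": "<copied from doc_metadata>",
  "folder_id": "<copied from doc_metadata>"
}

With org_id and doc_metadata_id living in the searched collection, the org condition could become a compound.filter clause, the sort could use the sort option of $search (with doc_metadata_id indexed appropriately), and the $lookup and $group could go away entirely.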
