Atlas Search "text": low performance when running a large number of queries at once

edgar_chen · January 13, 2025, 11:27am

Hi Community!

I’m building a CAT tool (Computer-aided Translation) that involves doing fuzzy matches between an input file and a translation memory (database of previously translated strings). The input file is a list of String documents, and the translation memory contains a list of TranslationUnit documents. My tier is M10 (the basic one). My approach is as follows:

1. Run an aggregation on String and use $lookup to cross-check the key and source.text of each String against TranslationUnit. Documents with the same key and the same source.text are 101% matches; those with a different key but the same source.text are 100% matches. Through the "as" attribute of $lookup, these matches are added to the 101match and 100match fields respectively.
2. Check the array lengths of 101match and 100match on the returned documents. If a length is > 0, add a new field matchScore and assign 101 or 100. If both lengths are 0, assign -1 to matchScore.
3. Since $search doesn't support the $let keyword, I cannot perform fuzzy searches within the same aggregate command. Instead, I select all String documents with "matchScore === -1" and pass each of them to a new aggregate command whose pipeline contains $search. This is where I have to send a very large number of concurrent queries, and the response always times out with 60K String documents (all fuzzy) and 220K TranslationUnit documents.
```
async function findFuzzy(string) {
  const pipeline = [
    {
      $search: {
        // find fuzzy strings in source text
        index: "tmSourceTextOnly",
        text: {
          query: string.source.text,
          path: "source.text",
          // fuzzy: { maxEdits: 1 }
        }
      }
    },
    // check whether these fuzzy sources have a translation in the current target language
    {
      $match: {
        translations: {
          $elemMatch: {
            lang: taskTargetLang
          }
        },
        parentTranslationMemory: {
          $in: projectTranslationMemories
        }
      }
    },
    {
      // 1 is enough for creating analyses
      $limit: 1
    }
  ];
  const result = await TranslationUnit.aggregate(pipeline);
  string['fuzzyMatch'] = result;
  console.log('fuzzy single string result', result);
  return string;
}

const fuzzyPromises = [];

for (const string of allMatches.fuzzyMatch) {
  fuzzyPromises.push(findFuzzy(string));
}
console.log('finding fuzzy matches…', fuzzyPromises.length);
allMatches.fuzzyMatch = await Promise.all(fuzzyPromises);
```
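One way to avoid putting all of these queries in flight at once is to cap how many run concurrently. The helper below is a minimal sketch of that idea, not a confirmed fix for the timeouts. `mapWithConcurrency` and its `limit` parameter are hypothetical names, not part of Mongoose or Atlas Search.

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight at any time. Results keep the order of `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner pulls the next unclaimed item until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  const runners = Array.from({ length: runnerCount }, () => runner());

  // Rejects as soon as any worker call rejects, like Promise.all.
  await Promise.all(runners);
  return results;
}
```

With it, the final `Promise.all` call could become `allMatches.fuzzyMatch = await mapWithConcurrency(allMatches.fuzzyMatch, 20, findFuzzy);`, where 20 is an arbitrary starting value that would need tuning against the cluster.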
In comparison, if the file consists mostly of 101% and 100% matches, the whole process takes only about 20 to 30 seconds.
Indexes and Search Indexes are in place for all related fields, so I assume I’m not missing anything important.
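For reference, the exact-match part of the approach (the $lookup stages and the matchScore assignment) can be sketched roughly as below. This is an illustration under assumptions, not the actual pipeline: `buildExactMatchPipeline` is a hypothetical helper, the TranslationUnit collection name is passed in because it is not stated above, and the use of `$expr` and `$switch` is one possible way to express the comparison.

```javascript
// Hypothetical helper: builds the exact-match stages for String.aggregate().
// `tuCollection` is the TranslationUnit collection name (an assumption,
// since the real name is not given in the post).
function buildExactMatchPipeline(tuCollection) {
  // One $lookup per match type: source.text must be equal, and the key
  // is compared with `keyOperator` ("$eq" for 101%, "$ne" for 100%).
  const lookup = (as, keyOperator) => ({
    $lookup: {
      from: tuCollection,
      let: { key: "$key", text: "$source.text" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$source.text", "$$text"] },
                { [keyOperator]: ["$key", "$$key"] },
              ],
            },
          },
        },
      ],
      as,
    },
  });

  return [
    lookup("101match", "$eq"),
    lookup("100match", "$ne"),
    {
      // 101 if a same-key match exists, else 100 if any match exists, else -1
      $addFields: {
        matchScore: {
          $switch: {
            branches: [
              { case: { $gt: [{ $size: "$101match" }, 0] }, then: 101 },
              { case: { $gt: [{ $size: "$100match" }, 0] }, then: 100 },
            ],
            default: -1,
          },
        },
      },
    },
  ];
}
```

The documents that come back with matchScore equal to -1 are the ones that then go through the per-string $search queries.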

I think my key question here is: is this the right way to use $search? If not, is there a plausible solution for my use case I can look into? I really hope I can achieve this with Atlas Search because of how convenient it is, and if I have to implement fuzzy matching outside of Atlas Search, then the M10 subscription may no longer be worth it.

Thank you!