Hi Community!
I’m building a CAT tool (Computer-aided Translation) that involves doing fuzzy matches between an input file and a translation memory (database of previously translated strings). The input file is a list of String documents, and the translation memory contains a list of TranslationUnit documents. My tier is M10 (the basic one). My approach is as follows:
- Run aggregate on String and use $lookup to cross-check `key` and `source.text` of String and TranslationUnit. Documents with the same `key` and `source.text` are 101% matches, and those with a different `key` but the same `source.text` are 100% matches. Through the “as” attribute of $lookup, these matches are added to the 101match and 100match fields respectively.
- Check the array length of 101match and 100match in the returned documents. If the length is > 0, add a new field matchScore and assign 101 or 100. If the length is 0, assign -1 to matchScore.
- Since $search doesn’t support $let variables, I cannot perform the fuzzy searches within the same aggregate command. Instead, I pick out all String documents with “matchScore === -1” and pass each of them to a new aggregate command whose pipeline contains $search. This is where I have to send a very large number of concurrent queries, and the response always times out with 60K String documents (all fuzzy) and 220K TranslationUnit documents.
```
async function findFuzzy(string) {
  let pipeline = [
    {
      $search: {
        // find fuzzy strings in source text
        index: "tmSourceTextOnly",
        text: {
          query: string.source.text,
          path: "source.text",
          // fuzzy: { maxEdits: 1 }
        }
      }
    },
    // check that the fuzzy match has a translation in the current target language
    {
      $match: {
        translations: { $elemMatch: { lang: taskTargetLang } },
        parentTranslationMemory: { $in: projectTranslationMemories }
      }
    },
    // 1 is enough for creating analyses
    { $limit: 1 }
  ];
  const result = await TranslationUnit.aggregate(pipeline);
  string['fuzzyMatch'] = result;
  console.log('fuzzy single string result', result);
  return string;
}

// one promise per unmatched string; all queries are fired at once
let fuzzyPromises = [];
for (let string of allMatches.fuzzyMatch) {
  fuzzyPromises.push(findFuzzy(string));
}
console.log('finding fuzzy matches…', fuzzyPromises.length);
allMatches.fuzzyMatch = await Promise.all(fuzzyPromises);
```
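One variation I could try is capping how many of these queries are in flight at once, instead of handing all 60K promises to Promise.all together. Below is a minimal sketch of that idea. The helper name `mapWithConcurrency` and the limit of 20 are hypothetical and not part of my current code, and I have not verified that this avoids the timeout on M10.

```javascript
// Hypothetical helper (an assumption, not in my current code): run `worker`
// over `items` with at most `limit` calls in flight at any time.
// Results are returned in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has picked up yet

  // Each runner keeps pulling the next unprocessed index until none are left.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start at most `limit` runners and wait for all of them to drain the list.
  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Usage with findFuzzy from above; 20 is an arbitrary starting value, not a tuned one:
// allMatches.fuzzyMatch = await mapWithConcurrency(allMatches.fuzzyMatch, 20, findFuzzy);
```

If one query rejects, the whole call rejects, just as with a plain Promise.all, so error handling would stay the same as in my current code.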
In comparison, if the file consists mostly of 101% and 100% matches, the whole process takes only about 20 to 30 seconds.
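For context, the exact-match part (the first two steps of my approach) looks roughly like the sketch below. It is simplified for illustration: the collection name `translationunits`, the helper name `buildExactMatchPipeline`, and the use of `$switch` are assumptions rather than my exact code.

```javascript
// Builds the aggregation stages for the 101% / 100% matching steps.
// Assumption: "translationunits" is the collection behind the TranslationUnit model.
function buildExactMatchPipeline(tmCollection = "translationunits") {
  // One $lookup stage; sameKey = true gives 101% matches, false gives 100% matches.
  const lookup = (sameKey, as) => ({
    $lookup: {
      from: tmCollection,
      // expose the String document's fields to the sub-pipeline as $$key / $$text
      let: { key: "$key", text: "$source.text" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$source.text", "$$text"] },
                { [sameKey ? "$eq" : "$ne"]: ["$key", "$$key"] }
              ]
            }
          }
        }
      ],
      as
    }
  });

  return [
    lookup(true, "101match"),
    lookup(false, "100match"),
    {
      // 101 wins over 100; -1 marks strings that still need a fuzzy search
      $addFields: {
        matchScore: {
          $switch: {
            branches: [
              { case: { $gt: [{ $size: "$101match" }, 0] }, then: 101 },
              { case: { $gt: [{ $size: "$100match" }, 0] }, then: 100 }
            ],
            default: -1
          }
        }
      }
    }
  ];
}

// Usage (assuming the Mongoose model for String documents):
// const scored = await String.aggregate(buildExactMatchPipeline());
```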
Indexes and Search Indexes are in place for all related fields, so I assume I’m not missing anything important.
I think my key question here is: is this the right way to use $search? If not, is there a plausible solution for my use case I can look into? I really hope I can achieve this with Atlas Search because of how convenient it is, and if I have to implement fuzzy matching outside of Atlas Search, then the M10 subscription may no longer be worth it.
Thank you!