Language Analyzers
Use language-specific analyzers to create indexes tailored to a particular language. Each language analyzer has built-in stop words and word divisions based on that language's usage patterns.
Atlas Search offers built-in language analyzers for many languages. Most are named after the language they analyze, such as french or italian. The following analyzer names are less self-explanatory:
- cjk is a generic Chinese, Japanese, and Korean analyzer.
- kuromoji is a Japanese analyzer.
- morfologik is a Polish analyzer.
- nori is a Korean analyzer.
- smartcn is a Chinese analyzer.
Examples
Consider a collection named cars with the following documents:
{ "_id": 1, "subject": { "en": "It is better to equip our cars to understand the causes of the accident.", "fr": "Mieux équiper nos voitures pour comprendre les causes d'un accident.", "he": "עדיף לצייד את המכוניות שלנו כדי להבין את הגורמים לתאונה." } }
{ "_id": 2, "subject": { "en": "The best time to do this is immediately after you've filled up with fuel", "fr": "Le meilleur moment pour le faire c'est immédiatement après que vous aurez fait le plein de carburant.", "he": "הזמן הטוב ביותר לעשות זאת הוא מיד לאחר שמילאת דלק." } }
Built-In Language Analyzer Example
The following example index definition specifies an index on the subject.fr field using the french analyzer:
{
  "mappings": {
    "fields": {
      "subject": {
        "fields": {
          "fr": {
            "analyzer": "lucene.french",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  }
}
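To index several language subfields the same way, the definition above can be generated rather than written by hand. The following is a minimal sketch in plain JavaScript. The name buildSubjectIndex is a hypothetical helper, not part of any MongoDB driver, and it only builds the definition object; it does not create the index.

```javascript
// Hypothetical helper: builds a static-mapping index definition that
// assigns one analyzer to each language subfield of "subject".
function buildSubjectIndex(analyzersByField) {
  const fields = {};
  for (const [field, analyzer] of Object.entries(analyzersByField)) {
    fields[field] = { analyzer, type: "string" };
  }
  return {
    mappings: {
      fields: {
        subject: { fields, type: "document" }
      }
    }
  };
}

// Reproduces the definition shown above for subject.fr.
const frenchIndex = buildSubjectIndex({ fr: "lucene.french" });
console.log(JSON.stringify(frenchIndex, null, 2));
```

Passing { en: "lucene.english", fr: "lucene.french" } would produce a single definition that covers both subfields.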
The following query searches for the string pour in the subject.fr field:
db.cars.aggregate([
  {
    $search: {
      "text": {
        "query": "pour",
        "path": "subject.fr"
      }
    }
  },
  {
    $project: {
      "_id": 0,
      "subject.fr": 1
    }
  }
])
The previous query returns no results when using the french analyzer, because pour is a built-in stop word. Using the standard analyzer, the same query would return both documents.
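The effect of stop words can be illustrated without Atlas Search. The sketch below is a simplified simulation, not the lucene.french implementation. It lowercases and splits text on non-letter characters, then drops a small hand-picked stop-word list. That list is an assumption for illustration; the real analyzer uses a larger list and also applies steps such as elision handling and stemming.

```javascript
// Illustrative subset of French stop words (an assumption, not the real list).
const FRENCH_STOP_WORDS = new Set(["le", "les", "de", "un", "pour", "que", "nos"]);

// Lowercase the text and split it on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Simulate an analyzer: tokenize, then remove any configured stop words.
function analyze(text, stopWords = new Set()) {
  return tokenize(text).filter((token) => !stopWords.has(token));
}

// The query is analyzed the same way as the indexed field. A document
// matches when at least one query token also appears among its tokens.
function matches(query, fieldValue, stopWords) {
  const indexed = new Set(analyze(fieldValue, stopWords));
  return analyze(query, stopWords).some((token) => indexed.has(token));
}

const fr = "Mieux équiper nos voitures pour comprendre les causes d'un accident.";

// With stop words, "pour" is removed from both query and field: no match.
console.log(matches("pour", fr, FRENCH_STOP_WORDS));
// Without stop words, "pour" stays in the token list: match.
console.log(matches("pour", fr));
```

Because the stop-word filter runs on the query as well as on the indexed text, a query made up only of stop words produces no tokens and therefore cannot match anything.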
The following query searches for the string carburant in the subject.fr field:
db.cars.aggregate([
  {
    $search: {
      "text": {
        "query": "carburant",
        "path": "subject.fr"
      }
    }
  },
  {
    $project: {
      "_id": 0,
      "subject.fr": 1
    }
  }
])
{ subject: { fr: "Le meilleur moment pour le faire c'est immédiatement après que vous aurez fait le plein de carburant." } }
Atlas Search returns the document with _id: 2 in the results because the query matched a token that the lucene.french analyzer created for the subject.fr field of that document.
Custom Language Analyzer Example
You can also create indexes for unsupported languages by creating a custom analyzer with the icuFolding and stopword token filters.
The following example index definition specifies an index on the subject.he field using a custom analyzer called myHebrewAnalyzer to analyze and create tokens for Hebrew text:
{
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "subject": {
        "fields": {
          "he": {
            "analyzer": "myHebrewAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "myHebrewAnalyzer",
      "tokenFilters": [
        { "type": "icuFolding" },
        {
          "tokens": [ "אן", "שלנו", "זה", "אל" ],
          "type": "stopword"
        }
      ],
      "tokenizer": { "type": "standard" }
    }
  ]
}
The following query searches for the string המכוניות in the subject.he field:
db.cars.aggregate([
  {
    $search: {
      "text": {
        "query": "המכוניות",
        "path": "subject.he"
      }
    }
  },
  {
    $project: {
      "_id": 0,
      "subject.he": 1
    }
  }
])
{ subject: { he: 'עדיף לצייד את המכוניות שלנו כדי להבין את הגורמים לתאונה.' } }
Atlas Search returns the document with _id: 1 in the results because the query matched a token that the myHebrewAnalyzer analyzer created for the subject.he field of that document.
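The behavior of myHebrewAnalyzer can be approximated in plain JavaScript. This is a rough sketch under stated assumptions: a Unicode-aware split stands in for the standard tokenizer, and NFKD normalization with combining-mark removal and lowercasing stands in for icuFolding, which does more in practice. Only the stop-word list is taken from the index definition in this example.

```javascript
// Stop words copied from the myHebrewAnalyzer definition.
const HEBREW_STOP_WORDS = new Set(["אן", "שלנו", "זה", "אל"]);

// Rough stand-in for icuFolding (assumption): decompose, strip combining
// marks, and lowercase.
function fold(token) {
  return token.normalize("NFKD").replace(/\p{M}/gu, "").toLowerCase();
}

// Rough stand-in for the standard tokenizer (assumption): split on runs of
// characters that are not letters, digits, or combining marks. Folding runs
// before stop-word removal, matching the tokenFilters order.
function analyzeHebrew(text) {
  return text
    .split(/[^\p{L}\p{N}\p{M}]+/u)
    .filter(Boolean)
    .map(fold)
    .filter((token) => token.length > 0 && !HEBREW_STOP_WORDS.has(token));
}

const he = "עדיף לצייד את המכוניות שלנו כדי להבין את הגורמים לתאונה.";
const tokens = analyzeHebrew(he);
const queryTokens = analyzeHebrew("המכוניות");

// The query term survives analysis and is present in the field's tokens.
console.log(queryTokens.every((token) => tokens.includes(token)));
// The stop word from the definition is not indexed.
console.log(tokens.includes("שלנו"));
```

In this simulation the query term is kept as a token while the stop word from the definition is dropped, which mirrors why the query above finds the document.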
Multilingual Search Example
You can also create an index that uses multiple language analyzers to perform a multilingual search.
The following example index definition specifies an index with dynamic mapping on the sample_mflix.movies collection. The definition applies the lucene.italian language analyzer to index the fullplot field, and uses the multi option to specify lucene.english as an alternate language analyzer. Atlas Search uses the default lucene.standard analyzer for all other fields that it dynamically indexes in the movies collection.
{
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": true,
    "fields": {
      "fullplot": {
        "type": "string",
        "analyzer": "lucene.italian",
        "multi": {
          "fullplot_english": {
            "type": "string",
            "analyzer": "lucene.english"
          }
        }
      }
    }
  }
}
The following Atlas Search query uses these compound operator clauses to query the collection:
- The must clause searches for movie plots in English and Italian that contain the term Bella using the text operator.
- The mustNot clause excludes movies released between the years 1984 and 2016 using the range operator.
- The should clause specifies a preference for the Comedy genre using the text operator.
[
  {
    $search: {
      "index": "multilingual-tutorial",
      "compound": {
        "must": [{
          "text": {
            "query": "Bella",
            "path": { "value": "fullplot", "multi": "fullplot_english" }
          }
        }],
        "mustNot": [{
          "range": {
            "path": "released",
            "gt": ISODate("1984-01-01T00:00:00.000Z"),
            "lt": ISODate("2016-01-01T00:00:00.000Z")
          }
        }],
        "should": [{
          "text": {
            "query": "Comedy",
            "path": "genres"
          }
        }]
      }
    }
  }
]
SCORE: 3.909510850906372
_id: "573a1397f29313caabce8bad"
plot: "He is a revenge-obssessed stevedore whose sister was brutally raped an…"
genres:
  0: "Drama"
runtime: 137
fullplot: "In Marseilles, a woman commits suicide after she is raped in an alley.…"
released: 1983-05-18T00:00:00.000+00:00

SCORE: 3.4253346920013428
_id: "573a1396f29313caabce5735"
plot: "Giovanna e' una bella ragazza, ma ha qualche problema con gli uomini: …"
genres:
  0: "Comedy"
runtime: 100
fullplot: "Giovanna e' una bella ragazza, ma ha qualche problema con gli uomini: …"
released: 1974-11-15T00:00:00.000+00:00

SCORE: 3.363344430923462
_id: "573a1395f29313caabce13cf"
plot: "Gerardo è un attore o almeno cerca di esserlo, ma il pubblico non è de…"
genres:
  0: "Comedy"
runtime: 95
fullplot: "Gerardo è un attore o almeno cerca di esserlo, ma il pubblico non è de…"
released: 1960-02-10T00:00:00.000+00:00

SCORE: 1.9502882957458496
_id: "573a1396f29313caabce5299"
plot: "Dr Tremayne is an enigmatic Psychiatrist running a Futuristic asylum h…"
genres:
  0: "Horror"
runtime: 90
fullplot: "Dr Tremayne is an enigmatic Psychiatrist running a Futuristic asylum h…"
released: 1973-10-31T00:00:00.000+00:00
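The query in this example can also be assembled from parameters, which makes the role of each clause easier to see. The sketch below is plain JavaScript. The name buildMultilingualSearch is a hypothetical helper, and the code uses Date objects where the example uses ISODate, on the assumption that the pipeline would be passed through a driver rather than typed into the shell.

```javascript
// Hypothetical helper that assembles the $search stage from this example.
function buildMultilingualSearch({ term, genre, excludeFrom, excludeTo }) {
  return [
    {
      $search: {
        index: "multilingual-tutorial",
        compound: {
          // Required: the term must match the fullplot field through the
          // fullplot_english multi path.
          must: [{
            text: {
              query: term,
              path: { value: "fullplot", multi: "fullplot_english" }
            }
          }],
          // Excluded: documents released strictly between the two dates.
          mustNot: [{
            range: { path: "released", gt: excludeFrom, lt: excludeTo }
          }],
          // Optional: a match here is a preference, not a requirement.
          should: [{
            text: { query: genre, path: "genres" }
          }]
        }
      }
    }
  ];
}

const pipeline = buildMultilingualSearch({
  term: "Bella",
  genre: "Comedy",
  excludeFrom: new Date("1984-01-01T00:00:00.000Z"),
  excludeTo: new Date("2016-01-01T00:00:00.000Z")
});
console.log(JSON.stringify(pipeline, null, 2));
```

The results above are consistent with these clauses: every returned movie was released before 1984, and the Drama and Horror entries show that the should clause does not filter documents out.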