Token Filters
On this page
- asciiFolding
- daitchMokotoffSoundex
- edgeGram
- englishPossessive
- flattenGraph
- icuFolding
- icuNormalizer
- kStemming
- length
- lowercase
- nGram
- porterStemming
- regex
- reverse
- shingle
- snowballStemming
- spanishPluralStemming
- stempel
- stopword
- trim
- wordDelimiterGraph
A token filter performs operations such as the following:
Stemming, which reduces related words, such as "talking", "talked", and "talks" to their root word "talk".
Redaction, the removal of sensitive information from public documents.
Token filters always require a type field, and some take additional options as well. A token filter has the following syntax:
"tokenFilters": [ { "type": "<token-filter-type>", "<additional-option>": <value> } ]
The following sample index definitions and queries use the sample
collection named minutes
. If you add
the minutes
collection to a database in your Atlas cluster,
you can create the following sample indexes from the Atlas Search Visual Editor
or JSON Editor in the Atlas UI and run the sample queries against
this collection. To create these indexes, after you select your preferred
configuration method in the Atlas UI, select the database and
collection, and refine your index to add custom analyzers that use
token filters.
Note
When you add a custom analyzer using the Visual Editor in the Atlas UI, the Atlas UI displays the following details about the analyzer in the Custom Analyzers section.
UI Element | Description |
---|---|
Name | Label that identifies the custom analyzer. |
Used In | Fields that use the custom analyzer. Value is None if the custom analyzer isn't used to analyze any fields. |
Character Filters | Atlas Search character filters configured in the custom analyzer. |
Tokenizer | Atlas Search tokenizer configured in the custom analyzer. |
Token Filters | Atlas Search token filters configured in the custom analyzer. |
Actions | Clickable icons that indicate the actions that you can perform on the custom analyzer. |
asciiFolding
The asciiFolding
token filter converts alphabetic, numeric, and
symbolic Unicode characters that are not in the Basic Latin Unicode
block
to their ASCII equivalents, if available.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be asciiFolding . |
originalTokens | string | no | String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include , to include the original tokens with the converted tokens, or omit , to include only the converted tokens. Default: omit |
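For example, a minimal sketch of an asciiFolding filter that keeps the original tokens alongside their folded equivalents:
"tokenFilters": [ { "type": "asciiFolding", "originalTokens": "include" } ]
With this configuration, a token such as Siân would yield both Siân and Sian as searchable terms.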
Example
The following index definition indexes the
page_updated_by.first_name
field in the minutes collection using a custom analyzer named
asciiConverter
. The custom analyzer specifies the following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the
asciiFolding
token filter to convert the field values to their ASCII equivalent.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
asciiConverter
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select asciiFolding from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.first_name field.
Select page_updated_by.first_name from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
asciiConverter
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "dynamic": false, "fields": { "page_updated_by": { "type": "document", "dynamic": false, "fields": { "first_name": { "type": "string", "analyzer": "asciiConverter" } } } } }, "analyzers": [ { "name": "asciiConverter", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "asciiFolding" } ] } ] }
The following query searches the first_name
field in the
minutes collection for names
using their ASCII equivalent.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "Sian", "path": "page_updated_by.first_name" } } }, { "$project": { "_id": 1, "page_updated_by.last_name": 1, "page_updated_by.first_name": 1 } } ])
[ { _id: 1, page_updated_by: { last_name: 'AUERBACH', first_name: 'Siân' } } ]
Atlas Search returns the document with _id: 1
in the results because Atlas Search
created the following tokens (searchable terms) for the
page_updated_by.first_name
field in the document, which it then
used to match to the query term Sian
:
Field Name | Output Tokens |
---|---|
page_updated_by.first_name | Sian |
daitchMokotoffSoundex
The daitchMokotoffSoundex
token filter creates tokens for words
that sound the same based on the Daitch-Mokotoff Soundex
phonetic algorithm. This filter can generate multiple encodings for
each input, where each encoded token is a six-digit number.
Note
Don't use the daitchMokotoffSoundex
token filter in:
Synonym or autocomplete type mapping definitions.
Operators where fuzzy is enabled. Atlas Search supports the fuzzy option for the autocomplete and text operators.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be daitchMokotoffSoundex . |
originalTokens | string | no | String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following: include , to include the original tokens with the encoded tokens, or omit , to include only the encoded tokens. Default: include |
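For example, a minimal sketch of this filter configured to output only the six-digit encodings and drop the original tokens:
"tokenFilters": [ { "type": "daitchMokotoffSoundex", "originalTokens": "omit" } ]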
Example
The following index definition indexes the page_updated_by.last_name
field in the minutes collection using
a custom analyzer named dmsAnalyzer
. The custom analyzer specifies
the following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the
daitchMokotoffSoundex
token filter to encode the tokens for words that sound the same.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
dmsAnalyzer
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select daitchMokotoffSoundex from the dropdown and select the value shown in the following table for the originalTokens field:
Field | Value |
---|---|
originalTokens | include |
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.last_name field.
Select page_updated_by.last_name from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
dmsAnalyzer
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "dynamic": false, "fields": { "page_updated_by": { "type": "document", "dynamic": false, "fields": { "last_name": { "type": "string", "analyzer": "dmsAnalyzer" } } } } }, "analyzers": [ { "name": "dmsAnalyzer", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "daitchMokotoffSoundex", "originalTokens": "include" } ] } ] }
The following query searches for terms that sound similar to
AUERBACH
in the page_updated_by.last_name
field of
the minutes collection.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "AUERBACH", "path": "page_updated_by.last_name" } } }, { "$project": { "_id": 1, "page_updated_by.last_name": 1 } } ])
[ { "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH" } } { "_id" : 2, "page_updated_by" : { "last_name" : "OHRBACH" } } ]
Atlas Search returns documents with _id: 1
and _id: 2
because the
terms in both documents are phonetically similar, and are coded using
the same six-digit numbers (097400
and 097500
). The following
table shows the tokens (searchable terms and six digit encodings)
that Atlas Search creates for the documents in the results:
Document ID | Output Tokens |
---|---|
"_id": 1 | AUERBACH , 097400 , 097500 |
"_id": 2 | OHRBACH , 097400 , 097500 |
edgeGram
The edgeGram
token filter tokenizes input from the left side, or
"edge", of a text input into n-grams of configured sizes.
Note
Typically, token filters operate like a pipeline, with each input token
yielding no more than one output token that is then passed to the next token filter.
The edgeGram
token filter, by contrast, is a graph-producing filter that yields
multiple output tokens from a single input token.
Because synonym and autocomplete field type mapping definitions only work when used with non-graph-producing token filters, you can't use the edgeGram token filter in synonym or autocomplete field type mapping definitions.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be edgeGram . |
minGram | integer | yes | Number that specifies the minimum length of generated n-grams.
Value must be less than or
equal to maxGram . |
maxGram | integer | yes | Number that specifies the maximum length of generated n-grams.
Value must be greater than or
equal to minGram . |
termNotInBounds | string | no | String that specifies whether to index tokens shorter than minGram or longer than maxGram . Value can be one of the following: include or omit .
If include , Atlas Search indexes the out-of-bounds tokens as-is rather than generating n-grams from them. If omit , Atlas Search doesn't index those tokens. Default: omit |
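For example, a minimal sketch of an edgeGram filter that generates 4 to 7 character n-grams and also indexes out-of-bounds tokens as-is:
"tokenFilters": [ { "type": "edgeGram", "minGram": 4, "maxGram": 7, "termNotInBounds": "include" } ]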
Example
The following index definition indexes the title
field in the
minutes collection using a custom
analyzer named titleAutocomplete
. The custom analyzer specifies
the following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
icuFolding
token filter to apply character foldings to the tokens.
edgeGram
token filter to create 4 to 7 character long tokens from the left side.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
titleAutocomplete
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select icuFolding from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select edgeGram from the dropdown and type the value shown in the following table for the fields:
Field | Value |
---|---|
minGram | 4 |
maxGram | 7 |
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
titleAutocomplete
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "titleAutocomplete", "mappings": { "dynamic": false, "fields": { "title": { "type": "string", "analyzer": "titleAutocomplete" } } }, "analyzers": [ { "name": "titleAutocomplete", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" }, { "type": "edgeGram", "minGram": 4, "maxGram": 7 } ] } ] }
The following query searches the title
field of the minutes collection for terms that begin with
mee
, followed by any number of other characters.
db.minutes.aggregate([ { "$search": { "wildcard": { "query": "mee*", "path": "title", "allowAnalyzedField": true } } }, { "$project": { "_id": 1, "title": 1 } } ])
[ { _id: 1, title: "The team's weekly meeting" }, { _id: 3, title: 'The regular board meeting' } ]
Atlas Search returns documents with _id: 1
and _id: 3
because the
documents contain the term meeting
, which matches the query
criteria. Specifically, Atlas Search creates the following 4 to 7 character
tokens (searchable terms) for the documents in the results, which it
then matches to the query term mee*
:
Document ID | Output Tokens |
---|---|
"_id": 1 | team , team' , team's , week , weekl ,
weekly , meet , meeti , meetin , meeting |
"_id": 3 | regu , regul , regula , regular , boar ,
board , meet , meeti , meetin , meeting |
englishPossessive
The englishPossessive
token filter removes possessives
(trailing 's
) from words.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be englishPossessive . |
Example
The following index definition indexes the title
field in the
minutes collection using a custom
analyzer named englishPossessiveStemmer
. The custom analyzer
specifies the following:
Apply the standard tokenizer to create tokens (search terms) based on word break rules.
Apply the englishPossessive token filter to remove possessives (trailing
's
) from the tokens.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
englishPossessiveStemmer
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select englishPossessive from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
englishPossessiveStemmer
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "title": { "type": "string", "analyzer": "englishPossessiveStemmer" } } }, "analyzers": [ { "name": "englishPossessiveStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "englishPossessive" } ] } ] }
The following query searches the title
field in the
minutes collection for the term
team
.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "team", "path": "title" } } }, { "$project": { "_id": 1, "title": 1 } } ])
[ { _id: 1, title: "The team's weekly meeting" }, { _id: 2, title: 'The check-in with sales team' } ]
Atlas Search returns results that contain the term team
in the
title
field. Atlas Search returns the document with _id: 1
because
Atlas Search transforms team's
in the title
field to the token
team
during analysis. Specifically, Atlas Search creates the following
tokens (searchable terms) for the documents in the results, which it
then matches to the query term:
Document ID | Output Tokens |
---|---|
"_id": 1 | The , team , weekly , meeting |
"_id": 2 | The , check , in , with , sales , team |
flattenGraph
The flattenGraph
token filter transforms a token filter graph into
a flat form suitable for indexing. If you use the
wordDelimiterGraph token filter, use
this filter after the wordDelimiterGraph token filter.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be flattenGraph . |
Example
The following index definition indexes the message
field in the
minutes collection using a custom
analyzer called wordDelimiterGraphFlatten
. The custom analyzer
specifies the following:
Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Apply the following filters to the tokens:
wordDelimiterGraph token filter to split tokens into sub-words, generate tokens for the original words, and protect the word SIGN_IN from being split.
flattenGraph token filter to flatten the token graph into a flat form.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
wordDelimiterGraphFlatten
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select wordDelimiterGraph from the dropdown and configure the following fields for the token filter.
Select the following fields:
Field | Value |
---|---|
delimiterOptions.generateWordParts | true |
delimiterOptions.preserveOriginal | true |
Type SIGN_IN in the protectedWords.words field.
Deselect protectedWords.ignoreCase.
Click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select flattenGraph from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
wordDelimiterGraphFlatten
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "message": { "type": "string", "analyzer": "wordDelimiterGraphFlatten" } } }, "analyzers": [ { "name": "wordDelimiterGraphFlatten", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "wordDelimiterGraph", "delimiterOptions" : { "generateWordParts" : true, "preserveOriginal" : true }, "protectedWords": { "words": [ "SIGN_IN" ], "ignoreCase": false } }, { "type": "flattenGraph" } ] } ] }
The following query searches the message
field in the
minutes collection for the term
sign
.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "sign", "path": "message" } } }, { "$project": { "_id": 1, "message": 1 } } ])
[ { _id: 3, message: 'try to sign-in' } ]
Atlas Search returns the document with _id: 3
in the results for the
query term sign
even though the document contains the
hyphenated term sign-in
in the message
field. The
wordDelimiterGraph token
filter creates a token filter graph and the flattenGraph token filter transforms the token filter
graph into a flat form suitable for indexing. Specifically, Atlas Search
creates the following tokens (searchable terms) for the document in
the results, which it then matches to the query term sign
:
Document ID | Output Tokens |
---|---|
_id: 3 | try , to , sign-in , sign , in |
icuFolding
The icuFolding
token filter applies character folding from Unicode
Technical Report #30
such as accent removal, case folding, canonical duplicates folding, and
many others detailed in the report.
Attributes
It has the following attribute:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be icuFolding . |
Example
The following index definition indexes the text.sv_FI
field in
the minutes collection using a
custom analyzer named diacriticFolder
. The custom analyzer
specifies the following:
Apply the keyword tokenizer to tokenize all the terms in the string field as a single term.
Use the
icuFolding
token filter to apply foldings such as accent removal, case folding, canonical duplicates folding, and so on.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
diacriticFolder
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select icuFolding from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.sv_FI nested field.
Select text.sv_FI nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
diacriticFolder
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "diacriticFolder", "mappings": { "fields": { "text": { "type": "document", "fields": { "sv_FI": { "analyzer": "diacriticFolder", "type": "string" } } } } }, "analyzers": [ { "name": "diacriticFolder", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "icuFolding" } ] } ] }
The following query uses the wildcard operator to search
the text.sv_FI
field in the minutes collection for all terms that contain the
term avdelning
, preceded and followed by any number of other
characters.
db.minutes.aggregate([ { "$search": { "index": "default", "wildcard": { "query": "*avdelning*", "path": "text.sv_FI", "allowAnalyzedField": true } } }, { "$project": { "_id": 1, "text.sv_FI": 1 } } ])
[ { _id: 1, text: { sv_FI: 'Den här sidan behandlar avdelningsmöten' } }, { _id: 2, text: { sv_FI: 'Först talade chefen för försäljningsavdelningen' } } ]
Atlas Search returns the documents with _id: 1
and _id: 2
in the results
because the documents contain the query term avdelning
followed by
other characters in the document with _id: 1
and preceded and
followed by other characters in the document with _id: 2
.
Specifically, Atlas Search creates the following tokens for the documents in
the results, which it then matches to the query term *avdelning*
.
Document ID | Output Tokens |
---|---|
_id: 1 | den har sidan behandlar avdelningsmoten |
_id: 2 | forst talade chefen for forsaljningsavdelningen |
icuNormalizer
The icuNormalizer
token filter normalizes tokens using a standard
Unicode Normalization Mode.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be icuNormalizer . |
normalizationForm | string | no | Normalization form to apply. Accepted values are: nfd (Canonical Decomposition), nfc (Canonical Decomposition, followed by Canonical Composition), nfkd (Compatibility Decomposition), and nfkc (Compatibility Decomposition, followed by Canonical Composition).
To learn more about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15. Default: nfc |
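For example, a minimal sketch of this filter configured for Compatibility Decomposition instead of the default form:
"tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfkd" } ]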
Example
The following index definition indexes the message
field in the
minutes collection using a custom
analyzer named textNormalizer
. The custom analyzer specifies
the following:
Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Use the
icuNormalizer
token filter to normalize tokens by Compatibility Decomposition, followed by Canonical Composition.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
textNormalizer
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select icuNormalizer from the dropdown and select
nfkc
from the normalizationForm dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
textNormalizer
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "textNormalizer", "mappings": { "fields": { "message": { "type": "string", "analyzer": "textNormalizer" } } }, "analyzers": [ { "name": "textNormalizer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfkc" } ] } ] }
The following query searches the message
field in the
minutes collection for the term
1
.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "1", "path": "message" } } }, { "$project": { "_id": 1, "message": 1 } } ])
[ { _id: 2, message: 'do not forget to SIGN-IN. See ① for details.' } ]
Atlas Search returns the document with _id: 2
in the results for the
query term 1
even though the document contains the circled number
①
in the message
field because the icuNormalizer
token
filter creates the token 1
for this character using the nfkc
normalization form. The following table shows the tokens (searchable
terms) that Atlas Search creates for the document in the results using the
nfkc
normalization form and by comparison, the tokens it creates
for the other normalization forms.
Normalization Form | Output Tokens | Matches 1 |
---|---|---|
nfd | do , not , forget , to , SIGN-IN. , See , ① , for , details. | X |
nfc | do , not , forget , to , SIGN-IN. , See , ① , for , details. | X |
nfkd | do , not , forget , to , SIGN-IN. , See , 1 , for , details. | √ |
nfkc | do , not , forget , to , SIGN-IN. , See , 1 , for , details. | √ |
kStemming
The kStemming
token filter combines algorithmic stemming with a
built-in dictionary for the English language to stem words. It expects
lowercase text and doesn't modify uppercase text.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be kStemming . |
Example
The following index definition indexes the text.en_US
field in
the minutes collection using a
custom analyzer named kStemmer
. The custom analyzer specifies the
following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
lowercase token filter to convert the tokens to lowercase.
kStemming token filter to stem the words.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
kStemmer
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select kStemming from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
kStemmer
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "kStemmer", "mappings": { "dynamic": true }, "analyzers": [ { "name": "kStemmer", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "kStemming" } ] } ] }
The following query searches the text.en_US
field in the
minutes collection for the term
Meeting
.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "Meeting", "path": "text.en_US" } } }, { "$project": { "_id": 1, "text.en_US": 1 } } ])
[ { _id: 1, text: { en_US: '<head> This page deals with department meetings. </head>' } } ]
Atlas Search returns the document with _id: 1
, which contains the plural
term meetings
in lowercase. Atlas Search matches the query term to the
document because the lowercase token filter
normalizes token text to lowercase and the kStemming token filter lets Atlas Search match the plural
meetings
in the text.en_US
field of the document to the
singular query term. Atlas Search also analyzes the query term using the
index analyzer (or if specified, using the searchAnalyzer
).
Specifically, Atlas Search creates the following tokens (searchable terms)
for the document in the results, which it then uses to match to the
query term:
head , this , page , deal , with , department , meeting , head
length
The length
token filter removes tokens that are too short or too
long.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be length . |
min | integer | no | Number that specifies the minimum length of a token. Value must be less than or equal to max . Default: 0 |
max | integer | no | Number that specifies the maximum length of a token. Value must be greater than or equal to min . Default: 255 |
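For example, a minimal sketch of a length filter that drops tokens shorter than 5 characters or longer than 20 (both bounds are illustrative):
"tokenFilters": [ { "type": "length", "min": 5, "max": 20 } ]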
Example
The following index definition indexes the text.sv_FI
field in
the minutes collection using a
custom analyzer named longOnly
. The custom analyzer specifies the
following:
Use the standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
icuFolding token filter to apply character foldings.
length
token filter to index only tokens that are at least 20 UTF-16 code units long after tokenizing.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
longOnly
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select icuFolding from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select length from the dropdown and configure the following field for the token filter:
Field | Value |
---|---|
min | 20 |
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.sv.FI nested field.
Select text.sv.FI nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
longOnly
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "text": { "type": "document", "dynamic": true, "fields": { "sv_FI": { "type": "string", "analyzer": "longOnly" } } } } }, "analyzers": [ { "name": "longOnly", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuFolding" }, { "type": "length", "min": 20 } ] } ] }
The following query searches the text.sv_FI
field in the
minutes collection for the term
forsaljningsavdelningen
.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "forsaljningsavdelningen", "path": "text.sv_FI" } } }, { "$project": { "_id": 1, "text.sv_FI": 1 } } ])
[ { _id: 2, text: { sv_FI: 'Först talade chefen för försäljningsavdelningen' } } ]
Atlas Search returns the document with _id: 2
, which contains the term
försäljningsavdelningen
. Atlas Search matches the document to the query
term because the term has more than 20 characters. Additionally,
although the query term forsaljningsavdelningen
doesn't include
the diacritic characters, Atlas Search matches the query term to the
document by folding the diacritics in the original term in the
document. Specifically, Atlas Search creates the following tokens
(searchable terms) for the document with _id: 2
.
forsaljningsavdelningen
Atlas Search won't return any results for a search for any other term in the
text.sv_FI
field in the collection because all other terms in the
field have fewer than 20 characters.
lowercase
The lowercase
token filter normalizes token text to lowercase.
Attributes
It has the following attribute:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be lowercase . |
Examples
The following example index definition indexes the title
field in the minutes
collection as type autocomplete with the nGram
tokenization
strategy. It applies a custom analyzer named keywordLowerer
on the title
field. The custom analyzer specifies the
following:
Apply keyword tokenizer to create a single token for a string or array of strings.
Apply the
lowercase
token filter to convert token text to lowercase.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
keywordLowerer
in the Analyzer Name field.Expand Tokenizer if it's collapsed and select the keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and Autocomplete from the Data Type dropdown.
In the properties section for the data type, select the following values from the dropdown for the property:
Property Name | Value |
---|---|
Analyzer | keywordLowerer |
Tokenization | nGram |
Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "fields": { "title": { "analyzer": "keywordLowerer", "tokenization": "nGram", "type": "autocomplete" } } }, "analyzers": [ { "name": "keywordLowerer", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "lowercase" } ] } ] }
The following query searches the title
field using the
autocomplete operator for the
characters standup
.
db.minutes.aggregate([ { "$search": { "index": "default", "autocomplete": { "query": "standup", "path": "title" } } }, { "$project": { "_id": 1, "title": 1 } } ])
[ { _id: 4, title: 'The daily huddle on tHe StandUpApp2' } ]
Atlas Search returns the document with _id: 4
in the results
because the document contains the query term standup
. Atlas Search
creates tokens for the title
field using the keyword
tokenizer, lowercase
token filter, and the nGram
tokenization strategy for the autocomplete type. Specifically, Atlas Search uses
the keyword
tokenizer to tokenize the entire string as a
single token, which supports only exact matches on the entire
string, and then it uses the lowercase
token filter to
convert the tokens to lowercase. For the document in the
results, Atlas Search creates the following token using the custom
analyzer:
Document ID | Output Tokens |
---|---|
_id: 4 | the daily huddle on the standupapp2 |
After applying the custom analyzer, Atlas Search creates further
tokens of n-grams because Atlas Search indexes the title
field as
the autocomplete type as
specified in the index definition. Atlas Search uses the tokens of
n-grams, which includes a token for standup
, to match the
document to the query term standup
.
The following index definition indexes the message
field in the
minutes collection using a custom
analyzer named lowerCaser
. The custom analyzer specifies the
following:
Apply standard tokenizer to create tokens based on word break rules.
Apply the following filters on the tokens:
icuNormalizer to normalize the tokens using a standard Unicode Normalization Mode.
lowercase
token filter to convert token text to lowercase.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
lowerCaser
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select icuNormalizer from the dropdown and then select
nfkd
from the normalizationForm dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select lowercase from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
lowerCaser
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
{ "mappings": { "fields": { "message": { "type": "string", "analyzer": "lowerCaser" } } }, "analyzers": [ { "name": "lowerCaser", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "icuNormalizer", "normalizationForm": "nfkd" }, { "type": "lowercase" } ] } ] }
The following query searches the message
field for the term
sign-in
.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "sign-in", "path": "message" } } }, { "$project": { "_id": 1, "message": 1 } } ])
[ { _id: 1, message: 'try to siGn-In' }, { _id: 3, message: 'try to sign-in' }, { _id: 2, message: 'do not forget to SIGN-IN. See ① for details.' } ]
Atlas Search returns the documents with _id: 1
, _id: 3
, and _id:
2
in the results for the query term sign-in
because the standard tokenizer first creates separate tokens by
splitting the text, including the hyphenated word, while retaining the
original letter case in the document, and then the lowercase token
filter converts the tokens to lowercase. Atlas Search also analyzes the
query term using the index analyzer (or if specified, using the
searchAnalyzer
) to split the query term and match it to the
document.
Document ID | Output Tokens |
---|---|
_id: 1 | try , to , sign , in |
_id: 3 | try , to , sign , in |
_id: 2 | do , not , forget , to , sign , in , see , for , details |
nGram
The nGram
token filter tokenizes input into n-grams of configured
sizes. You can't use the nGram token filter in
synonym or autocomplete mapping definitions.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type. Value must be nGram . |
minGram | integer | yes | Number that specifies the minimum length of generated n-grams.
Value must be less than or
equal to maxGram . |
maxGram | integer | yes | Number that specifies the maximum length of generated n-grams.
Value must be greater than or
equal to minGram . |
termNotInBounds | string | no | String that specifies whether to index tokens shorter than minGram or longer than maxGram . Value can be one of the following: include or omit .
If include , Atlas Search indexes the out-of-bounds tokens as-is rather than generating n-grams from them. If omit , Atlas Search doesn't index those tokens. Default: omit |
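For example, a minimal sketch of an nGram filter that, unlike the following example, also indexes tokens that fall outside the 4 to 7 character bounds as-is:
"tokenFilters": [ { "type": "nGram", "minGram": 4, "maxGram": 7, "termNotInBounds": "include" } ]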
Example
The following index definition indexes the title
field in the
minutes collection using the custom
analyzer named titleAutocomplete
. The custom analyzer function
specifies the following:
Apply the standard tokenizer to create tokens based on the word break rules.
Apply a series of token filters on the tokens:
englishPossessive
to remove possessives (trailing 's ) from words.
nGram
to create n-grams 4 to 7 characters in length.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
titleAutocomplete
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select englishPossessive from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select nGram from the dropdown and configure the following fields for the token filter:
Field | Value |
---|---|
minGram | 4 |
maxGram | 7 |
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
titleAutocomplete
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "title": { "type": "string", "analyzer": "titleAutocomplete" } } }, "analyzers": [ { "name": "titleAutocomplete", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "englishPossessive" }, { "type": "nGram", "minGram": 4, "maxGram": 7 } ] } ] }
The following query uses the wildcard operator to search
the title
field in the minutes
collection for the term meet
followed by any number of other
characters after the term.
db.minutes.aggregate([ { "$search": { "index": "default", "wildcard": { "query": "meet*", "path": "title", "allowAnalyzedField": true } } }, { "$project": { "_id": 1, "title": 1 } } ])
[ { _id: 1, title: "The team's weekly meeting" }, { _id: 3, title: 'The regular board meeting' } ]
Atlas Search returns the documents with _id: 1
and _id: 3
because
the documents contain the term meeting
, which Atlas Search matches to
the query criteria meet*
by creating the following tokens
(searchable terms).
Document ID | Output Tokens |
---|---|
_id: 1 | team , week , weekl , weekly , eekl ,
eekly , ekly , meet , meeti , meetin ,
meeting , eeti , eetin , eeting , etin ,
eting , ting |
_id: 3 | regu , regul , regula , regular , egul ,
egula , egular , gula , gular , ular ,
boar , board , oard , meet , meeti , meetin ,
meeting , eeti , eetin , eeting , etin ,
eting , ting |
Note
Atlas Search doesn't create tokens for terms shorter than 4 characters (such as the ) or longer than 7 characters because the termNotInBounds parameter is set to omit by default. If you set the termNotInBounds parameter to include , Atlas Search would also create tokens for the term the .
porterStemming
The porterStemming
token filter uses the Porter stemming algorithm
to remove the common morphological and inflectional suffixes from
words in English. It expects lowercase text and doesn't work as
expected for uppercase text.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be porterStemming . |
Example
The following index definition indexes the title
field in the
minutes collection using a custom
analyzer named porterStemmer
. The custom analyzer specifies the
following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the following token filters on the tokens:
lowercase token filter to convert the words to lowercase.
porterStemming token filter to remove the common morphological and inflectional suffixes from the words.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
porterStemmer
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select porterStemming from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
porterStemmer
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "title": { "type": "string", "analyzer": "porterStemmer" } } }, "analyzers": [ { "name": "porterStemmer", "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "porterStemming" } ] } ] }
The following query searches the title
field in the minutes collection for the term Meet
.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "Meet", "path": "title" } } }, { "$project": { "_id": 1, "title": 1 } } ])
[ { _id: 1, title: "The team's weekly meeting" }, { _id: 3, title: 'The regular board meeting' } ]
Atlas Search returns the documents with _id: 1
and _id: 3
because
the lowercase token filter normalizes
token text to lowercase and then the porterStemming
token filter
stems the morphological suffix from the meeting
token to create
the meet
token, which Atlas Search matches to the query term Meet
.
Specifically, Atlas Search creates the following tokens (searchable terms)
for the documents in the results, which it then matches to the query
term Meet
:
Document ID | Output Tokens |
---|---|
_id: 1 | the , team' , weekli , meet |
_id: 3 | the , regular , board , meet |
regex
The regex
token filter applies a regular expression to each token,
replacing matches with a specified string.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter. Value must be regex . |
pattern | string | yes | Regular expression pattern to apply to each token. |
replacement | string | yes | Replacement string to substitute wherever a matching pattern occurs. If you specify an empty string ("" ), the token filter removes the matched pattern from the token. |
matches | string | yes | Acceptable values are: all and first .
If matches is set to all , replace all matching patterns. Otherwise, replace only the first matching pattern. |
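For example, a minimal sketch of a regex filter that replaces only the first run of digits in each token with a # character (the pattern and replacement are illustrative):
"tokenFilters": [ { "type": "regex", "pattern": "[0-9]+", "replacement": "#", "matches": "first" } ]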
Example
The following index definition indexes the page_updated_by.email
field in the minutes collection
using a custom analyzer named emailRedact
. The custom analyzer
specifies the following:
Apply the keyword tokenizer to index all words in the field value as a single term.
Apply the following token filters on the tokens:
lowercase token filter to turn uppercase characters in the tokens to lowercase.
regex
token filter to find strings that look like email addresses in the tokens and replace them with the word redacted .
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
emailRedact
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select regex from the dropdown and configure the following for the token filter:
Type
^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$
in the pattern field.
Type
redacted
in the replacement field.
Select
all
from the matches dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email nested field.
Select page_updated_by.email nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
emailRedact
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "lucene.standard", "mappings": { "dynamic": false, "fields": { "page_updated_by": { "type": "document", "fields": { "email": { "type": "string", "analyzer": "emailRedact" } } } } }, "analyzers": [ { "charFilters": [], "name": "emailRedact", "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "lowercase" }, { "matches": "all", "pattern": "^([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,5})$", "replacement": "redacted", "type": "regex" } ] } ] }
The following query searches the page_updated_by.email
field
in the minutes collection using the
wildcard operator for the term example.com
preceded by
any number of other characters.
db.minutes.aggregate([ { "$search": { "index": "default", "wildcard": { "query": "*example.com", "path": "page_updated_by.email", "allowAnalyzedField": true } } }, { "$project": { "_id": 1, "page_updated_by.email": 1 } } ])
Atlas Search doesn't return any results for the query although the
page_updated_by.email
field contains the word example.com
in
the email addresses. The custom analyzer replaces tokens that match the regular
expression with the word redacted
, so Atlas Search doesn't match the query term to any document.
reverse
The reverse
token filter reverses each string token.
Attributes
It has the following attribute:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter.
Value must be reverse . |
Example
The following index definition indexes the page_updated_by.email
fields in the minutes collection
using a custom analyzer named keywordReverse
. The custom analyzer
specifies the following:
Apply the keyword tokenizer to tokenize entire strings as single terms.
Apply the
reverse
token filter to reverse the string tokens.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
keywordReverse
in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select reverse from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email nested field.
Select page_updated_by.email nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
keywordReverse
from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "lucene.keyword", "mappings": { "dynamic": }, "analyzers": [ { "name": "keywordReverse", "charFilters": [], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "reverse" } ] } ] }
The following query searches the page_updated_by.email
field in
the minutes collection using the
wildcard operator to match any characters preceding the
characters @example.com
in reverse order. The reverse
token
filter can speed up leading wildcard queries.
db.minutes.aggregate([ { "$search": { "index": "default", "wildcard": { "query": "*@example.com", "path": "page_updated_by.email", "allowAnalyzedField": true } } }, { "$project": { "_id": 1, "page_updated_by.email": 1, } } ])
For the preceding query, Atlas Search applies the custom analyzer to the wildcard query to transform the query as follows:
moc.elpmaxe@*
Atlas Search then runs the query against the indexed tokens, which are also reversed. The query returns the following documents:
[ { _id: 1, page_updated_by: { email: 'auerbach@example.com' } }, { _id: 2, page_updated_by: { email: 'ohrback@example.com' } }, { _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }, { _id: 4, page_updated_by: { email: 'levinski@example.com' } } ]
Specifically, Atlas Search creates the following tokens (searchable terms)
for the documents in the results, which it then matches to the query
term moc.elpmaxe@*
:
Document ID | Output Tokens |
---|---|
_id: 1 | moc.elpmaxe@hcabreua |
_id: 2 | moc.elpmaxe@kcabrho |
_id: 3 | moc.elpmaxe@yksniwel |
_id: 4 | moc.elpmaxe@iksnivel |
shingle
The shingle
token filter constructs shingles (token n-grams) from a
series of tokens. You can't use the shingle
token filter in
synonym or autocomplete mapping definitions.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be shingle . |
minShingleSize | integer | yes | Minimum number of tokens per shingle. Must be greater than or
equal to 2 and less than or equal to maxShingleSize . |
maxShingleSize | integer | yes | Maximum number of tokens per shingle. Must be greater than or
equal to minShingleSize . |
Example
The following index definition example on the page_updated_by.email
field in the minutes collection uses
two custom analyzers, emailAutocompleteIndex
and
emailAutocompleteSearch
, to implement autocomplete-like
functionality. Atlas Search uses the emailAutocompleteIndex
analyzer during
index creation to:
Replace
@
characters in a field with AT
Create tokens with the whitespace tokenizer
Shingle tokens
Create edgeGrams of those shingled tokens
Atlas Search uses the emailAutocompleteSearch
analyzer during a search to:
Replace
@
characters in a field with AT
Create tokens with the whitespace tokenizer
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
emailAutocompleteIndex
in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following key and value:
Key | Value |
---|---|
@ | AT |
Click Add character filter to add the character filter to your custom analyzer.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown and enter
15
in the maxTokenLength field.
Expand Token Filters and click Add token filter.
Select shingle from the dropdown and configure the following fields.
Field | Value |
---|---|
minShingleSize | 2 |
maxShingleSize | 3 |
Click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select edgeGram from the dropdown and configure the following fields for the token filter:
Field | Value |
---|---|
minGram | 2 |
maxGram | 15 |
Click Add token filter to add the token filter to your custom analyzer.
Click Add to add the custom analyzer to your index.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type
emailAutocompleteSearch
in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following key and value:
Key | Value |
---|---|
@ | AT |
Click Add character filter to add the character filter to your custom analyzer.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown and enter
15
in the maxTokenLength field.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email nested field.
Select page_updated_by.email nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select
emailAutocompleteIndex
from the Index Analyzer dropdown and emailAutocompleteSearch
from the Search Analyzer dropdown.
Click Add, then Save Changes.
Replace the default index definition with the following example:
1 { 2 "analyzer": "lucene.keyword", 3 "mappings": { 4 "dynamic": true, 5 "fields": { 6 "page_updated_by": { 7 "type": "document", 8 "fields": { 9 "email": { 10 "type": "string", 11 "analyzer": "emailAutocompleteIndex", 12 "searchAnalyzer": "emailAutocompleteSearch" 13 } 14 } 15 } 16 } 17 }, 18 "analyzers": [ 19 { 20 "name": "emailAutocompleteIndex", 21 "charFilters": [ 22 { 23 "mappings": { 24 "@": "AT" 25 }, 26 "type": "mapping" 27 } 28 ], 29 "tokenizer": { 30 "maxTokenLength": 15, 31 "type": "whitespace" 32 }, 33 "tokenFilters": [ 34 { 35 "maxShingleSize": 3, 36 **** "minShingleSize": 2, 37 "type": "shingle" 38 }, 39 { 40 "maxGram": 15, 41 "minGram": 2, 42 "type": "edgeGram" 43 } 44 ] 45 }, 46 { 47 "name": "emailAutocompleteSearch", 48 "charFilters": [ 49 { 50 "mappings": { 51 "@": "AT" 52 }, 53 "type": "mapping" 54 } 55 ], 56 "tokenizer": { 57 "maxTokenLength": 15, 58 "type": "whitespace" 59 } 60 } 61 ] 62 }
The following query searches for an email address in the
page_updated_by.email
field of the minutes collection:
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "auerbach@ex", "path": "page_updated_by.email" } } }, { "$project": { "_id": 1, "page_updated_by.email": 1 } } ])
[ { _id: 1, page_updated_by: { email: 'auerbach@example.com' } } ]
Atlas Search creates search tokens using the emailAutocompleteSearch
analyzer, which it then matches to the index tokens that it created
using the emailAutocompleteIndex
analyzer. The following table
shows the search and index tokens (up to 15 characters) that Atlas Search
creates:
Search Tokens | Index Tokens |
---|---|
auerbachATexamp | au , aue , auer , auerb , auerba ,
auerbac , auerbach , auerbachA , auerbachAT ,
auerbachATe , auerbachATex , auerbachATexa ,
auerbachATexam , auerbachATexamp |
snowballStemming
The snowballStemming token filter stems tokens using a
Snowball-generated stemmer.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be snowballStemming . |
stemmerName | string | yes | Snowball stemmer to apply. Valid values are language-specific stemmer names, such as french in the following example. |
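For quick reference, the tokenFilters entry for this filter takes the stemmerName alongside the type. A minimal sketch matching the french stemmer used in the example below, with a preceding lowercase filter as the example uses so the stemmer receives lowercase tokens:

"tokenFilters": [
  { "type": "lowercase" },
  { "type": "snowballStemming", "stemmerName": "french" }
]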
Example
The following index definition indexes the text.fr_CA
field in
the minutes collection using
a custom analyzer named frenchStemmer
. The custom analyzer
specifies the following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the following token filters on the tokens:
lowercase token filter to convert the tokens to lowercase.
french variant of the snowballStemming token filter to stem words.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type frenchStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select snowballStemming from the dropdown and then select french from the stemmerName dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.fr_CA nested field.
Select text.fr_CA nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select frenchStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "text": { "type": "document", "fields": { "fr_CA": { "type": "string", "analyzer": "frenchStemmer" } } } } }, "analyzers": [ { "name": "frenchStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "snowballStemming", "stemmerName": "french" } ] } ] }
The following query searches the text.fr_CA field in the minutes collection for the term réunion.
db.minutes.aggregate([ { "$search": { "text": { "query": "réunion", "path": "text.fr_CA" } } }, { "$project": { "_id": 1, "text.fr_CA": 1 } } ])
[ { _id: 1, text: { fr_CA: 'Cette page traite des réunions de département' } } ]
Atlas Search returns the document with _id: 1 in the results. Atlas Search matches
the query term to the document because it creates the following
tokens for the document, which it then uses to match the query
term réunion:
Document ID | Output Tokens |
---|---|
_id: 1 | cet , pag , trait , de , réunion , de , départ |
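Because Atlas Search analyzes the query term with the same analyzer at search time, the plural form réunions should stem to the same réunion token and match the same document. A quick check, assuming the index above is named default:

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "réunions",
        "path": "text.fr_CA"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.fr_CA": 1
    }
  }
])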
spanishPluralStemming
The spanishPluralStemming token filter stems Spanish plural words.
It expects lowercase text.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be spanishPluralStemming . |
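Because this filter expects lowercase text, chain it after the lowercase token filter. A minimal tokenFilters sketch, mirroring the example below:

"tokenFilters": [
  { "type": "lowercase" },
  { "type": "spanishPluralStemming" }
]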
Example
The following index definition indexes the text.es_MX
field in
the minutes collection using a
custom analyzer named spanishPluralStemmer
. The custom analyzer
specifies the following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the following token filters on the tokens:
lowercase token filter to convert Spanish terms to lowercase.
spanishPluralStemming token filter to stem plural Spanish words in the tokens into their singular form.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type spanishPluralStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select spanishPluralStemming from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.es_MX nested field.
Select text.es_MX nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select spanishPluralStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "spanishPluralStemmer", "mappings": { "fields": { "text: { "type": "document", "fields": { "es_MX": { "analyzer": "spanishPluralStemmer", "searchAnalyzer": "spanishPluralStemmer", "type": "string" } } } } }, "analyzers": [ { "name": "spanishPluralStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "spanishPluralStemming" } ] } ] }
The following query searches the text.es_MX field in the minutes collection for the Spanish term punto.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "punto", "path": "text.es_MX" } } }, { "$project": { "_id": 1, "text.es_MX": 1 } } ])
[ { _id: 4, text: { es_MX: 'La página ha sido actualizada con los puntos de la agenda.' } } ]
Atlas Search returns the document with _id: 4
because the text.es_MX
field in the document contains the plural term puntos
. Atlas Search
matches this document for the query term punto
because Atlas Search
analyzes puntos
as punto
by stemming the plural (s
) from the
term. Specifically, Atlas Search creates the following tokens (searchable
terms) for the document in the results, which it then uses to match to
the query term:
Document ID | Output Tokens |
---|---|
_id: 4 | la , pagina , ha , sido , actualizada , con , los , punto , de , la , agenda |
stempel
The stempel
token filter uses Lucene's default Polish stemmer table
to stem words in the Polish language. It expects lowercase text.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be stempel . |
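As with spanishPluralStemming, place a lowercase token filter before stempel so the stemmer receives the lowercase text it expects. A minimal tokenFilters sketch, mirroring the example below:

"tokenFilters": [
  { "type": "lowercase" },
  { "type": "stempel" }
]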
Example
The following index definition indexes the text.pl_PL
field in
the minutes collection using a
custom analyzer named stempelStemmer
. The custom analyzer
specifies the following:
Apply the standard tokenizer to create tokens based on word break rules.
Apply the following token filters on the tokens:
lowercase token filter to convert the tokens to lowercase.
stempel token filter to stem the Polish words.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type stempelStemmer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Expand Token Filters and click Add token filter.
Select lowercase from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add token filter to add another token filter.
Select stempel from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.pl_PL nested field.
Select text.pl_PL nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select stempelStemmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "analyzer": "stempelStemmer", "mappings": { "dynamic": true, "fields": { "text.pl_PL": { "analyzer": "stempelStemmer", "searchAnalyzer": "stempelStemmer", "type": "string" } } }, "analyzers": [ { "name": "stempelStemmer", "charFilters": [], "tokenizer": { "type": "standard" }, "tokenFilters": [ { "type": "lowercase" }, { "type": "stempel" } ] } ] }
The following query searches the text.pl_PL field in the minutes collection for the Polish term punkt.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "punkt", "path": "text.pl_PL" } } }, { "$project": { "_id": 1, "text.pl_PL": 1 } } ])
[ { _id: 4, text: { pl_PL: 'Strona została zaktualizowana o punkty porządku obrad.' } } ]
Atlas Search returns the document with _id: 4
because the text.pl_PL
field in the document contains the plural term punkty
. Atlas Search matches
this document for the query term punkt
because Atlas Search analyzes
punkty
as punkt
by stemming the plural (y
) from the term.
Specifically, Atlas Search creates the following tokens (searchable terms) for
the document in the results, which it then matches to the query term:
Document ID | Output Tokens |
---|---|
_id: 4 | strona , zostać , zaktualizować , o , punkt , porządek , obrada |
stopword
The stopword
token filter removes tokens that correspond to the
specified stop words. This token filter doesn't analyze the specified
stop words.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be stopword . |
tokens | array of strings | yes | List that contains the stop words that correspond to the tokens
to remove.
Value must be one or more stop words. |
ignoreCase | boolean | no | Flag that indicates whether to ignore the case of stop words when filtering the tokens to remove. Value can be true or false. Default: true |
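A tokenFilters entry with both the required and optional attributes spelled out might look like the following sketch. Setting ignoreCase here only makes explicit the default, case-insensitive behavior that the example below relies on:

"tokenFilters": [
  {
    "type": "stopword",
    "tokens": ["is", "the", "at"],
    "ignoreCase": true
  }
]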
Example
The following index definition indexes the title
field in the
minutes collection using a custom
analyzer named stopwordRemover
. The custom analyzer specifies the
following:
Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Apply the stopword token filter to remove the tokens that match the defined stop words is, the, and at. The token filter is case-insensitive and removes all tokens that match the specified stop words.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type stopwordRemover in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select stopword from the dropdown and type the following in the tokens field:
is, the, at
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select stopwordRemover from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "text": { "type" : "document", "fields": { "en_US": { "type": "string", "analyzer": "stopwordRemover" } } } } }, "analyzers": [ { "name": "stopwordRemover", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "stopword", "tokens": ["is", "the", "at"] } ] } ] }
The following query searches for the phrase head of the sales
in the text.en_US
field in the minutes collection.
db.minutes.aggregate([
  {
    "$search": {
      "phrase": {
        "query": "head of the sales",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])
[
  {
    _id: 2,
    text: { en_US: 'The head of the sales department spoke first.' }
  }
]
Atlas Search returns the document with _id: 2 because the en_US field
contains the query phrase. During analysis, Atlas Search doesn't create a
token for the stop word the in the document, but the query still matches:
for string fields, Atlas Search also analyzes the query term using the
index analyzer (or, if specified, the searchAnalyzer), so the stop word
is removed from the query term as well, which allows Atlas Search to match
the query term to the document. Specifically, Atlas Search creates the
following tokens for the document in the results:
Document ID | Output Tokens |
---|---|
_id: 2 | head , of , sales , department , spoke , first. |
trim
The trim
token filter trims leading and trailing whitespace from
tokens.
Attributes
It has the following attribute:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be trim . |
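This filter is most useful after tokenizers that can leave surrounding whitespace in a token, such as the keyword tokenizer that the example below uses. A minimal analyzer fragment:

"tokenizer": { "type": "keyword" },
"tokenFilters": [
  { "type": "trim" }
]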
Example
The following index definition indexes the text.en_US field in the
minutes collection using a custom analyzer named tokenTrimmer. The
custom analyzer specifies the following:
Apply the htmlStrip character filter to remove all HTML tags from the text except the a tag.
Apply the keyword tokenizer to create a single token for the entire string.
Apply the trim token filter to remove leading and trailing whitespace in the tokens.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type tokenTrimmer in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select htmlStrip from the dropdown and type a in the ignoredTags field.
Click Add character filter to add the character filter to your custom analyzer.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select trim from the dropdown.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
Select text.en_US nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select tokenTrimmer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "text": { "type": "document", "fields": { "en_US": { "type": "string", "analyzer": "tokenTrimmer" } } } } }, "analyzers": [ { "name": "tokenTrimmer", "charFilters": [{ "type": "htmlStrip", "ignoredTags": ["a"] }], "tokenizer": { "type": "keyword" }, "tokenFilters": [ { "type": "trim" } ] } ] }
The following query searches for the phrase *department meetings*
preceded and followed by any number of other characters in the
text.en_US
field in the minutes
collection.
db.minutes.aggregate([
  {
    "$search": {
      "wildcard": {
        "query": "*department meetings*",
        "path": "text.en_US",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])
[
  {
    _id: 1,
    text: { en_US: '<head> This page deals with department meetings. </head>' }
  }
]
Atlas Search returns the document with _id: 1
because the en_US
field
contains the query term department meetings
. Atlas Search creates the
following token for the document in the results, which shows that
Atlas Search removed the HTML tags, created a single token for the entire
string, and removed leading and trailing whitespace in the token:
Document ID | Output Tokens |
---|---|
_id: 1 | This page deals with department meetings. |
wordDelimiterGraph
The wordDelimiterGraph
token filter splits tokens into sub-tokens
based on configured rules. We recommend that you don't use this token
filter with the standard tokenizer
because this tokenizer removes many of the intra-word delimiters that
this token filter uses to determine boundaries.
Attributes
It has the following attributes:
Name | Type | Required? | Description |
---|---|---|---|
type | string | yes | Human-readable label that identifies this token filter type.
Value must be wordDelimiterGraph . |
delimiterOptions | object | no | Object that contains the rules that determine how to split words into sub-words. Default: |
delimiterOptions.generateWordParts | boolean | no | Flag that indicates whether to split tokens based on sub-words. Default: |
delimiterOptions.generateNumberParts | boolean | no | Flag that indicates whether to split tokens based on sub-numbers. Default: |
delimiterOptions.concatenateWords | boolean | no | Flag that indicates whether to concatenate runs of sub-words. Default: |
delimiterOptions.concatenateNumbers | boolean | no | Flag that indicates whether to concatenate runs of sub-numbers. Default: |
delimiterOptions.concatenateAll | boolean | no | Flag that indicates whether to concatenate all runs. Default: |
delimiterOptions.preserveOriginal | boolean | no | Flag that indicates whether to generate tokens of the original words. Default: |
delimiterOptions.splitOnCaseChange | boolean | no | Flag that indicates whether to split tokens based on letter-case transitions. Default: |
delimiterOptions.splitOnNumerics | boolean | no | Flag that indicates whether to split tokens based on letter-number transitions. Default: |
delimiterOptions.stemEnglishPossessive | boolean | no | Flag that indicates whether to remove trailing possessives from each sub-word. Default: |
delimiterOptions.ignoreKeywords | boolean | no | Flag that indicates whether to skip tokens with the keyword attribute set to true. Default: |
protectedWords | object | no | Object that contains options for protected words. Default: |
protectedWords.words | array | conditional | List that contains the tokens to protect from delimitation. If
you specify protectedWords, you must specify this option. |
protectedWords.ignoreCase | boolean | no | Flag that indicates whether to ignore case for protected words. Default: true |
If true, apply the flattenGraph token filter after this token
filter to make the token stream suitable for indexing.
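For instance, a chain that preserves the original tokens and then flattens the resulting token graph for indexing might look like the following sketch. The preserveOriginal setting here is illustrative only; the example below uses different delimiterOptions:

"tokenFilters": [
  {
    "type": "wordDelimiterGraph",
    "delimiterOptions": { "preserveOriginal": true }
  },
  { "type": "flattenGraph" }
]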
Example
The following index definition indexes the title
field in the
minutes collection using a custom
analyzer named wordDelimiterGraphAnalyzer
. The custom analyzer
specifies the following:
Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.
Apply the wordDelimiterGraph token filter for the following:
Don't split is, the, and at. The exclusion is case-sensitive; for example, Is and tHe are not excluded.
Split tokens on case changes and remove tokens that contain only alphabetical letters from the English alphabet.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type wordDelimiterGraphAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Expand Token Filters and click Add token filter.
Select wordDelimiterGraph from the dropdown and configure the following fields:
Deselect delimiterOptions.generateWordParts and select delimiterOptions.splitOnCaseChange.
Type and then select from the dropdown the words is, the, and at, one at a time, in the protectedWords.words field.
Deselect protectedWords.ignoreCase.
Click Add token filter to add the token filter to your custom analyzer.
Click Add to create the custom analyzer.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title nested field.
Select title nested from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select wordDelimiterGraphAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.
Replace the default index definition with the following example:
{ "mappings": { "fields": { "title": { "type": "string", "analyzer": "wordDelimiterGraphAnalyzer" } } }, "analyzers": [ { "name": "wordDelimiterGraphAnalyzer", "charFilters": [], "tokenizer": { "type": "whitespace" }, "tokenFilters": [ { "type": "wordDelimiterGraph", "protectedWords": { "words": ["is", "the", "at"], "ignoreCase": false }, "delimiterOptions" : { "generateWordParts" : false, "splitOnCaseChange" : true } } ] } ] }
The following query searches the title field in the minutes collection for the term App2.
db.minutes.aggregate([ { "$search": { "index": "default", "text": { "query": "App2", "path": "title" } } }, { "$project": { "_id": 1, "title": 1 } } ])
[ { _id: 4, title: 'The daily huddle on tHe StandUpApp2' } ]
Atlas Search returns the document with _id: 4
because the title
field
in the document contains App2
. Atlas Search splits tokens on case changes
and removes tokens created by a split that contain only alphabetical
letters. It also analyzes the query term using the index analyzer (or
if specified, using the searchAnalyzer
) to split the word on case
change and remove the letters preceding 2
. Specifically, Atlas Search
creates the following tokens for the document with _id: 4 for
the protectedWords
and delimiterOptions
options:
wordDelimiterGraph Options | Output Tokens |
---|---|
protectedWords | The , daily , huddle , on , t , He ,
Stand , Up , App , 2 |
delimiterOptions | The , daily , huddle , on , 2 |