Tokenizers
A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing. Tokenizers require a type field, and some take additional options as well.
"tokenizer": { "type": "<tokenizer-type>", "<additional-option>": "<value>" }
The following sample index definitions and queries use the sample collection named `minutes`. If you add the `minutes` collection to a database in your Atlas cluster, you can create the following sample indexes from the Visual Editor or JSON Editor in the Atlas UI and run the sample queries against this collection. To create these indexes, after you select your preferred configuration method in the Atlas UI, select the database and collection, and refine your index to add custom analyzers that use tokenizers.
Note
When you add a custom analyzer using the Visual Editor in the Atlas UI, the Atlas UI displays the following details about the analyzer in the Custom Analyzers section.

UI Element | Description
---|---
Name | Label that identifies the custom analyzer.
Used In | Fields that use the custom analyzer. Value is None if the custom analyzer isn't used to analyze any fields.
Character Filters | Atlas Search character filters configured in the custom analyzer.
Tokenizer | Atlas Search tokenizer configured in the custom analyzer.
Token Filters | Atlas Search token filters configured in the custom analyzer.
Actions | Clickable icons that indicate the actions that you can perform on the custom analyzer.
edgeGram
The `edgeGram` tokenizer tokenizes input from the left side, or "edge", of a text input into n-grams of given sizes. You can't use a custom analyzer with the `edgeGram` tokenizer in the `analyzer` field for synonym or autocomplete field mapping definitions.
Attributes
It has the following attributes:
Note
The `edgeGram` tokenizer yields multiple output tokens per word and across words in input text, producing token graphs. Because autocomplete field type mapping definitions and analyzers with synonym mappings only work when used with non-graph-producing tokenizers, you can't use a custom analyzer with the `edgeGram` tokenizer in the `analyzer` field for autocomplete field type mapping definitions or analyzers with synonym mappings.
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `edgeGram`.
`minGram` | integer | yes | Number of characters to include in the shortest token created.
`maxGram` | integer | yes | Number of characters to include in the longest token created.
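To make the `minGram` and `maxGram` semantics concrete, the following standalone JavaScript sketch approximates how edge-gram tokenization derives tokens from the left edge of its input. It is an illustration only, not Atlas Search's implementation, and the function name `edgeGrams` is hypothetical.

// Approximation of edgeGram tokenization: emit every prefix of the
// input whose length falls between minGram and maxGram (inclusive).
function edgeGrams(input, minGram, maxGram) {
  const tokens = [];
  for (let len = minGram; len <= Math.min(maxGram, input.length); len++) {
    tokens.push(input.slice(0, len));
  }
  return tokens;
}

console.log(edgeGrams("try to sign-in", 2, 7));
// [ 'tr', 'try', 'try ', 'try t', 'try to', 'try to ' ]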
Example
The following index definition indexes the `message` field in the `minutes` collection using a custom analyzer named `edgegramExample`. It uses the `edgeGram` tokenizer to create tokens (searchable terms) between 2 and 7 characters long, starting from the first character on the left side of words in the `message` field.
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `edgegramExample` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select edgeGram from the dropdown and type the value for the following fields:

   Field | Value
   ---|---
   minGram | 2
   maxGram | 7

6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
8. Select message from the Field Name dropdown and String from the Data Type dropdown.
9. In the properties section for the data type, select `edgegramExample` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "message": { "analyzer": "edgegramExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "edgegramExample", "tokenFilters": [], "tokenizer": { "maxGram": 7, "minGram": 2, "type": "edgeGram" } } ] }
The following query searches the `message` field in the `minutes` collection for text that begins with `tr`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "tr",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 1, message: 'try to siGn-In' }, { _id: 3, message: 'try to sign-in' }
Atlas Search returns documents with `_id: 1` and `_id: 3` in the results because Atlas Search created a token with the value `tr` using the `edgeGram` tokenizer for the documents, which matches the search term. If you index the `message` field using the `standard` tokenizer, Atlas Search would not return any results for the search term `tr`.
The following table shows the tokens that the `edgeGram` tokenizer and, by comparison, the `standard` tokenizer create for the documents in the results:

Tokenizer | Token Outputs
---|---
`edgeGram` | `tr`, `try`, `try `, `try t`, `try to`, `try to `
`standard` | `try`, `to`, `siGn`, `In` (for `_id: 1`); `try`, `to`, `sign`, `in` (for `_id: 3`)
keyword
The `keyword` tokenizer tokenizes the entire input as a single token. Atlas Search doesn't index string fields that exceed 32766 characters using the `keyword` tokenizer.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `keyword`.
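Conceptually, the `keyword` tokenizer is the identity mapping: the entire input becomes one token, so queries must match the whole field value. A minimal JavaScript sketch (illustrative only; `keywordTokens` is a hypothetical name):

// The whole input becomes a single token, so matching is effectively
// an exact, whole-field comparison.
function keywordTokens(input) {
  return [input];
}

console.log(keywordTokens("try to sign-in"));
// [ 'try to sign-in' ]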
Example
The following index definition indexes the `message` field in the `minutes` collection using a custom analyzer named `keywordExample`. It uses the `keyword` tokenizer to create a single token (searchable term) from the entire field.
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `keywordExample` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select keyword from the dropdown.
6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
8. Select message from the Field Name dropdown and String from the Data Type dropdown.
9. In the properties section for the data type, select `keywordExample` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "message": { "analyzer": "keywordExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "keywordExample", "tokenFilters": [], "tokenizer": { "type": "keyword" } } ] }
The following query searches the `message` field in the `minutes` collection for the phrase `try to sign-in`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "try to sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 3, message: 'try to sign-in' }
Atlas Search returns the document with `_id: 3` in the results because Atlas Search created a token with the value `try to sign-in` using the `keyword` tokenizer for the document, which matches the search term. If you index the `message` field using the `standard` tokenizer, Atlas Search returns documents with `_id: 1`, `_id: 2`, and `_id: 3` for the search term `try to sign-in` because each document contains some of the tokens the `standard` tokenizer creates.
The following table shows the tokens that the `keyword` tokenizer and, by comparison, the `standard` tokenizer create for the document with `_id: 3`:

Tokenizer | Token Outputs
---|---
`keyword` | `try to sign-in`
`standard` | `try`, `to`, `sign`, `in`
nGram
The `nGram` tokenizer tokenizes input into text chunks, or "n-grams", of given sizes. You can't use a custom analyzer with the `nGram` tokenizer in the `analyzer` field for synonym or autocomplete field mapping definitions.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `nGram`.
`minGram` | integer | yes | Number of characters to include in the shortest token created.
`maxGram` | integer | yes | Number of characters to include in the longest token created.
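By contrast with `edgeGram`, n-grams slide across the whole input rather than anchoring at the left edge. The following JavaScript sketch approximates the behavior (illustrative only; `nGrams` is a hypothetical name):

// Approximation of nGram tokenization: emit every substring of the
// input whose length falls between minGram and maxGram (inclusive).
function nGrams(input, minGram, maxGram) {
  const tokens = [];
  for (let len = minGram; len <= maxGram; len++) {
    for (let start = 0; start + len <= input.length; start++) {
      tokens.push(input.slice(start, start + len));
    }
  }
  return tokens;
}

console.log(nGrams("weekly", 4, 6));
// [ 'week', 'eekl', 'ekly', 'weekl', 'eekly', 'weekly' ]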
Example
The following index definition indexes the `title` field in the `minutes` collection using a custom analyzer named `ngramExample`. It uses the `nGram` tokenizer to create tokens (searchable terms) between 4 and 6 characters long in the `title` field.
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `ngramExample` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select nGram from the dropdown and type the value for the following fields:

   Field | Value
   ---|---
   minGram | 4
   maxGram | 6

6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
8. Select title from the Field Name dropdown and String from the Data Type dropdown.
9. In the properties section for the data type, select `ngramExample` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "title": { "analyzer": "ngramExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "ngramExample", "tokenFilters": [], "tokenizer": { "maxGram": 6, "minGram": 4, "type": "nGram" } } ] }
The following query searches the `title` field in the `minutes` collection for the term `week`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "week",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])
{ _id: 1, title: "The team's weekly meeting" }
Atlas Search returns the document with `_id: 1` in the results because Atlas Search created a token with the value `week` using the `nGram` tokenizer for the document, which matches the search term. If you index the `title` field using the `standard` or `edgeGram` tokenizer, Atlas Search would not return any results for the search term `week`.
The following table shows the tokens that the `nGram` tokenizer and, by comparison, the `standard` and `edgeGram` tokenizers create for the document with `_id: 1`:

Tokenizer | Token Outputs
---|---
`nGram` | Every 4- to 6-character substring of the title, including `week`, `weekl`, and `weekly`.
`edgeGram` | `The `, `The t`, `The te`
`standard` | `The`, `team's`, `weekly`, `meeting`
regexCaptureGroup
The `regexCaptureGroup` tokenizer matches a regular expression pattern to extract tokens.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `regexCaptureGroup`.
`pattern` | string | yes | Regular expression to match against.
`group` | integer | yes | Index of the character group within the matching expression to extract into tokens. Use `0` to extract all character groups.
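The `group` option selects which part of each match becomes a token: `0` takes the entire match, while `1`, `2`, and so on take the corresponding capture group. The following JavaScript sketch approximates this (illustrative only; `regexCaptureGroupTokens` is a hypothetical name):

// Approximation of regexCaptureGroup tokenization: each regex match
// contributes one token, taken from the requested capture group
// (group 0 is the entire match).
function regexCaptureGroupTokens(input, pattern, group) {
  const tokens = [];
  for (const match of input.matchAll(new RegExp(pattern, "g"))) {
    if (match[group] !== undefined) tokens.push(match[group]);
  }
  return tokens;
}

// Extract only the area code (capture group 1) from a US-formatted number.
console.log(regexCaptureGroupTokens("123-456-9870", "(\\d{3})-\\d{3}-\\d{4}", 1));
// [ '123' ]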
Example
The following index definition indexes the `page_updated_by.phone` field in the `minutes` collection using a custom analyzer named `phoneNumberExtractor`. It uses the following:

- `mappings` character filter to remove parentheses around the first three digits and replace all spaces and periods with dashes
- `regexCaptureGroup` tokenizer to create a single token from the first US-formatted phone number present in the text input
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `phoneNumberExtractor` in the Analyzer Name field.
4. Expand Character Filters and click Add character filter.
5. Select mapping from the dropdown and click Add mapping.
6. Enter the following characters in the Original and Replacement fields, leaving the Replacement field empty where shown:

   Original | Replacement
   ---|---
   ` ` (space) | `-`
   `.` | `-`
   `(` | (empty)
   `)` | (empty)

7. Click Add character filter.
8. Expand Tokenizer if it's collapsed.
9. Select regexCaptureGroup from the dropdown and type the value for the following fields:

   Field | Value
   ---|---
   pattern | `^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$`
   group | `0`

10. Click Add to add the custom analyzer to your index.
11. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
12. Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
13. In the properties section for the data type, select `phoneNumberExtractor` from the Index Analyzer and Search Analyzer dropdowns.
14. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "page_updated_by": { "fields": { "phone": { "analyzer": "phoneNumberExtractor", "type": "string" } }, "type": "document" } } }, "analyzers": [ { "charFilters": [ { "mappings": { " ": "-", "(": "", ")": "", ".": "-" }, "type": "mapping" } ], "name": "phoneNumberExtractor", "tokenFilters": [], "tokenizer": { "group": 0, "pattern": "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$", "type": "regexCaptureGroup" } } ] }
The following query searches the `page_updated_by.phone` field in the `minutes` collection for the phone number `123-456-9870`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "123-456-9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])
{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }
Atlas Search returns the document with `_id: 3` in the results because Atlas Search created a token with the value `123-456-9870` using the `regexCaptureGroup` tokenizer for the document, which matches the search term. If you index the `page_updated_by.phone` field using the `standard` tokenizer, Atlas Search returns all of the documents for the search term `123-456-9870`.
The following table shows the tokens that the `regexCaptureGroup` tokenizer and, by comparison, the `standard` tokenizer create for the document with `_id: 3`:

Tokenizer | Token Outputs
---|---
`regexCaptureGroup` | `123-456-9870`
`standard` | `123`, `456.9870`
regexSplit
The `regexSplit` tokenizer splits tokens with a regular-expression-based delimiter.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `regexSplit`.
`pattern` | string | yes | Regular expression to match against.
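Unlike `regexCaptureGroup`, which keeps what the pattern matches, `regexSplit` keeps what lies between matches. A JavaScript sketch of the idea (illustrative only; `regexSplitTokens` is a hypothetical name):

// Approximation of regexSplit tokenization: the pattern acts as a
// delimiter, and the text between matches becomes the tokens.
function regexSplitTokens(input, pattern) {
  return input.split(new RegExp(pattern)).filter((t) => t.length > 0);
}

console.log(regexSplitTokens("(123).456.9870", "[-. ]+"));
// [ '(123)', '456', '9870' ]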
Example
The following index definition indexes the `page_updated_by.phone` field in the `minutes` collection using a custom analyzer named `dashDotSpaceSplitter`. It uses the `regexSplit` tokenizer to create tokens (searchable terms) by splitting on one or more hyphens, periods, and spaces in the `page_updated_by.phone` field.
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `dashDotSpaceSplitter` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select regexSplit from the dropdown and type the value for the following field:

   Field | Value
   ---|---
   pattern | `[-. ]+`

6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
8. Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
9. In the properties section for the data type, select `dashDotSpaceSplitter` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "page_updated_by": { "fields": { "phone": { "analyzer": "dashDotSpaceSplitter", "type": "string" } }, "type": "document" } } }, "analyzers": [ { "charFilters": [], "name": "dashDotSpaceSplitter", "tokenFilters": [], "tokenizer": { "pattern": "[-. ]+", "type": "regexSplit" } } ] }
The following query searches the `page_updated_by.phone` field in the `minutes` collection for the digits `9870`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])
{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }
Atlas Search returns the document with `_id: 3` in the results because Atlas Search created a token with the value `9870` using the `regexSplit` tokenizer for the document, which matches the search term. If you index the `page_updated_by.phone` field using the `standard` tokenizer, Atlas Search would not return any results for the search term `9870`.
The following table shows the tokens that the `regexSplit` tokenizer and, by comparison, the `standard` tokenizer create for the document with `_id: 3`:

Tokenizer | Token Outputs
---|---
`regexSplit` | `(123)`, `456`, `9870`
`standard` | `123`, `456.9870`
standard
The `standard` tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `standard`.
`maxTokenLength` | integer | no | Maximum length for a single token. Tokens greater than this length are split at `maxTokenLength` into multiple tokens. Default: `255`.
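Modern JavaScript runtimes ship `Intl.Segmenter`, which implements the same Unicode word-segmentation rules, so the following sketch gives a rough feel for the `standard` tokenizer's output. It is an approximation; Atlas Search uses Lucene's implementation, so edge cases can differ.

// Rough approximation of the standard tokenizer using the Unicode
// word-break rules built into Intl.Segmenter (Node.js 16+).
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const tokens = [...segmenter.segment("write down your signature")]
  .filter((s) => s.isWordLike) // drop whitespace and punctuation
  .map((s) => s.segment);

console.log(tokens);
// [ 'write', 'down', 'your', 'signature' ]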
Example
The following index definition indexes the `message` field in the `minutes` collection using a custom analyzer named `standardExample`. It uses the `standard` tokenizer.
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `standardExample` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select standard from the dropdown.
6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
8. Select message from the Field Name dropdown and String from the Data Type dropdown.
9. In the properties section for the data type, select `standardExample` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "message": { "analyzer": "standardExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "standardExample", "tokenFilters": [], "tokenizer": { "type": "standard" } } ] }
The following query searches the `message` field in the `minutes` collection for the term `signature`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "signature",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 4, message: 'write down your signature or phone №' }
Atlas Search returns the document with `_id: 4` because Atlas Search created a token with the value `signature` using the `standard` tokenizer for the document, which matches the search term. If you index the `message` field using the `keyword` tokenizer, Atlas Search would not return any results for the search term `signature`.
The following table shows the tokens that the `standard` tokenizer and, by comparison, the `keyword` tokenizer create for the document with `_id: 4`:

Tokenizer | Token Outputs
---|---
`standard` | `write`, `down`, `your`, `signature`, `or`, `phone`
`keyword` | `write down your signature or phone №`
uaxUrlEmail
The `uaxUrlEmail` tokenizer tokenizes URLs and email addresses. Although the `uaxUrlEmail` tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using the `uaxUrlEmail` tokenizer only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the `standard` tokenizer to create tokens based on word break rules.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `uaxUrlEmail`.
`maxTokenLength` | int | no | Maximum number of characters in one token. Default: `255`.
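To see why the `standard` tokenizer is a poor fit for email fields, note that plain Unicode word-break rules split on `@`. The following sketch (illustrative only, using `Intl.Segmenter` as a stand-in for the standard tokenizer) shows an address being broken apart:

// Plain word segmentation splits an email address at the '@', which is
// why uaxUrlEmail exists: it keeps the whole address as one token.
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
const tokens = [...segmenter.segment("lewinsky@example.com")]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

console.log(tokens);
// e.g. [ 'lewinsky', 'example.com' ] -- uaxUrlEmail would instead
// produce the single token 'lewinsky@example.com'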
Example
The following index definition indexes the `page_updated_by.email` field in the `minutes` collection using a custom analyzer named `basicEmailAddressAnalyzer`. It uses the `uaxUrlEmail` tokenizer to create tokens (searchable terms) from URLs and email addresses in the `page_updated_by.email` field.
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `basicEmailAddressAnalyzer` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select uaxUrlEmail from the dropdown.
6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
8. Select page_updated_by.email from the Field Name dropdown and String from the Data Type dropdown.
9. In the properties section for the data type, select `basicEmailAddressAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "fields": { "page_updated_by": { "fields": { "email": { "analyzer": "basicEmailAddressAnalyzer", "type": "string" } }, "type": "document" } } }, "analyzers": [ { "name": "basicEmailAddressAnalyzer", "tokenizer": { "type": "uaxUrlEmail" } } ] }
The following query searches the `page_updated_by.email` field in the `minutes` collection for the email `lewinsky@example.com`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])
{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }
Atlas Search returns the document with `_id: 3` in the results because Atlas Search created a token with the value `lewinsky@example.com` using the `uaxUrlEmail` tokenizer for the document, which matches the search term. If you index the `page_updated_by.email` field using the `standard` tokenizer, Atlas Search returns all the documents for the search term `lewinsky@example.com`.
The following table shows the tokens that the `uaxUrlEmail` tokenizer and, by comparison, the `standard` tokenizer create for the document with `_id: 3`:

Tokenizer | Token Outputs
---|---
`uaxUrlEmail` | `lewinsky@example.com`
`standard` | `lewinsky`, `example.com`
The following index definition indexes the `page_updated_by.email` field in the `minutes` collection using a custom analyzer named `emailAddressAnalyzer`. It uses the following:

- The autocomplete type with an `edgeGram` tokenization strategy
- The `uaxUrlEmail` tokenizer to create tokens (searchable terms) from URLs and email addresses
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `emailAddressAnalyzer` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select uaxUrlEmail from the dropdown.
6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
8. Select page_updated_by.email from the Field Name dropdown and Autocomplete from the Data Type dropdown.
9. In the properties section for the data type, select `emailAddressAnalyzer` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "fields": { "page_updated_by": { "fields": { "email": { "analyzer": "emailAddressAnalyzer", "tokenization": "edgeGram", "type": "autocomplete" } }, "type": "document" } } }, "analyzers": [ { "name": "emailAddressAnalyzer", "tokenizer": { "type": "uaxUrlEmail" } } ] }
The following query searches the `page_updated_by.email` field in the `minutes` collection for the email `lewinsky@example.com`.
db.minutes.aggregate([
  {
    "$search": {
      "autocomplete": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])
{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }
Atlas Search returns the document with `_id: 3` in the results because Atlas Search created a token with the value `lewinsky@example.com` using the `uaxUrlEmail` tokenizer for the document, which matches the search term. If you index the `page_updated_by.email` field using the `standard` tokenizer, Atlas Search returns all the documents for the search term `lewinsky@example.com`.
The following table shows the tokens that the `uaxUrlEmail` tokenizer and, by comparison, the `standard` tokenizer create for the document with `_id: 3`, assuming the default autocomplete `minGrams` of 2 and `maxGrams` of 15:

Tokenizer | Atlas Search Field Type | Token Outputs
---|---|---
`uaxUrlEmail` | `autocomplete` with `edgeGram` tokenization | `le`, `lew`, `lewi`, `lewin`, `lewins`, `lewinsk`, `lewinsky`, `lewinsky@`, `lewinsky@e`, `lewinsky@ex`, `lewinsky@exa`, `lewinsky@exam`, `lewinsky@examp`, `lewinsky@exampl`
`standard` | `string` | `lewinsky`, `example.com`
whitespace
The `whitespace` tokenizer tokenizes based on occurrences of whitespace between words.
Attributes
It has the following attributes:
Name | Type | Required? | Description
---|---|---|---
`type` | string | yes | Human-readable label that identifies this tokenizer type. Value must be `whitespace`.
`maxTokenLength` | integer | no | Maximum length for a single token. Tokens greater than this length are split at `maxTokenLength` into multiple tokens. Default: `255`.
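A minimal JavaScript sketch of whitespace tokenization (illustrative only; `whitespaceTokens` is a hypothetical name). Note that case and punctuation survive intact, which is what makes the `SIGN-IN` example below work:

// Approximation of whitespace tokenization: split on runs of
// whitespace and keep everything else, including case and punctuation.
function whitespaceTokens(input) {
  return input.split(/\s+/).filter((t) => t.length > 0);
}

console.log(whitespaceTokens("do not forget to SIGN-IN"));
// [ 'do', 'not', 'forget', 'to', 'SIGN-IN' ]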
Example
The following index definition indexes the `message` field in the `minutes` collection using a custom analyzer named `whitespaceExample`. It uses the `whitespace` tokenizer to create tokens (searchable terms) at any whitespace in the `message` field.
1. In the Custom Analyzers section, click Add Custom Analyzer.
2. Select the Create Your Own radio button and click Next.
3. Type `whitespaceExample` in the Analyzer Name field.
4. Expand Tokenizer if it's collapsed.
5. Select whitespace from the dropdown.
6. Click Add to add the custom analyzer to your index.
7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
8. Select message from the Field Name dropdown and String from the Data Type dropdown.
9. In the properties section for the data type, select `whitespaceExample` from the Index Analyzer and Search Analyzer dropdowns.
10. Click Add, then Save Changes.
Replace the default index definition with the following:
{ "mappings": { "dynamic": true, "fields": { "message": { "analyzer": "whitespaceExample", "type": "string" } } }, "analyzers": [ { "charFilters": [], "name": "whitespaceExample", "tokenFilters": [], "tokenizer": { "type": "whitespace" } } ] }
The following query searches the `message` field in the `minutes` collection for the term `SIGN-IN`.
db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "SIGN-IN",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 2, message: 'do not forget to SIGN-IN' }
Atlas Search returns the document with `_id: 2` in the results because Atlas Search created a token with the value `SIGN-IN` using the `whitespace` tokenizer for the document, which matches the search term. If you index the `message` field using the `standard` tokenizer, Atlas Search returns documents with `_id: 1`, `_id: 2`, and `_id: 3` for the search term `SIGN-IN`.
The following table shows the tokens that the `whitespace` tokenizer and, by comparison, the `standard` tokenizer create for the document with `_id: 2`:

Tokenizer | Token Outputs
---|---
`whitespace` | `do`, `not`, `forget`, `to`, `SIGN-IN`
`standard` | `do`, `not`, `forget`, `to`, `SIGN`, `IN`