
Tokenizers

On this page

  • edgeGram
    • Attributes
    • Example
  • keyword
    • Attributes
    • Example
  • nGram
    • Attributes
    • Example
  • regexCaptureGroup
    • Attributes
    • Example
  • regexSplit
    • Attributes
    • Example
  • standard
    • Attributes
    • Example
  • uaxUrlEmail
    • Attributes
    • Example
  • whitespace
    • Attributes
    • Example

A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing. Tokenizers require a type field, and some take additional options as well.

"tokenizer": {
"type": "<tokenizer-type>",
"<additional-option>": "<value>"
}
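For example, here is the same structure filled in for a tokenizer type that takes additional options. The minGram and maxGram values are illustrative; the nGram options are documented later on this page:

"tokenizer": {
  "type": "nGram",
  "minGram": 3,
  "maxGram": 5
}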

The following sample index definitions and queries use the sample collection named minutes. If you add the minutes collection to a database in your Atlas cluster, you can create the following sample indexes from the Visual Editor or JSON Editor in the Atlas UI and run the sample queries against this collection. To create these indexes, after you select your preferred configuration method in the Atlas UI, select the database and collection, and refine your index to add custom analyzers that use tokenizers.
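You can also create these indexes programmatically. The following is a minimal sketch, assuming a deployment recent enough (MongoDB 6.0.11 / 7.0.2 or later) that the mongosh db.collection.createSearchIndex() method is available; the index name is illustrative, and the second argument can be any of the sample index definitions on this page:

// Create an Atlas Search index on the minutes collection.
// Replace the definition object with any sample index definition below.
db.minutes.createSearchIndex(
  "keywordExample-index",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": { "analyzer": "keywordExample", "type": "string" }
      }
    },
    "analyzers": [
      {
        "name": "keywordExample",
        "tokenizer": { "type": "keyword" }
      }
    ]
  }
)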

Note

When you add a custom analyzer using the Visual Editor in the Atlas UI, the Atlas UI displays the following details about the analyzer in the Custom Analyzers section.

  • Name: Label that identifies the custom analyzer.

  • Used In: Fields that use the custom analyzer. Value is None if the custom analyzer isn't used to analyze any fields.

  • Character Filters: Atlas Search character filters configured in the custom analyzer.

  • Tokenizer: Atlas Search tokenizer configured in the custom analyzer.

  • Token Filters: Atlas Search token filters configured in the custom analyzer.

  • Actions: Clickable icons that indicate the actions that you can perform on the custom analyzer. Click the edit icon to edit the custom analyzer, or the delete icon to delete it.

edgeGram

The edgeGram tokenizer tokenizes input from the left side, or "edge", of the text into n-grams of given sizes.

It has the following attributes:

Note

The edgeGram tokenizer yields multiple output tokens per word and across words in input text, producing token graphs.

Because autocomplete field type mapping definitions and analyzers with synonym mappings only work when used with non-graph-producing tokenizers, you can't use a custom analyzer with edgeGram tokenizer in the analyzer field for autocomplete field type mapping definitions or analyzers with synonym mappings.

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be edgeGram.

minGram (integer, required)
Number of characters to include in the shortest token created.

maxGram (integer, required)
Number of characters to include in the longest token created.

The following index definition indexes the message field in the minutes collection using a custom analyzer named edgegramExample. It uses the edgeGram tokenizer to create tokens (searchable terms) between 2 and 7 characters long starting from the first character on the left side of words in the message field.

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type edgegramExample in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select edgeGram from the dropdown and type the value for the following fields:

    Field | Value
    minGram | 2
    maxGram | 7

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.

  8. Select message from the Field Name dropdown and String from the Data Type dropdown.

  9. In the properties section for the data type, select edgegramExample from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "edgegramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "edgegramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 7,
        "minGram": 2,
        "type": "edgeGram"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for text that begins with tr.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "tr",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 1, message: 'try to siGn-In' },
{ _id: 3, message: 'try to sign-in' }

Atlas Search returns documents with _id: 1 and _id: 3 in the results because Atlas Search created a token with the value tr using the edgeGram tokenizer for the documents, which matches the search term. If you index the message field using the standard tokenizer, Atlas Search would not return any results for the search term tr.

The following table shows the tokens that the edgeGram tokenizer and, by comparison, the standard tokenizer create for the documents in the results:

Tokenizer | Token Outputs
standard | try, to, sign, in
edgeGram | tr, try, try{SPACE}, try t, try to, try to{SPACE}

keyword

The keyword tokenizer tokenizes the entire input as a single token. Atlas Search doesn't index string fields that exceed 32766 characters using the keyword tokenizer.

It has the following attributes:

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be keyword.

The following index definition indexes the message field in the minutes collection using a custom analyzer named keywordExample. It uses the keyword tokenizer to create a single token (searchable term) from the entire field.

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type keywordExample in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select keyword from the dropdown.

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.

  8. Select message from the Field Name dropdown and String from the Data Type dropdown.

  9. In the properties section for the data type, select keywordExample from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "keywordExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "keywordExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for the phrase try to sign-in.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "try to sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 3, message: 'try to sign-in' }

Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value try to sign-in using the keyword tokenizer for that document, which matches the search term. If you index the message field using the standard tokenizer, Atlas Search returns documents with _id: 1, _id: 2, and _id: 3 for the search term try to sign-in because each document contains some of the tokens the standard tokenizer creates.

The following table shows the tokens that the keyword tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer | Token Outputs
standard | try, to, sign, in
keyword | try to sign-in
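Because the keyword tokenizer emits the entire field as one token, matching is exact and case-sensitive. As a sketch of a case-insensitive variant (the analyzer name keywordLowercaseExample is illustrative and isn't part of the sample indexes on this page), you could pair the tokenizer with the lowercase token filter:

"analyzers": [
  {
    "charFilters": [],
    "name": "keywordLowercaseExample",
    "tokenizer": {
      "type": "keyword"
    },
    "tokenFilters": [
      {
        "type": "lowercase"
      }
    ]
  }
]

With this analyzer applied as both index and search analyzer, a query for try to SIGN-IN would match the document with message try to sign-in, because both the indexed token and the query term are lowercased before comparison.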

nGram

The nGram tokenizer tokenizes input into text chunks, or "n-grams", of given sizes. You can't use a custom analyzer with the nGram tokenizer in the analyzer field for synonym or autocomplete field mapping definitions.

It has the following attributes:

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be nGram.

minGram (integer, required)
Number of characters to include in the shortest token created.

maxGram (integer, required)
Number of characters to include in the longest token created.

The following index definition indexes the title field in the minutes collection using a custom analyzer named ngramExample. It uses the nGram tokenizer to create tokens (searchable terms) between 4 and 6 characters long in the title field.

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type ngramExample in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select nGram from the dropdown and type the value for the following fields:

    Field | Value
    minGram | 4
    maxGram | 6

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.

  8. Select title from the Field Name dropdown and String from the Data Type dropdown.

  9. In the properties section for the data type, select ngramExample from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "title": {
        "analyzer": "ngramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "ngramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 6,
        "minGram": 4,
        "type": "nGram"
      }
    }
  ]
}

The following query searches the title field in the minutes collection for the term week.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "week",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])
{ _id: 1, title: "The team's weekly meeting" }

Atlas Search returns the document with _id: 1 in the results because Atlas Search created a token with the value week using the nGram tokenizer for the documents, which matches the search term. If you index the title field using the standard or edgeGram tokenizer, Atlas Search would not return any results for the search term week.

The following table shows the tokens that the nGram tokenizer and, by comparison, the standard and edgeGram tokenizers create for the document with _id: 1:

Tokenizer | Token Outputs
standard | The, team's, weekly, meeting
edgeGram | The{SPACE}, The t, The te
nGram | The{SPACE}, The t, The te, he t, ..., week, weekl, weekly, eekl, ..., eetin, eeting, etin, eting, ting

regexCaptureGroup

The regexCaptureGroup tokenizer matches a regular expression pattern to extract tokens.

It has the following attributes:

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be regexCaptureGroup.

pattern (string, required)
Regular expression to match against.

group (integer, required)
Index of the character group within the matching expression to extract into tokens. Use 0 to extract the entire match.
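For illustration, the following sketch (this pattern is hypothetical and isn't part of the sample index below) uses group: 1 to emit only the first parenthesized capture group, the three digits of a US area code, as the token. A group value of 0 would emit the entire match, parentheses included:

"tokenizer": {
  "type": "regexCaptureGroup",
  "pattern": "\\((\\d{3})\\)",
  "group": 1
}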

The following index definition indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named phoneNumberExtractor. It uses the following:

  • mappings character filter to remove parentheses around the first three digits and replace all spaces and periods with dashes

  • regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type phoneNumberExtractor in the Analyzer Name field.

  4. Expand Character Filters and click Add character filter.

  5. Select mapping from the dropdown and click Add mapping.

  6. Enter the following characters in the Original field, one at a time, and type the corresponding character in the Replacement field. To remove a character, leave its Replacement field empty.

    Original | Replacement
    (space) | -
    . | -
    ( | (leave empty)
    ) | (leave empty)

  7. Click Add character filter.

  8. Expand Tokenizer if it's collapsed.

  9. Select regexCaptureGroup from the dropdown and type the value for the following fields:

    Field | Value
    pattern | ^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$
    group | 0

  10. Click Add to add the custom analyzer to your index.

  11. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.

  12. Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.

  13. In the properties section for the data type, select phoneNumberExtractor from the Index Analyzer and Search Analyzer dropdowns.

  14. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "phoneNumberExtractor",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [
        {
          "mappings": {
            " ": "-",
            "(": "",
            ")": "",
            ".": "-"
          },
          "type": "mapping"
        }
      ],
      "name": "phoneNumberExtractor",
      "tokenFilters": [],
      "tokenizer": {
        "group": 0,
        "pattern": "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$",
        "type": "regexCaptureGroup"
      }
    }
  ]
}

The following query searches the page_updated_by.phone field in the minutes collection for the phone number 123-456-9870.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "123-456-9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])
{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }

Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value 123-456-9870 using the regexCaptureGroup tokenizer for that document, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, Atlas Search returns all of the documents for the search term 123-456-9870.

The following table shows the tokens that the regexCaptureGroup tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer | Token Outputs
standard | 123, 456.9870
regexCaptureGroup | 123-456-9870

regexSplit

The regexSplit tokenizer splits input into tokens using a regular-expression-based delimiter.

It has the following attributes:

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be regexSplit.

pattern (string, required)
Regular expression to match against.

The following index definition indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named dashDotSpaceSplitter. It uses the regexSplit tokenizer to create tokens (searchable terms) by splitting the page_updated_by.phone field on one or more hyphens, periods, and spaces.

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type dashDotSpaceSplitter in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select regexSplit from the dropdown and type the value for the following field:

    Field | Value
    pattern | [-. ]+

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.

  8. Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.

  9. In the properties section for the data type, select dashDotSpaceSplitter from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "dashDotSpaceSplitter",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "dashDotSpaceSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[-. ]+",
        "type": "regexSplit"
      }
    }
  ]
}

The following query searches the page_updated_by.phone field in the minutes collection for the digits 9870.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])
{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }

Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value 9870 using the regexSplit tokenizer for the documents, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, Atlas Search would not return any results for the search term 9870.

The following table shows the tokens that the regexSplit tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer | Token Outputs
standard | 123, 456.9870
regexSplit | (123), 456, 9870

standard

The standard tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm.

It has the following attributes:

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be standard.

maxTokenLength (integer, optional)
Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. Default: 255.
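For example, with maxTokenLength set to 10 (an illustrative value), a 13-character word such as organizations would be indexed as the two tokens organizati and ons:

"tokenizer": {
  "type": "standard",
  "maxTokenLength": 10
}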

The following index definition indexes the message field in the minutes collection using a custom analyzer named standardExample. It uses the standard tokenizer.

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type standardExample in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select standard from the dropdown.

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.

  8. Select message from the Field Name dropdown and String from the Data Type dropdown.

  9. In the properties section for the data type, select standardExample from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "standardExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "standardExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "standard"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for the term signature.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "signature",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 4, message: 'write down your signature or phone №' }

Atlas Search returns the document with _id: 4 because Atlas Search created a token with the value signature using the standard tokenizer for the documents, which matches the search term. If you index the message field using the keyword tokenizer, Atlas Search would not return any results for the search term signature.

The following table shows the tokens that the standard tokenizer and, by comparison, the keyword tokenizer create for the document with _id: 4:

Tokenizer | Token Outputs
standard | write, down, your, signature, or, phone
keyword | write down your signature or phone №

uaxUrlEmail

The uaxUrlEmail tokenizer tokenizes URLs and email addresses. Although the uaxUrlEmail tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using it only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules.

It has the following attributes:

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be uaxUrlEmail.

maxTokenLength (integer, optional)
Maximum number of characters in one token. Default: 255.

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named basicEmailAddressAnalyzer. It uses the uaxUrlEmail tokenizer to create tokens (searchable terms) from URLs and email addresses in the page_updated_by.email field.

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type basicEmailAddressAnalyzer in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select uaxUrlEmail from the dropdown.

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.

  8. Select page_updated_by.email from the Field Name dropdown and String from the Data Type dropdown.

  9. In the properties section for the data type, select basicEmailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "email": {
            "analyzer": "basicEmailAddressAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "basicEmailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
}

The following query searches the page_updated_by.email field in the minutes collection for the email lewinsky@example.com.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])
{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }

Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value lewinsky@example.com using the uaxUrlEmail tokenizer for the documents, which matches the search term. If you index the page_updated_by.email field using the standard tokenizer, Atlas Search returns all the documents for the search term lewinsky@example.com.

The following table shows the tokens that the uaxUrlEmail tokenizer and by comparison, the standard tokenizer, create for the document with _id: 3:

Tokenizer
Token Outputs

standard

lewinsky, example.com

uaxUrlEmail

lewinsky@example.com

The following index definition indexes the page_updated_by.email field in the minutes collection as the autocomplete type using a custom analyzer named emailAddressAnalyzer. It uses the following:

  • uaxUrlEmail tokenizer to create tokens (searchable terms) from URLs and email addresses in the page_updated_by.email field

  • autocomplete type mapping with edgeGram tokenization for the indexed field

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type emailAddressAnalyzer in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select uaxUrlEmail from the dropdown.

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.

  8. Select page_updated_by.email from the Field Name dropdown and Autocomplete from the Data Type dropdown.

  9. In the properties section for the data type, select emailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "email": {
            "analyzer": "emailAddressAnalyzer",
            "tokenization": "edgeGram",
            "type": "autocomplete"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "emailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
}

The following query searches the page_updated_by.email field in the minutes collection for the term lewinsky@example.com.

db.minutes.aggregate([
  {
    "$search": {
      "autocomplete": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])
{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }

Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value lewinsky@example.com using the uaxUrlEmail tokenizer for the documents, which matches the search term. If you index the page_updated_by.email field using the standard tokenizer, Atlas Search returns all the documents for the search term lewinsky@example.com.

The following table shows the tokens that the uaxUrlEmail tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer | Atlas Search Field Type | Token Outputs
standard | autocomplete (edgeGram) | le, lew, lewi, lewin, lewins, lewinsk, lewinsky, lewinsky@, lewinsky, ex, exa, exam, examp, exampl, example, example., example.c, example.co, example.com
uaxUrlEmail | autocomplete (edgeGram) | le, lew, lewi, lewin, lewins, lewinsk, lewinsky, lewinsky@, lewinsky@e, lewinsky@ex, lewinsky@exa, lewinsky@exam, lewinsky@examp, lewinsky@exampl

whitespace

The whitespace tokenizer tokenizes based on occurrences of whitespace between words.

It has the following attributes:

type (string, required)
Human-readable label that identifies this tokenizer type. Value must be whitespace.

maxTokenLength (integer, optional)
Maximum length for a single token. Tokens greater than this length are split at maxTokenLength into multiple tokens. Default: 255.

The following index definition indexes the message field in the minutes collection using a custom analyzer named whitespaceExample. It uses the whitespace tokenizer to create tokens (searchable terms) by splitting on whitespace in the message field.

  1. In the Custom Analyzers section, click Add Custom Analyzer.

  2. Select the Create Your Own radio button and click Next.

  3. Type whitespaceExample in the Analyzer Name field.

  4. Expand Tokenizer if it's collapsed.

  5. Select whitespace from the dropdown.

  6. Click Add to add the custom analyzer to your index.

  7. In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.

  8. Select message from the Field Name dropdown and String from the Data Type dropdown.

  9. In the properties section for the data type, select whitespaceExample from the Index Analyzer and Search Analyzer dropdowns.

  10. Click Add, then Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "whitespaceExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "whitespaceExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for the term SIGN-IN.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "SIGN-IN",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])
{ _id: 2, message: 'do not forget to SIGN-IN' }

Atlas Search returns the document with _id: 2 in the results because Atlas Search created a token with the value SIGN-IN using the whitespace tokenizer for that document, which matches the search term. If you index the message field using the standard tokenizer, Atlas Search returns documents with _id: 1, _id: 2, and _id: 3 for the search term SIGN-IN.

The following table shows the tokens that the whitespace tokenizer and, by comparison, the standard tokenizer create for the document with _id: 2:

Tokenizer | Token Outputs
standard | do, not, forget, to, sign, in
whitespace | do, not, forget, to, SIGN-IN
