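Tokenizers

A custom analyzer's tokenizer determines how Atlas Search splits up text into discrete chunks for indexing. Tokenizers require a type field, and some take additional options as well.

Syntax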

"tokenizer": {
  "type": "<tokenizer-type>",
  "<additional-option>": "<value>"
}

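Tokenizer Types

Atlas Search supports the following types of tokenizer:

edgeGram
keyword
nGram
regexCaptureGroup
regexSplit
standard
uaxUrlEmail
whitespace

The following sample index definitions and queries use the sample collection named minutes. To follow along with these examples, load the minutes collection on your cluster and navigate to the Create a Search Index page in the Atlas UI by following the steps in the Create an Atlas Search Index tutorial. Then, select the minutes collection as your data source, and follow the example procedure to create an index from the Visual Editor or the JSON editor.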

edgeGram

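The edgeGram tokenizer tokenizes input from the left side, or "edge", of a text input into n-grams of given sizes. You can't use a custom analyzer with the edgeGram tokenizer in the analyzer field for synonym or autocomplete field mapping definitions.

Note

The edgeGram tokenizer yields multiple output tokens per word and across words in the input text, producing token graphs.

Because autocomplete field type mapping definitions and analyzers with synonym mappings only work when used with non-graph-producing tokenizers, you can't use a custom analyzer with the edgeGram tokenizer in the analyzer field for autocomplete field type mapping definitions or analyzers with synonym mappings.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `edgeGram`.
`minGram`	integer	yes	Number of characters to include in the shortest token created.
`maxGram`	integer	yes	Number of characters to include in the longest token created.

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named edgegramExample. It uses the edgeGram tokenizer to create tokens (searchable terms) between 2 and 7 characters long, starting from the first character on the left side of the text in the message field.

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type edgegramExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select edgeGram from the dropdown and type the value for the following fields:
Field	Value
minGram	`2`
maxGram	`7`
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select edgegramExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: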

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "edgegramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "edgegramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 7,
        "minGram": 2,
        "type": "edgeGram"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for text that begins with tr.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "tr",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

{ _id: 1, message: 'try to siGn-In' },
{ _id: 3, message: 'try to sign-in' }

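Atlas Search returns the documents with _id: 1 and _id: 3 in the results because Atlas Search created a token with the value tr using the edgeGram tokenizer for the documents, which matches the search term. If you index the message field using the standard tokenizer, Atlas Search doesn't return any results for the search term tr.

The following table shows the tokens that the edgeGram tokenizer and, by comparison, the standard tokenizer create for the documents in the results:

Tokenizer	Token Outputs
`standard`	`try`, `to`, `sign`, `in`
`edgeGram`	`tr`, `try`, `try{SPACE}`, `try t`, `try to`, `try to{SPACE}`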
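
To make the edgeGram row above concrete, the following is a minimal JavaScript sketch that reproduces it. The helper name edgeGramTokens is hypothetical and introduced only for illustration. The sketch approximates the behavior described in this section (prefixes of the whole input between minGram and maxGram characters) and is not how Atlas Search implements the tokenizer.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.
// Emits every prefix of the input whose length is between minGram and maxGram.
function edgeGramTokens(text, minGram, maxGram) {
  const tokens = [];
  const longest = Math.min(maxGram, text.length);
  for (let size = minGram; size <= longest; size++) {
    tokens.push(text.slice(0, size));
  }
  return tokens;
}

// Same settings as the edgegramExample analyzer: minGram 2, maxGram 7.
const edgeGramOutput = edgeGramTokens("try to sign-in", 2, 7);
console.log(JSON.stringify(edgeGramOutput));
// ["tr","try","try ","try t","try to","try to "]
```

In this sketch the query term tr matches because it equals the first token. With minGram set to 3, the sketch would no longer produce tr.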

keyword

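The keyword tokenizer tokenizes the entire input as a single token. Atlas Search doesn't index string fields that exceed 32766 characters using the keyword tokenizer.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `keyword`.

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named keywordExample. It uses the keyword tokenizer to create a single token (searchable term) from the entire field value.

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type keywordExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select keywordExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: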

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "keywordExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "keywordExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for the phrase try to sign-in.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "try to sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

{ _id: 3, message: 'try to sign-in' }

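Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value try to sign-in using the keyword tokenizer for the document, which matches the search term. If you index the message field using the standard tokenizer, Atlas Search returns the documents with _id: 1, _id: 2, and _id: 3 for the search term try to sign-in because each document contains some of the tokens that the standard tokenizer creates.

The following table shows the tokens that the keyword tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`try`, `to`, `sign`, `in`
`keyword`	`try to sign-in`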
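
The matching behavior described above can be sketched in a few lines of JavaScript. The helper names keywordTokens and matchKeyword are hypothetical, and the sample documents contain only the message values quoted in this section. The sketch assumes no lowercasing takes place, because the keywordExample analyzer defines no token filters.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.
// The keyword tokenizer keeps the whole input as one token.
function keywordTokens(text) {
  return [text];
}

// Sample message values quoted from the results shown in this section.
const keywordSamples = [
  { _id: 1, message: "try to siGn-In" },
  { _id: 2, message: "do not forget to SIGN-IN" },
  { _id: 3, message: "try to sign-in" },
];

// A document matches only if its single token equals the query's single token.
function matchKeyword(docs, query) {
  const [queryToken] = keywordTokens(query);
  return docs
    .filter((doc) => keywordTokens(doc.message).includes(queryToken))
    .map((doc) => doc._id);
}

console.log(JSON.stringify(matchKeyword(keywordSamples, "try to sign-in"))); // [3]
```

The document with _id: 1 does not match in this sketch because its single token differs from the query in letter case.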

nGram

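The nGram tokenizer tokenizes input into text chunks, or "n-grams", of given sizes. You can't use a custom analyzer with the nGram tokenizer in the analyzer field for synonym or autocomplete field mapping definitions.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `nGram`.
`minGram`	integer	yes	Number of characters to include in the shortest token created.
`maxGram`	integer	yes	Number of characters to include in the longest token created.

Example

The following index definition indexes the title field in the minutes collection using a custom analyzer named ngramExample. It uses the nGram tokenizer to create tokens (searchable terms) between 4 and 6 characters long in the title field.

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type ngramExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select nGram from the dropdown and type the value for the following fields:
Field	Value
minGram	`4`
maxGram	`6`
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select ngramExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: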

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "title": {
        "analyzer": "ngramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "ngramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 6,
        "minGram": 4,
        "type": "nGram"
      }
    }
  ]
}

The following query searches the title field in the minutes collection for the term week.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "week",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

{ _id: 1, title: "The team's weekly meeting" }

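Atlas Search returns the document with _id: 1 in the results because Atlas Search created a token with the value week using the nGram tokenizer for the document, which matches the search term. If you index the title field using the standard or edgeGram tokenizer, Atlas Search doesn't return any results for the search term week.

The following table shows the tokens that the nGram tokenizer and, by comparison, the standard and edgeGram tokenizers create for the document with _id: 1:

Tokenizer	Token Outputs
`standard`	`The`, `team's`, `weekly`, `meeting`
`edgeGram`	`The{SPACE}`, `The t`, `The te`
`nGram`	`The{SPACE}`, `The t`, `The te`, `he t`, ... , `week`, `weekl`, `weekly`, `eekl`, ..., `eetin`, `eeting`, `etin`, `eting`, `ting`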
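
The nGram row above is abbreviated. The following JavaScript sketch generates the full list with a sliding window so that the elided tokens can be inspected. The helper name nGramTokens is hypothetical, and the sketch is an approximation for illustration rather than Atlas Search's implementation.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.
// For each start position, emit substrings of length minGram..maxGram.
function nGramTokens(text, minGram, maxGram) {
  const tokens = [];
  for (let start = 0; start + minGram <= text.length; start++) {
    for (
      let size = minGram;
      size <= maxGram && start + size <= text.length;
      size++
    ) {
      tokens.push(text.slice(start, start + size));
    }
  }
  return tokens;
}

// Same settings as the ngramExample analyzer: minGram 4, maxGram 6.
const nGramOutput = nGramTokens("The team's weekly meeting", 4, 6);
console.log(JSON.stringify(nGramOutput.slice(0, 4))); // ["The ","The t","The te","he t"]
console.log(nGramOutput.includes("week")); // true
console.log(nGramOutput[nGramOutput.length - 1]); // ting
```

Because the window slides over every position, week appears as a token even though it sits in the middle of the title, which is why the edgeGram tokenizer with the same sizes would not match it.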

regexCaptureGroup

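The regexCaptureGroup tokenizer matches a Java regular expression pattern to extract tokens.

Tip

To learn more about the Java regular expression syntax, see the Pattern class in the Java documentation.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `regexCaptureGroup`.
`pattern`	string	yes	Regular expression to match against.
`group`	integer	yes	Index of the character group within the matching expression to extract into tokens. Use `0` to extract all character groups.

Example

The following index definition indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named phoneNumberExtractor. It uses the following:

The mapping character filter to remove the parentheses around the first three digits and replace all spaces and periods with dashes
The regexCaptureGroup tokenizer to create a single token from the first US-formatted phone number present in the text input

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type phoneNumberExtractor in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following characters in the Original field, one at a time, and leave the corresponding Replacement field empty.
Original	Replacement
`-`	
`.`	
`(`	
`)`	
Click Add character filter.
Expand Tokenizer if it's collapsed.
Select regexCaptureGroup from the dropdown and type the value for the following fields:
Field	Value
pattern	`^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$`
group	`0`
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select phoneNumberExtractor from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: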

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "phoneNumberExtractor",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [
        {
          "mappings": {
            " ": "-",
            "(": "",
            ")": "",
            ".": "-"
          },
          "type": "mapping"
        }
      ],
      "name": "phoneNumberExtractor",
      "tokenFilters": [],
      "tokenizer": {
        "group": 0,
        "pattern": "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$",
        "type": "regexCaptureGroup"
      }
    }
  ]
}

The following query searches the page_updated_by.phone field in the minutes collection for the phone number 123-456-9870.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "123-456-9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])

{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }

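Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value 123-456-9870 using the regexCaptureGroup tokenizer for the document, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, Atlas Search returns all of the documents for the search term 123-456-9870.

The following table shows the tokens that the regexCaptureGroup tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`123`, `456.9870`
`regexCaptureGroup`	`123-456-9870`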
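
The two stages of the phoneNumberExtractor analyzer can be traced with a short JavaScript sketch. The helper names applyMappingCharFilter and regexCaptureGroupTokens are hypothetical. Atlas Search uses Java regular expressions, while this sketch relies on JavaScript's RegExp, which should behave the same for this ASCII-only pattern. Treat it as an approximation, not as the actual implementation.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.

// Character mappings copied from the JSON index definition above.
const phoneMappings = new Map([
  [" ", "-"],
  ["(", ""],
  [")", ""],
  [".", "-"],
]);

// Mapping character filter: replace each mapped character, keep the rest.
function applyMappingCharFilter(text, mappings) {
  return Array.from(text, (ch) =>
    mappings.has(ch) ? mappings.get(ch) : ch
  ).join("");
}

// regexCaptureGroup: emit the chosen group of every match.
// Group 0 is the whole match.
function regexCaptureGroupTokens(text, pattern, group) {
  const globalPattern = new RegExp(pattern.source, "g");
  const tokens = [];
  for (const match of text.matchAll(globalPattern)) {
    if (match[group] !== undefined) {
      tokens.push(match[group]);
    }
  }
  return tokens;
}

const phonePattern = /^\b\d{3}[-]?\d{3}[-]?\d{4}\b$/;
const filteredPhone = applyMappingCharFilter("(123).456.9870", phoneMappings);
console.log(filteredPhone); // 123-456-9870
console.log(JSON.stringify(regexCaptureGroupTokens(filteredPhone, phonePattern, 0)));
// ["123-456-9870"]
```

Because the pattern is anchored with ^ and $, input that contains anything besides a single phone number produces no token in this sketch.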

regexSplit

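The regexSplit tokenizer splits tokens with a delimiter based on a Java regular expression.

Tip

To learn more about the Java regular expression syntax, see the Pattern class in the Java documentation.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `regexSplit`.
`pattern`	string	yes	Regular expression to match against.

Example

The following index definition indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named dashDotSpaceSplitter. It uses the regexSplit tokenizer to create tokens (searchable terms) by splitting the page_updated_by.phone field on one or more hyphens, periods, and spaces.

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type dashDotSpaceSplitter in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select regexSplit from the dropdown and type the value for the following field:
Field	Value
pattern	`[-. ]+`
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select dashDotSpaceSplitter from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: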

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "dashDotSpaceSplitter",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "dashDotSpaceSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[-. ]+",
        "type": "regexSplit"
      }
    }
  ]
}

The following query searches the page_updated_by.phone field in the minutes collection for the digits 9870.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])

{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }

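Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value 9870 using the regexSplit tokenizer for the document, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, Atlas Search doesn't return any results for the search term 9870.

The following table shows the tokens that the regexSplit tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`123`, `456.9870`
`regexSplit`	`(123)`, `456`, `9870`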
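
The regexSplit row above can be reproduced with JavaScript's built-in String.prototype.split. The helper name regexSplitTokens is hypothetical, and the sketch only approximates the tokenizer for illustration; Atlas Search itself evaluates the pattern as a Java regular expression.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.
// Split on every match of the delimiter pattern and drop empty pieces.
function regexSplitTokens(text, pattern) {
  return text.split(pattern).filter((token) => token.length > 0);
}

// Same pattern as the dashDotSpaceSplitter analyzer.
console.log(JSON.stringify(regexSplitTokens("(123).456.9870", /[-. ]+/)));
// ["(123)","456","9870"]
```

The parentheses stay in the first token because the pattern only treats hyphens, periods, and spaces as delimiters.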

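standard

The standard tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `standard`.
`maxTokenLength`	integer	no	Maximum length for a single token. Tokens greater than this length are split at `maxTokenLength` into multiple tokens. Default: `255`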
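
The effect of maxTokenLength, as described in the table above, can be pictured with a small JavaScript sketch. The helper name splitAtMaxTokenLength is hypothetical, and the value 4 is chosen only to keep the example short (the default is 255). This is one reading of the description and not a statement about Atlas Search internals.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.
// Cut a token that is longer than maxTokenLength into consecutive pieces.
function splitAtMaxTokenLength(token, maxTokenLength) {
  const pieces = [];
  for (let start = 0; start < token.length; start += maxTokenLength) {
    pieces.push(token.slice(start, start + maxTokenLength));
  }
  return pieces;
}

console.log(JSON.stringify(splitAtMaxTokenLength("signature", 4)));
// ["sign","atur","e"]
console.log(JSON.stringify(splitAtMaxTokenLength("signature", 255)));
// ["signature"]
```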

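Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named standardExample. It uses the standard tokenizer to create tokens (searchable terms) based on word break rules.

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type standardExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select standardExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: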

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "standardExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "standardExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "standard"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for the term signature.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "signature",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

{ _id: 4, message: 'write down your signature or phone №' }

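Atlas Search returns the document with _id: 4 because Atlas Search created a token with the value signature using the standard tokenizer for the document, which matches the search term. If you index the message field using the keyword tokenizer, Atlas Search doesn't return any results for the search term signature.

The following table shows the tokens that the standard tokenizer and, by comparison, the keyword tokenizer create for the document with _id: 4:

Tokenizer	Token Outputs
`standard`	`write`, `down`, `your`, `signature`, `or`, `phone`
`keyword`	`write down your signature or phone №`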

uaxUrlEmail

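The uaxUrlEmail tokenizer tokenizes URLs and email addresses. Although the uaxUrlEmail tokenizer tokenizes based on word break rules from the Unicode Text Segmentation algorithm, we recommend using the uaxUrlEmail tokenizer only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `uaxUrlEmail`.
`maxTokenLength`	integer	no	Maximum number of characters in one token. Default: `255`

Example

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named basicEmailAddressAnalyzer. It uses the uaxUrlEmail tokenizer to create tokens (searchable terms) from URLs and email addresses in the page_updated_by.email field.

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type basicEmailAddressAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select uaxUrlEmail from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
Select page_updated_by.email from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select basicEmailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: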

{
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "email": {
            "analyzer": "basicEmailAddressAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "basicEmailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
}

The following query searches the page_updated_by.email field in the minutes collection for the email address lewinsky@example.com.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }

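Atlas Search returns the document with _id: 3 in the results because Atlas Search created a token with the value lewinsky@example.com using the uaxUrlEmail tokenizer for the document, which matches the search term. If you index the page_updated_by.email field using the standard tokenizer, Atlas Search returns all of the documents for the search term lewinsky@example.com.

The following table shows the tokens that the uaxUrlEmail tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`lewinsky`, `example.com`
`uaxUrlEmail`	`lewinsky@example.com`

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named emailAddressAnalyzer. It uses the following:

The autocomplete type with an edgeGram tokenization strategy
The uaxUrlEmail tokenizer to create tokens (searchable terms) from URLs and email addresses

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type emailAddressAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select uaxUrlEmail from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
Select page_updated_by.email from the Field Name dropdown and Autocomplete from the Data Type dropdown.
In the properties section for the data type, select emailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: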

{
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "email": {
            "analyzer": "emailAddressAnalyzer",
            "tokenization": "edgeGram",
            "type": "autocomplete"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "emailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
}

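The following query uses the autocomplete operator to search the page_updated_by.email field in the minutes collection for the email address lewinsky@example.com.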

db.minutes.aggregate([
  {
    "$search": {
      "autocomplete": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

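{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }

The following table shows the tokens that the uaxUrlEmail tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3 when the field is indexed as the autocomplete type:

Tokenizer	Atlas Search Field Type	Token Outputs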
`standard`	`autocomplete` `edgeGram`	`le`, `lew`, `lewi`, `lewin`, `lewins`, `lewinsk`, `lewinsky`, `lewinsky@`, `lewinsky`, `ex`, `exa`, `exam`, `examp`, `exampl`, `example`, `example.`, `example.c`, `example.co`, `example.com`
`uaxUrlEmail`	`autocomplete` `edgeGram`	`le`, `lew`, `lewi`, `lewin`, `lewins`, `lewinsk`, `lewinsky`, `lewinsky@`, `lewinsky@e`, `lewinsky@ex`, `lewinsky@exa`, `lewinsky@exam`, `lewinsky@examp`, `lewinsky@exampl`
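
The uaxUrlEmail row above can be reproduced by taking prefixes of the single email token. In the following JavaScript sketch, the helper name prefixGrams is hypothetical, and the bounds 2 and 15 are assumptions chosen because they reproduce the row; the index definition in this section does not set them explicitly. The sketch is an approximation for illustration only.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.
// Emit prefixes of one token, from minGrams to maxGrams characters long.
function prefixGrams(token, minGrams, maxGrams) {
  const grams = [];
  const longest = Math.min(maxGrams, token.length);
  for (let size = minGrams; size <= longest; size++) {
    grams.push(token.slice(0, size));
  }
  return grams;
}

// uaxUrlEmail keeps the address as one token, so prefixes span the "@".
// The bounds 2 and 15 are assumed, not taken from the index definition.
const emailGrams = prefixGrams("lewinsky@example.com", 2, 15);
console.log(emailGrams[0]); // le
console.log(emailGrams[emailGrams.length - 1]); // lewinsky@exampl
console.log(emailGrams.includes("exam")); // false
```

Because every gram starts at the beginning of the address, a fragment from the middle such as exam is not among the tokens in this sketch, whereas the standard row above does contain it.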

whitespace

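The whitespace tokenizer tokenizes based on occurrences of whitespace between words.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `whitespace`.
`maxTokenLength`	integer	no	Maximum length for a single token. Tokens greater than this length are split at `maxTokenLength` into multiple tokens. Default: `255`

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named whitespaceExample. It uses the whitespace tokenizer to create tokens (searchable terms) by splitting the message field on whitespace.

To create the index in the Visual Editor:

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type whitespaceExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select whitespaceExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

To create the index in the JSON editor, replace the default index definition with the following: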

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "whitespaceExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "whitespaceExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
}

The following query searches the message field in the minutes collection for the term SIGN-IN.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "SIGN-IN",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

{ _id: 2, message: 'do not forget to SIGN-IN' }

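Atlas Search returns the document with _id: 2 in the results because Atlas Search created a token with the value SIGN-IN using the whitespace tokenizer for the document, which matches the search term. If you index the message field using the standard tokenizer, Atlas Search returns the documents with _id: 1, _id: 2, and _id: 3 for the search term SIGN-IN.

The following table shows the tokens that the whitespace tokenizer and, by comparison, the standard tokenizer create for the document with _id: 2:

Tokenizer	Token Outputs
`standard`	`do`, `not`, `forget`, `to`, `sign`, `in`
`whitespace`	`do`, `not`, `forget`, `to`, `SIGN-IN`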
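
The whitespace row and the single-document result above can be sketched in JavaScript as follows. The helper names whitespaceTokens and matchWhitespace are hypothetical, and the sample documents contain only message values quoted in this section. The sketch assumes case-sensitive matching because the whitespaceExample analyzer defines no token filters. It is an approximation, not Atlas Search's implementation.

```javascript
// Illustrative approximation only; not Atlas Search's implementation.
// Split on runs of whitespace and keep everything else untouched.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// Sample message values quoted from the results shown in this section.
const whitespaceSamples = [
  { _id: 1, message: "try to siGn-In" },
  { _id: 2, message: "do not forget to SIGN-IN" },
  { _id: 3, message: "try to sign-in" },
];

// A document matches if it shares at least one token with the query.
function matchWhitespace(docs, query) {
  const queryTokens = whitespaceTokens(query);
  return docs
    .filter((doc) => {
      const docTokens = whitespaceTokens(doc.message);
      return queryTokens.some((token) => docTokens.includes(token));
    })
    .map((doc) => doc._id);
}

console.log(JSON.stringify(whitespaceTokens("do not forget to SIGN-IN")));
// ["do","not","forget","to","SIGN-IN"]
console.log(JSON.stringify(matchWhitespace(whitespaceSamples, "SIGN-IN"))); // [2]
```

The hyphen and the letter case are preserved in the token, so only the document whose message contains SIGN-IN in exactly that form matches.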
