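Character Filters

Character filters examine text one character at a time and perform filtering operations. Every character filter requires a type field, and some character filters also accept additional options.

Syntax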

"charFilters": [
  {
    "type": "<filter-type>",
    "<additional-option>": <value>
  }
]

Character Filter Types

Atlas Search supports the following types of character filters:

htmlStrip
icuNormalize
mapping
persian
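
As a rough illustration of the rule that every character filter needs a type, the following sketch checks a charFilters array before it is pasted into an index definition. The function name checkCharFilters and its messages are hypothetical helpers written for this page, not part of Atlas Search; Atlas performs its own validation when you save the index.

```javascript
// Hypothetical pre-flight check for a "charFilters" array.
// Only the four character filter type names documented on this page are treated as known.
const KNOWN_TYPES = new Set(["htmlStrip", "icuNormalize", "mapping", "persian"]);

function checkCharFilters(charFilters) {
  const problems = [];
  charFilters.forEach((filter, i) => {
    if (typeof filter.type !== "string") {
      problems.push(`charFilters[${i}]: missing "type"`);
    } else if (!KNOWN_TYPES.has(filter.type)) {
      problems.push(`charFilters[${i}]: unknown type "${filter.type}"`);
    }
  });
  return problems;
}

const okFilters = [{ type: "htmlStrip", ignoredTags: ["a"] }, { type: "icuNormalize" }];
const badFilters = [{ ignoredTags: ["a"] }, { type: "stripHtml" }];

console.log(checkCharFilters(okFilters));  // no problems reported
console.log(checkCharFilters(badFilters)); // one problem per entry
```

The check is deliberately minimal: it does not validate the per-type options such as ignoredTags or mappings.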

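The following sample index definitions and queries use the sample collection named minutes. To follow along with these examples, load the minutes collection on your cluster and navigate to the Create a Search Index page in the Atlas UI by following the steps in the Create an Atlas Search Index tutorial. Then select the minutes collection as your data source and follow the example procedure to create an index from the Visual Editor or the JSON editor.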

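htmlStrip

The htmlStrip character filter strips out HTML constructs.

Attributes

The htmlStrip character filter has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this character filter type. Value must be `htmlStrip`.
`ignoredTags`	array of strings	yes	List that contains the HTML tags to exclude from filtering.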

Example

The following index definition example indexes the text.en_US field in the minutes collection using a custom analyzer named htmlStrippingAnalyzer. The custom analyzer specifies the following:

Remove all HTML tags from the text except the a tag using the htmlStrip character filter.
Generate tokens based on word break rules from the Unicode Text Segmentation algorithm using the standard tokenizer.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type htmlStrippingAnalyzer in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select htmlStrip from the dropdown and type a in the ignoredTags field.
Click Add character filter.
Expand Tokenizer if it's collapsed and select standard from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.en_US nested field.
Select text.en_US (nested) from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select htmlStrippingAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then click Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "dynamic": true,
        "fields": {
          "en_US": {
            "analyzer": "htmlStrippingAnalyzer",
            "type": "string"
          }
        }
      }
    }
  },
  "analyzers": [{
    "name": "htmlStrippingAnalyzer",
    "charFilters": [{
      "type": "htmlStrip",
      "ignoredTags": ["a"]
    }],
    "tokenizer": {
      "type": "standard"
    },
    "tokenFilters": []
  }]
}

The following query searches the text.en_US field in the minutes collection for occurrences of the string head.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "head",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])

[
  {
    _id: 2,
    text: { en_US: "The head of the sales department spoke first." }
  },
  {
    _id: 3,
    text: {
      en_US: "<body>We'll head out to the conference room by noon.</body>"
    }
  }
]

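Atlas Search doesn't return the document with _id: 1 because the string head is part of the HTML tag <head>. The document with _id: 3 contains HTML tags, but the string head appears elsewhere in the text, so the document is a match. The following table shows the tokens that Atlas Search generates for the text.en_US field values in the documents with _id: 1, _id: 2, and _id: 3 in the minutes collection using the htmlStrippingAnalyzer.

Document ID	Output Tokens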
`_id: 1`	`This`, `page`, `deals`, `with`, `department`, `meetings`
`_id: 2`	`The`, `head`, `of`, `the`, `sales`, `department`, `spoke`, `first`
`_id: 3`	`We'll`, `head`, `out`, `to`, `the`, `conference`, `room`, `by`, `noon`
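
To make the matching behavior concrete, here is a rough JavaScript emulation of the two analyzer stages: tag stripping, then word tokenization. It is a simplification and not the Lucene implementation; it handles only simple, well-formed tags and ignores entities, comments, and scripts. The function names htmlStrip and tokenize are hypothetical, and the value of doc1 is an assumed sample chosen to be consistent with the token table, since the page does not show that document's text.

```javascript
// Simplified stand-in for the htmlStrip character filter.
// Tags whose names appear in ignoredTags are left in the text; all others become a space.
function htmlStrip(text, ignoredTags = []) {
  const keep = new Set(ignoredTags.map((t) => t.toLowerCase()));
  return text.replace(/<\/?([A-Za-z][A-Za-z0-9]*)[^>]*>/g, (tag, name) =>
    keep.has(name.toLowerCase()) ? tag : " "
  );
}

// Very rough stand-in for the standard tokenizer: runs of letters or digits,
// allowing internal apostrophes so that "We'll" stays a single token.
function tokenize(text) {
  return text.match(/[\p{L}\p{N}]+(?:'[\p{L}\p{N}]+)*/gu) ?? [];
}

// doc1 is an assumed value; doc3 is the value shown in the query results.
const doc1 = "<head> This page deals with department meetings.</head>";
const doc3 = "<body>We'll head out to the conference room by noon.</body>";

const tokens1 = tokenize(htmlStrip(doc1, ["a"]));
const tokens3 = tokenize(htmlStrip(doc3, ["a"]));

console.log(tokens1); // contains no "head" token, so a query for head cannot match
console.log(tokens3); // contains "head" from the sentence itself
```

Because the character filter runs before the tokenizer, the word head inside the <head> tag never becomes a token, while the same word in running text does.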

icuNormalize

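The icuNormalize character filter normalizes text with the ICU Normalizer. It is based on Lucene's ICUNormalizer2CharFilter.

Attributes

The icuNormalize character filter has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this character filter type. Value must be `icuNormalize`.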

Example

The following index definition example indexes the message field in the minutes collection using a custom analyzer named normalizingAnalyzer. The custom analyzer specifies the following:

Use the icuNormalize character filter to normalize the text in the message field value.
Tokenize the words in the field based on occurrences of whitespace between words using the whitespace tokenizer.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type normalizingAnalyzer in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select icuNormalize from the dropdown and click Add character filter.
Expand Tokenizer if it's collapsed and select whitespace from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select normalizingAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then click Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "fields": {
      "message": {
        "type": "string",
        "analyzer": "normalizingAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [
        {
          "type": "icuNormalize"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      },
      "tokenFilters": []
    }
  ]
}

The following query searches the message field in the minutes collection for occurrences of the string no (as in number).

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "no",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1,
      "title": 1
    }
  }
])

[
  {
    _id: 4,
    title: 'The daily huddle on tHe StandUpApp2',
    message: 'write down your signature or phone №'
  }
]

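Atlas Search matched the document with _id: 4 to the query term no because it used the icuNormalize character filter to normalize the numero sign № in the field, producing the token no for that typographic abbreviation of the word "number". Atlas Search generates the following tokens for the message field value in the document with _id: 4 using the normalizingAnalyzer:

Document ID	Output Tokens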
`_id: 4`	`write`, `down`, `your`, `signature`, `or`, `phone`, `no`
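
The normalization step can be approximated in plain JavaScript. This sketch assumes the filter behaves like Unicode NFKC normalization followed by case folding, which is consistent with the no token in the table above; String.prototype.normalize("NFKC") plus toLowerCase() is only an approximation of that behavior, and the function names are hypothetical.

```javascript
// Approximation of the icuNormalize character filter:
// compatibility normalization (NFKC) followed by lowercasing as a stand-in for case folding.
function icuNormalizeApprox(text) {
  return text.normalize("NFKC").toLowerCase();
}

// Stand-in for the whitespace tokenizer: split on runs of whitespace.
function whitespaceTokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

const message = "write down your signature or phone №";
const tokens = whitespaceTokenize(icuNormalizeApprox(message));

console.log(tokens); // the numero sign № is emitted as the token "no"
```

NFKC maps the single code point № (U+2116) to the two letters No, which is why a query for no can match a field that never contains those letters literally.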

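mapping

The mapping character filter applies user-specified normalization mappings to characters. It is based on Lucene's MappingCharFilter.

Attributes

The mapping character filter has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this character filter type. Value must be `mapping`.
`mappings`	object	yes	Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format `<original> : <replacement>`.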

Example

The following index definition example indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named mappingAnalyzer. The custom analyzer specifies the following:

Use the mapping character filter to remove instances of hyphen (-), dot (.), open parenthesis ( ( ), close parenthesis ( ) ), and spaces in the phone field.
Use the keyword tokenizer to tokenize the entire input as a single token.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type mappingAnalyzer in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following characters in the Original field, one at a time, and leave the corresponding Replacement field empty.
Original	Replacement
-	(empty)
.	(empty)
(	(empty)
)	(empty)
{SPACE}	(empty)
Click Add character filter.
Expand Tokenizer if it's collapsed and select keyword from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone (nested) field.
Select page_updated_by.phone (nested) from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select mappingAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then click Save Changes.

Replace the default index definition with the following:

{
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "mappingAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "mappings": {
            "-": "",
            ".": "",
            "(": "",
            ")": "",
            " ": ""
          },
          "type": "mapping"
        }
      ],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}

The following query searches the page_updated_by.phone field for the string 1234567890.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "1234567890",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1,
      "page_updated_by.last_name": 1
    }
  }
])

[
  {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', phone: '(123)-456-7890' }
  }
]

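The Atlas Search results contain one document in which the digits of the phone string match the query string. Atlas Search matched the document even though the query doesn't include the parentheses around the area code or the hyphens between the digits, because it used the mapping character filter to remove those characters and create a single token for the field value. Specifically, Atlas Search generated the following token for the phone field in the document with _id: 1:

Document ID	Output Tokens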
`_id: 1`	`1234567890`

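Atlas Search would also match the document with _id: 1 for searches such as (123)-456-7890, 123-456-7890, and 123.456.7890. For fields indexed as described in How to Index String Fields, Atlas Search also analyzes the search query terms using the index analyzer (or the searchAnalyzer, if one is specified). The following table shows the tokens that Atlas Search creates by removing instances of hyphen (-), dot (.), open parenthesis ( ( ), close parenthesis ( ) ), and spaces from the query term:

Query Term	Output Tokens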
`(123)-456-7890`	`1234567890`
`123-456-7890`	`1234567890`
`123.456.7890`	`1234567890`
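
The effect of the mappings object on both the indexed value and the query terms can be sketched as a simple character substitution. The function applyMappings is a hypothetical helper, and preferring the longest matching key is an assumption about how overlapping mappings are resolved; with the single-character keys used here, the order makes no difference.

```javascript
// Hypothetical emulation of a mapping character filter.
// Each key found in the text is replaced with its mapped value; longer keys are tried first.
function applyMappings(text, mappings) {
  const keys = Object.keys(mappings)
    .filter((k) => k.length > 0)
    .sort((a, b) => b.length - a.length);
  let out = "";
  let i = 0;
  while (i < text.length) {
    const key = keys.find((k) => text.startsWith(k, i));
    if (key !== undefined) {
      out += mappings[key];
      i += key.length;
    } else {
      out += text[i];
      i += 1;
    }
  }
  return out;
}

// Same mappings as in the index definition on this page.
const phoneMappings = { "-": "", ".": "", "(": "", ")": "", " ": "" };

// The keyword tokenizer emits the whole filtered input as one token.
const inputs = ["(123)-456-7890", "123-456-7890", "123.456.7890"];
const tokens = inputs.map((s) => applyMappings(s, phoneMappings));

console.log(tokens); // every input collapses to the same token
```

Since the indexed value and the query term pass through the same filter, differently punctuated forms of one phone number end up as the same token and therefore match.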

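persian

The persian character filter replaces instances of the zero-width non-joiner with the space character. This character filter is based on Lucene's PersianCharFilter.

Attributes

The persian character filter has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this character filter type. Value must be `persian`.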

Example

The following index definition example indexes the text.fa_IR field in the minutes collection using a custom analyzer named persianCharacterIndex. The custom analyzer specifies the following:

Apply the persian character filter to replace non-printing characters in the field value with the space character.
Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words.

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type persianCharacterIndex in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select persian from the dropdown and click Add character filter.
Expand Tokenizer if it's collapsed and select whitespace from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the text.fa_IR (nested) field.
Select text.fa_IR (nested) from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select persianCharacterIndex from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then click Save Changes.

Replace the default index definition with the following:

{
  "analyzer": "lucene.standard",
  "mappings": {
    "fields": {
      "text": {
        "dynamic": true,
        "fields": {
          "fa_IR": {
            "analyzer": "persianCharacterIndex",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [
        {
          "type": "persian"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
}

The following query searches the text.fa_IR field for the term صحبت.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "صحبت",
        "path": "text.fa_IR"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.fa_IR": 1,
      "page_updated_by.last_name": 1
    }
  }
])

[
  {
    _id: 2,
    page_updated_by: { last_name: 'OHRBACH' },
    text: { fa_IR: 'ابتدا رئیس بخش فروش صحبت کرد' }
  }
]

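Atlas Search returns the document with _id: 2, which contains the query term. Atlas Search matches the query term to the document by first replacing instances of the zero-width non-joiner with the space character and then creating an individual token for each word in the field value based on occurrences of whitespace between words. Specifically, Atlas Search generates the following tokens for the document with _id: 2:

Document ID	Output Tokens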
`_id: 2`	`ابتدا`, `رئیس`, `بخش`, `فروش`, `صحبت`, `کرد`
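
The role of the zero-width non-joiner (U+200C) is easier to see on a value that actually contains one. In this sketch the sample string is a hypothetical value, not taken from the minutes collection, and the function names are illustrative stand-ins for the persian character filter and the whitespace tokenizer.

```javascript
// U+200C ZERO WIDTH NON-JOINER is invisible and is not treated as whitespace by /\s/.
const ZWNJ = "\u200C";

// Stand-in for the persian character filter: replace every ZWNJ with a space.
function persianCharFilter(text) {
  return text.split(ZWNJ).join(" ");
}

// Stand-in for the whitespace tokenizer.
function whitespaceTokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

// Hypothetical sample: two word parts joined by a ZWNJ instead of a space.
const sample = "می" + ZWNJ + "خواهم";

const withoutFilter = whitespaceTokenize(sample);
const withFilter = whitespaceTokenize(persianCharFilter(sample));

console.log(withoutFilter.length); // 1: the whitespace tokenizer alone does not split on ZWNJ
console.log(withFilter.length);    // 2: the filter turns the ZWNJ into a token boundary
```

Without the character filter, the whitespace tokenizer would keep the ZWNJ-joined parts together as one token, so a query for either part alone would not match.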

Learn More

To see additional index definitions and queries that use the mapping character filter, see the examples on the following reference pages:

shingle token filter
regexCaptureGroup tokenizer
