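Language Analyzers

Use language-specific analyzers to create indexes tailored to a particular language. Each language analyzer has built-in stop words and word divisions based on that language's usage patterns.

MongoDB Search provides the following language analyzers: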

`lucene.arabic`	`lucene.armenian`	`lucene.basque`	`lucene.bengali`
`lucene.brazilian`	`lucene.bulgarian`	`lucene.catalan`	`lucene.chinese`
`lucene.cjk` ¹	`lucene.czech`	`lucene.danish`	`lucene.dutch`
`lucene.english`	`lucene.finnish`	`lucene.french`	`lucene.galician`
`lucene.german`	`lucene.greek`	`lucene.hindi`	`lucene.hungarian`
`lucene.indonesian`	`lucene.irish`	`lucene.italian`	`lucene.japanese`
`lucene.korean`	`lucene.kuromoji` ²	`lucene.latvian`	`lucene.lithuanian`
`lucene.morfologik` ³	`lucene.nori` ⁴	`lucene.norwegian`	`lucene.persian`
`lucene.polish`	`lucene.portuguese`	`lucene.romanian`	`lucene.russian`
`lucene.smartcn` ⁵	`lucene.sorani`	`lucene.spanish`	`lucene.swedish`
`lucene.thai`	`lucene.turkish`	`lucene.ukrainian`

¹ cjk is an analyzer for Chinese, Japanese, and Korean.

² kuromoji is a Japanese analyzer.

³ morfologik is a Polish analyzer.

⁴ nori is a Korean analyzer.

⁵ smartcn is a Chinese analyzer.
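When an application stores text in several languages, it can help to pick the analyzer name from a language code. The following is a minimal sketch, not part of MongoDB Search: the ANALYZER_BY_LANGUAGE table, its two-letter keys, and the analyzerFor helper are assumptions made for illustration. Only the analyzer names come from the list above.

```javascript
// Hypothetical lookup table. The two-letter language codes are an assumption
// of this sketch; the analyzer names are taken from the list above.
const ANALYZER_BY_LANGUAGE = new Map([
  ["en", "lucene.english"],
  ["fr", "lucene.french"],
  ["it", "lucene.italian"],
  ["es", "lucene.spanish"],
  ["de", "lucene.german"],
]);

// Return the analyzer for a language code, or a fallback when the table has
// no entry for that language.
function analyzerFor(languageCode, fallback = "lucene.standard") {
  return ANALYZER_BY_LANGUAGE.get(languageCode) ?? fallback;
}

console.log(analyzerFor("fr")); // lucene.french
console.log(analyzerFor("he")); // lucene.standard (no entry in the table)
```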

Example

Consider a collection named cars that contains the following documents:

{
  "_id": 1,
  "subject": {
    "en": "It is better to equip our cars to understand the causes of the accident.",
    "fr": "Mieux équiper nos voitures pour comprendre les causes d'un accident.",
    "he": "עדיף לצייד את המכוניות שלנו כדי להבין את הגורמים לתאונה."
  }
}

{
  "_id": 2,
  "subject": {
    "en": "The best time to do this is immediately after you've filled up with fuel",
    "fr": "Le meilleur moment pour le faire c'est immédiatement après que vous aurez fait le plein de carburant.",
    "he": "הזמן הטוב ביותר לעשות זאת הוא מיד לאחר שמילאת דלק."
  }
}

Built-in Language Analyzer Example

The following example index definition specifies an index on the subject.fr field that uses the french analyzer.

{
  "mappings": {
    "fields": {
      "subject": {
        "fields": {
          "fr": {
            "analyzer": "lucene.french",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  }
}
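If several language sub-fields need their own analyzer, a definition of the same shape can be generated instead of written by hand. This is a sketch only: buildLanguageIndex is a hypothetical helper, and indexing subject.en with lucene.english is an assumption added for illustration.

```javascript
// Hypothetical helper: build a static-mapping index definition that assigns
// one analyzer to each language sub-field of a parent document field.
function buildLanguageIndex(parentField, analyzersByField) {
  const subFields = {};
  for (const [name, analyzer] of Object.entries(analyzersByField)) {
    subFields[name] = { analyzer, type: "string" };
  }
  return {
    mappings: {
      fields: {
        [parentField]: { fields: subFields, type: "document" },
      },
    },
  };
}

const definition = buildLanguageIndex("subject", {
  fr: "lucene.french",
  en: "lucene.english", // assumption: not part of the example above
});

console.log(JSON.stringify(definition, null, 2));
```

In mongosh, a definition like this could then be passed to db.cars.createSearchIndex() together with an index name.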

The following query searches the subject.fr field for the string pour.

db.cars.aggregate([
  {
    $search: {
      "text": {
        "query": "pour",
        "path": "subject.fr"
      }
    }
  },
  {
    $project: {
      "_id": 0,
      "subject.fr": 1
    }
  }
])

With the french analyzer, the preceding query returns no results because pour is a built-in stop word. With the standard analyzer, the same query returns both documents.

The following query searches the subject.fr field for the string carburant.
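The effect of a stop word can be illustrated with a toy analyzer. The sketch below is not how lucene.french is implemented: the tokenizer is a simple regular expression, the stop word set is a small assumed sample, and stemming is left out entirely. It only shows why a query consisting of a stop word has nothing left to match.

```javascript
// Small assumed sample of French stop words. Only "pour" is confirmed by the
// text above; the real built-in list is larger.
const FRENCH_STOP_WORDS = new Set(["le", "les", "de", "un", "pour", "que"]);

// Toy tokenizer: lowercase the text and keep runs of letters and digits.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
}

// Toy analyzer: tokenize, then drop any token found in the stop word set.
function analyze(text, stopWords = new Set()) {
  return tokenize(text).filter((token) => !stopWords.has(token));
}

const doc = "Mieux équiper nos voitures pour comprendre les causes d'un accident.";

// The query text goes through the same analysis as the indexed text.
console.log(analyze("pour", FRENCH_STOP_WORDS)); // []
console.log(analyze(doc, FRENCH_STOP_WORDS).includes("pour")); // false
console.log(analyze(doc).includes("pour")); // true
```

Without a stop word list, which is closer to what the standard analyzer does, the token pour survives in both the document and the query, so the query can match.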

db.cars.aggregate([
  {
    $search: {
      "text": {
        "query": "carburant",
        "path": "subject.fr"
      }
    }
  },
  {
    $project: {
      "_id": 0,
      "subject.fr": 1
    }
  }
])

{ subject: { fr: "Le meilleur moment pour le faire c'est immédiatement après que vous aurez fait le plein de carburant." } }

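MongoDB Search returns the document with _id: 2 in the results because the query matches a token that the lucene.french analyzer created for the document. The lucene.french analyzer creates the following tokens for the subject.fr field in the document with _id: 2: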

`meileu`	`moment`	`fair`
`est`	`imediat`	`aprè`
`fait`	`plein`	`carburant`

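Custom Language Analyzer Example

You can also create a custom analyzer with the icuFolding and stopword token filters to index text in languages that have no built-in analyzer.

The following example index definition specifies an index on the subject.he field that uses a custom analyzer named myHebrewAnalyzer to analyze Hebrew text and create tokens.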

{
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "subject": {
        "fields": {
          "he": {
            "analyzer": "myHebrewAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "myHebrewAnalyzer",
      "tokenFilters": [
        {
          "type": "icuFolding"
        },
        {
          "tokens": [
            "אן",
            "שלנו",
            "זה",
            "אל"
          ],
          "type": "stopword"
        }
      ],
      "tokenizer": {
        "type": "standard"
      }
    }
  ]
}
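To make the processing order of this analyzer concrete, the sketch below imitates it in plain JavaScript: tokenize, fold, then remove stop words. It is only an approximation under stated assumptions. The regular expression stands in for the standard tokenizer, and Unicode NFKD normalization with combining marks stripped stands in for icuFolding, which performs more foldings than this. The fold and analyzeHebrew functions are hypothetical names.

```javascript
// Stop words copied from the myHebrewAnalyzer definition above.
const STOP_WORDS = new Set(["אן", "שלנו", "זה", "אל"]);

// Rough stand-in for icuFolding: decompose characters, strip combining
// marks, and lowercase.
function fold(token) {
  return token.normalize("NFKD").replace(/\p{M}+/gu, "").toLowerCase();
}

// Rough stand-in for the standard tokenizer (runs of letters, digits, and
// marks), followed by folding and stop word removal in that order.
function analyzeHebrew(text) {
  const tokens = text.match(/[\p{L}\p{N}\p{M}]+/gu) ?? [];
  return tokens
    .map(fold)
    .filter((token) => token.length > 0 && !STOP_WORDS.has(token));
}

const tokens = analyzeHebrew("עדיף לצייד את המכוניות שלנו כדי להבין את הגורמים לתאונה.");

console.log(tokens.length); // 9
console.log(tokens.includes("שלנו")); // false
console.log(tokens.includes("המכוניות")); // true
```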

The following query searches the subject.he field for the string המכוניות.

db.cars.aggregate([
  {
    $search: {
      "text": {
        "query": "המכוניות",
        "path": "subject.he"
      }
    }
  },
  {
    $project: {
      "_id": 0,
      "subject.he": 1
    }
  }
])

{ subject: { he: 'עדיף לצייד את המכוניות שלנו כדי להבין את הגורמים לתאונה.' } }

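MongoDB Search returns the document with _id: 1 in the results because the query matches a token that the myHebrewAnalyzer analyzer created for the document. The myHebrewAnalyzer analyzer creates the following tokens for the subject.he field in the document with _id: 1: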

`עדיף`	`לצייד`	`את`
`המכוניות`	`כדי`	`להבין`
`את`	`הגורמים`	`לתאונה`

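Multilingual Search Example

You can also create an index that uses multiple language analyzers to perform a multilingual search.

The following example index definition specifies an index with dynamic mappings on the sample_mflix.movies collection. The definition applies the lucene.italian language analyzer to the fullplot field and uses the multi option to specify lucene.english as an alternate language analyzer. For all other fields that it dynamically indexes in the movies collection, MongoDB Search uses the default lucene.standard analyzer set at the top level of the definition.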

{
  "analyzer": "lucene.standard",
  "mappings": {
    "dynamic": true,
    "fields": {
      "fullplot": {
        "type": "string",
        "analyzer": "lucene.italian",
        "multi": {
          "fullplot_english": {
            "type": "string",
            "analyzer": "lucene.english"
          }
        }
      }
    }
  }
}

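The following MongoDB Search query uses the clauses of the compound operator to query the collection:

The must clause uses the text operator to search for English and Italian movie plots that contain the term Bella.
The mustNot clause uses the range operator to exclude movies released between 1984 and 2016.
The should clause uses the text operator to specify a preference for the Comedy genre.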

[
  {
    $search: {
      "index": "multilingual-tutorial",
      "compound": {
        "must": [{
          "text": {
            "query": "Bella",
            "path": { "value": "fullplot", "multi": "fullplot_english" }
          }
        }],
        "mustNot": [{
          "range": {
            "path": "released",
            "gt": ISODate("1984-01-01T00:00:00.000Z"),
            "lt": ISODate("2016-01-01T00:00:00.000Z")
          }
        }],
        "should": [{
          "text": {
            "query": "Comedy",
            "path": "genres"
          }
        }]
      }
    }
  }
]

SCORE: 3.909510850906372  _id: "573a1397f29313caabce8bad"
  plot: "He is a revenge-obssessed stevedore whose sister was brutally raped an…"
  genres:
    0: "Drama"
  runtime: 137
  fullplot: "In Marseilles, a woman commits suicide after she is raped in an alley.…"
  released: 1983-05-18T00:00:00.000+00:00
SCORE: 3.4253346920013428  _id: "573a1396f29313caabce5735"
  plot: "Giovanna e' una bella ragazza, ma ha qualche problema con gli uomini: …"
  genres:
    0: "Comedy"
  runtime: 100
  fullplot: "Giovanna e' una bella ragazza, ma ha qualche problema con gli uomini: …"
  released: 1974-11-15T00:00:00.000+00:00
SCORE: 3.363344430923462  _id: "573a1395f29313caabce13cf"
  plot: "Gerardo è un attore o almeno cerca di esserlo, ma il pubblico non è de…"
  genres:
    0: "Comedy"
  runtime: 95
  fullplot: "Gerardo è un attore o almeno cerca di esserlo, ma il pubblico non è de…"
  released: 1960-02-10T00:00:00.000+00:00
SCORE: 1.9502882957458496  _id: "573a1396f29313caabce5299"
  plot: "Dr Tremayne is an enigmatic Psychiatrist running a Futuristic asylum h…"
  genres:
    0: "Horror"
  runtime: 90
  fullplot: "Dr Tremayne is an enigmatic Psychiatrist running a Futuristic asylum h…"
  released: 1973-10-31T00:00:00.000+00:00
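The output above lists a SCORE for each document. When running the query from mongosh instead, one way to surface the score is to project it with { $meta: "searchScore" }. The sketch below builds such a pipeline. The withScore helper, the limit of 5, and the projected fields are assumptions for illustration, and only the must clause of the query is repeated for brevity.

```javascript
// Hypothetical helper: append stages that cap the number of results and
// expose each match's relevance score next to a few fields.
function withScore(searchStage, limit = 5) {
  return [
    searchStage,
    { $limit: limit },
    {
      $project: {
        plot: 1,
        genres: 1,
        released: 1,
        score: { $meta: "searchScore" },
      },
    },
  ];
}

// Only the must clause of the query above is repeated here.
const searchStage = {
  $search: {
    index: "multilingual-tutorial",
    compound: {
      must: [{
        text: {
          query: "Bella",
          path: { value: "fullplot", multi: "fullplot_english" },
        },
      }],
    },
  },
};

const pipeline = withScore(searchStage);
console.log(JSON.stringify(pipeline, null, 2));
```

The resulting array could be passed to db.movies.aggregate() in mongosh while connected to the sample_mflix database.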
