How to Define a Custom Analyzer and Run an Atlas Search Diacritic-Insensitive Query
This tutorial describes how to create an index that uses a custom analyzer and run a diacritic-insensitive query against the sample_mflix.movies collection. It takes you through the following steps:
1. Set up an Atlas Search index on the title and genres fields in the sample_mflix.movies collection.
2. Run an Atlas Search compound query against the title and genres fields in the sample_mflix.movies collection using the wildcard and text operators.
Before you begin, ensure that your Atlas cluster meets the requirements described in the Prerequisites.
To create an Atlas Search index, you must have Project Data Access Admin
or higher access to the project.
Create the Atlas Search Index
In this section, you will create an Atlas Search index on the title and genres fields in the sample_mflix.movies collection.
In Atlas, go to the Clusters page for your project.
If it's not already displayed, select the organization that contains your desired project from the Organizations menu in the navigation bar.
If it's not already displayed, select your desired project from the Projects menu in the navigation bar.
If it's not already displayed, click Clusters in the sidebar.
The Clusters page displays.
Enter the Index Name, and set the Database and Collection.
In the Index Name field, enter diacritic-insensitive-tutorial.
If you name your index default, you don't need to specify an index parameter in the $search pipeline stage. If you give a custom name to your index, you must specify this name in the index parameter.
In the Database and Collection section, find the sample_mflix database, and select the movies collection.
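The index-name rule above can be illustrated with a short Python sketch. The helper name search_stage is hypothetical (it is not part of any driver), and the snippet only builds the stage document; it does not contact Atlas.

```python
def search_stage(operator_spec, index_name="default"):
    """Build a $search stage; omit "index" when the index is named "default"."""
    stage = dict(operator_spec)
    if index_name != "default":
        # A custom index name must be passed explicitly in the stage.
        stage = {"index": index_name, **stage}
    return {"$search": stage}


text_drama = {"text": {"query": "Drama", "path": "genres"}}

# Custom-named index: the stage carries the index parameter.
custom = search_stage(text_drama, "diacritic-insensitive-tutorial")

# Index named "default": the parameter can be left out.
default = search_stage(text_drama)
```

Because this tutorial uses a custom index name, every query in the following sections includes the index parameter.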
Specify an index definition.
This index definition for the genres and title fields specifies a custom analyzer, diacriticFolder, using the following:
- keyword tokenizer that tokenizes the entire input as a single token.
- icuFolding token filter that applies character foldings such as accent removal and case folding.
The index definition specifies a string type for the genres and title fields. It also applies the custom analyzer named diacriticFolder on the title field.
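To get a feel for what this analyzer does to a title, the following Python sketch approximates the two steps with the standard library. It is only an approximation under stated assumptions: the function name fold is hypothetical, and Unicode decomposition plus casefold covers accent removal and case folding but not every folding that the ICU-based icuFolding filter applies.

```python
import unicodedata


def fold(text):
    """Approximate accent removal and case folding for a single keyword token."""
    # Decompose characters so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, then apply Unicode case folding.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# The keyword tokenizer keeps the whole title as one token, so the
# folded token is simply the folded title.
print(fold("Allez, Eddy!"))  # allez, eddy!
print(fold("allè"))          # alle
```

With both the indexed title and the query term folded this way, a term typed with a diacritic can match titles stored without one.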
Use the Atlas Search Visual Editor or Atlas Search JSON Editor in the Atlas user interface to create the index.
Click Next.
Click Refine Your Index.
In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type diacriticFolder in the Analyzer Name field.
Expand Tokenizer if it's collapsed and select keyword from the dropdown.
Expand Token Filters and click Add token filter.
Select icuFolding from the dropdown and click Add token filter to add the token filter to your custom analyzer.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field in the Customized Configuration tab.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select diacriticFolder from the Index Analyzer and Search Analyzer dropdowns.
Click Add.
Click Add Field Mapping again to index the genres field.
Select genres from the Field Name dropdown and String from the Data Type dropdown.
Click Add, then Save Changes.
Replace the default definition with the following:
{
  "mappings": {
    "fields": {
      "genres": {
        "type": "string"
      },
      "title": {
        "analyzer": "diacriticFolder",
        "type": "string"
      }
    }
  },
  "analyzers": [{
    "charFilters": [],
    "name": "diacriticFolder",
    "tokenizer": {
      "type": "keyword"
    },
    "tokenFilters": [{
      "type": "icuFolding"
    }]
  }]
}
Click Next.
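If you would rather create the index from code than through the Atlas UI, something like the following Python sketch may work. It assumes PyMongo 4.5 or later, which provides Collection.create_search_index and SearchIndexModel. The function name create_tutorial_index is hypothetical, and the sketch is not part of the original tutorial steps.

```python
INDEX_NAME = "diacritic-insensitive-tutorial"

# Same definition as the JSON in this section, written as a Python dict.
INDEX_DEFINITION = {
    "mappings": {
        "fields": {
            "genres": {"type": "string"},
            "title": {"analyzer": "diacriticFolder", "type": "string"},
        }
    },
    "analyzers": [
        {
            "charFilters": [],
            "name": "diacriticFolder",
            "tokenizer": {"type": "keyword"},
            "tokenFilters": [{"type": "icuFolding"}],
        }
    ],
}


def create_tutorial_index(connection_string):
    """Create the search index on sample_mflix.movies (assumes PyMongo >= 4.5)."""
    # Imported here so the definition above can be inspected without the driver.
    from pymongo import MongoClient
    from pymongo.operations import SearchIndexModel

    client = MongoClient(connection_string)
    try:
        collection = client["sample_mflix"]["movies"]
        model = SearchIndexModel(definition=INDEX_DEFINITION, name=INDEX_NAME)
        # Returns the name of the index that was requested.
        return collection.create_search_index(model)
    finally:
        client.close()
```

Calling create_tutorial_index("<connection-string>") with your own Atlas connection string would submit the index. As with the UI, the index may take a short time to finish building before queries can use it.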
Search the Collection
You can use the compound operator to combine two or more operators into a single query. The sample query in this section uses the compound operator to query the title and genres fields in the movies collection using multiple operators.
In this section, connect to your Atlas cluster and run the sample query against the sample_mflix.movies collection using the compound operator.
In Atlas, go to the Clusters page for your project.
If it's not already displayed, select the organization that contains your desired project from the Organizations menu in the navigation bar.
If it's not already displayed, select your desired project from the Projects menu in the navigation bar.
If it's not already displayed, click Clusters in the sidebar.
The Clusters page displays.
Run an Atlas Search diacritic-insensitive query.
This query uses the $search stage to query the collection using the compound operator. The compound operator uses the following clauses:
- must clause to search for movie titles that begin with the term allè using the wildcard operator
- should clause to specify preference for the Drama genre using the text operator
Copy and paste the following query into the Query Editor, and then click the Search button in the Query Editor.
[
  {
    "$search" : {
      "index": "diacritic-insensitive-tutorial",
      "compound" : {
        "must": [{
          "wildcard" : {
            "query" : "allè*",
            "path": "title",
            "allowAnalyzedField": true
          }
        }],
        "should": [{
          "text": {
            "query" : "Drama",
            "path" : "genres"
          }
        }]
      }
    }
  }
]
SCORE: 1.2084882259368896 _id: "573a13a1f29313caabd07bb6"
  plot: "A group of hip retro teenage outsiders become involved in an interscho…"
  genres:
    0: "Drama"
    1: "Family"
    2: "Sport"
  runtime: 103
  title: "Alley Cats Strike"

SCORE: 1.179288625717163 _id: "573a13b1f29313caabd382a2"
  plot: "Famous pianist Zetterstrèm returns home to his native Denmark, to give…"
  genres:
    0: "Drama"
    1: "Romance"
    2: "Sci-Fi"
  runtime: 88
  title: "Allegro"

SCORE: 1 _id: "573a1397f29313caabce5f15"
  plot: "An enthusiastic filmmaker thinks he's come up with a totally original …"
  genres:
    0: "Animation"
    1: "Comedy"
    2: "Fantasy"
  runtime: 75
  title: "Allegro non troppo"

SCORE: 1 _id: "573a13d1f29313caabd8f84b"
  plot: "The eleven year old cycling talent Freddy is the son of a butcher in a…"
  genres:
    0: "Comedy"
  runtime: 100
  title: "Allez, Eddy!"
Expand your query results.
The Search Tester might not display all the fields in the documents it returns. To view all the fields, including the field that you specify in the query path, expand the document in the results.
The wildcard search for allè returns documents where the title field starts with alle, even though those titles don't include any diacritics, because the diacriticFolder custom analyzer used on the title field applies character folding to its values. Atlas Search returns only documents with titles that begin with the query term allè because the keyword tokenizer tokenizes entire strings (or phrases) as a single token.
Alternatively, you can specify the standard tokenizer instead of the keyword tokenizer in the custom analyzer used on the title field. With the standard tokenizer, the Atlas Search results would also contain documents in which any word in the title begins with the query term allè, such as "Desde allè". To test this, edit your index definition to replace the keyword tokenizer on line 17 with the standard tokenizer, save the index definition, and run the sample query.
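The difference between the two tokenizers can be sketched offline in Python. This is a rough simulation under stated assumptions, not how Atlas Search is implemented: the helper names are hypothetical, the standard tokenizer is approximated by splitting on non-word characters, folding is approximated with Unicode decomposition, and fnmatchcase stands in for the wildcard operator.

```python
import re
import unicodedata
from fnmatch import fnmatchcase


def fold(text):
    # Approximate accent removal and case folding.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def keyword_tokens(title):
    # keyword tokenizer: the entire string becomes one token.
    return [fold(title)]


def standard_tokens(title):
    # Rough stand-in for the standard tokenizer: one token per word.
    return re.findall(r"\w+", fold(title))


def wildcard_match(pattern, tokens):
    # The pattern is folded too, so "allè*" behaves like "alle*".
    folded_pattern = fold(pattern)
    return any(fnmatchcase(token, folded_pattern) for token in tokens)


for title in ["Alley Cats Strike", "Allegro", "Desde allè"]:
    print(
        title,
        "| keyword:", wildcard_match("allè*", keyword_tokens(title)),
        "| standard:", wildcard_match("allè*", standard_tokens(title)),
    )
```

In this simulation a title such as "Desde allè" matches only with the word-level tokens, because with a single keyword token the pattern must match from the start of the whole title.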
Connect to your cluster in mongosh.
Open mongosh in a terminal window and connect to your cluster. For detailed instructions on connecting, see Connect via mongosh.
Use the sample_mflix database.
Run the following command at the mongosh prompt:
use sample_mflix
Run an Atlas Search diacritic-insensitive query.
This query uses the $search stage to query the collection using the compound operator. The compound operator uses the following clauses:
- must clause to search for movie titles that begin with the term allè using the wildcard operator
- should clause to specify preference for the Drama genre using the text operator
The query uses the $project stage to:
- Exclude all fields except title and genres
- Add a field named score
db.movies.aggregate([
  {
    "$search" : {
      "index": "diacritic-insensitive-tutorial",
      "compound" : {
        "must": [{
          "wildcard" : {
            "query" : "allè*",
            "path": "title",
            "allowAnalyzedField": true
          }
        }],
        "should": [{
          "text": {
            "query" : "Drama",
            "path" : "genres"
          }
        }]
      }
    }
  },
  {
    "$project" : {
      "_id" : 0,
      "title" : 1,
      "genres" : 1,
      "score" : { "$meta": "searchScore" }
    }
  }
])
{ genres: [ 'Drama', 'Family', 'Sport' ], title: 'Alley Cats Strike', score: 1.2084882259368896 },
{ genres: [ 'Drama', 'Romance', 'Sci-Fi' ], title: 'Allegro', score: 1.179288625717163 },
{ genres: [ 'Animation', 'Comedy', 'Fantasy' ], title: 'Allegro non troppo', score: 1 },
{ genres: [ 'Comedy' ], title: 'Allez, Eddy!', score: 1 }
Connect to your cluster in MongoDB Compass.
Open MongoDB Compass and connect to your cluster. For detailed instructions on connecting, see Connect via Compass.
Run an Atlas Search diacritic-insensitive query.
This query uses the following compound operator clauses to query the collection:
- must clause to search for movie titles that begin with the term allè using the wildcard operator
- should clause to specify preference for the Drama genre using the text operator
The query uses the $project stage to:
- Exclude all fields except title and genres
- Add a field named score
To run this query in MongoDB Compass:
Click the Aggregations tab.
Click Select..., then configure each of the following pipeline stages by selecting the stage from the dropdown and adding the query for that stage. Click Add Stage to add additional stages.
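Pipeline Stage | Query
---|---
$search | { "index": "diacritic-insensitive-tutorial", "compound": { "must": [{ "wildcard": { "query": "allè*", "path": "title", "allowAnalyzedField": true } }], "should": [{ "text": { "query": "Drama", "path": "genres" } }] } }
$project | { "_id": 0, "title": 1, "genres": 1, "score": { "$meta": "searchScore" } }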
If you enabled Auto Preview, MongoDB Compass displays the following documents next to the $project pipeline stage:
{ genres: [ 'Drama', 'Family', 'Sport' ], title: 'Alley Cats Strike', score: 1.2084882259368896 },
{ genres: [ 'Drama', 'Romance', 'Sci-Fi' ], title: 'Allegro', score: 1.179288625717163 },
{ genres: [ 'Animation', 'Comedy', 'Fantasy' ], title: 'Allegro non troppo', score: 1 },
{ genres: [ 'Comedy' ], title: 'Allez, Eddy!', score: 1 }
Set up and initialize the .NET/C# project for the query.
Create a new directory called diacritic-insensitive-example and initialize your project with the dotnet new command.
mkdir diacritic-insensitive-example
cd diacritic-insensitive-example
dotnet new console
Add the .NET/C# Driver to your project as a dependency.
dotnet add package MongoDB.Driver
Create the query in the Program.cs file.
Replace the contents of the Program.cs file with the following code.
The code example performs the following tasks:
- Imports mongodb packages and dependencies.
- Establishes a connection to your Atlas cluster.
- Uses the following compound operator clauses to query the collection:
  - must clause to search for movie titles that begin with the term allè using the wildcard operator
  - should clause to specify preference for the Drama genre using the text operator
- Uses the $project stage to:
  - Exclude all fields except title and genres
  - Add a field named score
- Iterates over the cursor to print the documents that match the query.
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;
using MongoDB.Bson.Serialization.Conventions;
using MongoDB.Driver;
using MongoDB.Driver.Search;

public class DiacriticInsensitiveExample
{
    private const string MongoConnectionString = "<connection-string>";

    public static void Main(string[] args)
    {
        // allow automapping of the camelCase database fields to our MovieDocument
        var camelCaseConvention = new ConventionPack { new CamelCaseElementNameConvention() };
        ConventionRegistry.Register("CamelCase", camelCaseConvention, type => true);

        // connect to your Atlas cluster
        var mongoClient = new MongoClient(MongoConnectionString);
        var mflixDatabase = mongoClient.GetDatabase("sample_mflix");
        var moviesCollection = mflixDatabase.GetCollection<MovieDocument>("movies");

        // define and run pipeline
        var results = moviesCollection.Aggregate()
            .Search(Builders<MovieDocument>.Search.Compound()
                .Must(Builders<MovieDocument>.Search.Wildcard(movie => movie.Title, "allè*", true))
                .Should(Builders<MovieDocument>.Search.Text(movie => movie.Genres, "Drama")),
                indexName: "diacritic-insensitive-tutorial")
            .Project<MovieDocument>(Builders<MovieDocument>.Projection
                .Include(movie => movie.Title)
                .Include(movie => movie.Genres)
                .Exclude(movie => movie.Id)
                .MetaSearchScore(movie => movie.Score))
            .ToList();

        // print results
        foreach (var movie in results)
        {
            Console.WriteLine(movie.ToJson());
        }
    }
}

// ignore document fields that have no matching property on this class
[BsonIgnoreExtraElements]
public class MovieDocument
{
    // omit the Id from serialized output when it holds the default value
    [BsonIgnoreIfDefault]
    public ObjectId Id { get; set; }
    public string [] Genres { get; set; }
    public string Title { get; set; }
    public double Score { get; set; }
}
Before you run the sample, replace <connection-string> with your Atlas connection string. Ensure that your connection string includes your database user's credentials. To learn more, see Connect via Drivers.
Compile and run the Program.cs file.
dotnet run diacritic-insensitive-example.csproj
{ "genres" : ["Drama", "Family", "Sport"], "title" : "Alley Cats Strike", "score" : 1.2084882259368896 }
{ "genres" : ["Drama", "Romance", "Sci-Fi"], "title" : "Allegro", "score" : 1.1792886257171631 }
{ "genres" : ["Animation", "Comedy", "Fantasy"], "title" : "Allegro non troppo", "score" : 1.0 }
{ "genres" : ["Comedy"], "title" : "Allez, Eddy!", "score" : 1.0 }
Run an Atlas Search diacritic-insensitive query.
Create a file named diacritic-insensitive.go.
Copy and paste the following code into the diacritic-insensitive.go file.
The code example performs the following tasks:
- Imports mongodb packages and dependencies.
- Establishes a connection to your Atlas cluster.
- Uses the following compound operator clauses to query the collection:
  - must clause to search for movie titles that begin with the term allè using the wildcard operator
  - should clause to specify preference for the Drama genre using the text operator
- Uses the $project stage to:
  - Exclude all fields except title and genres
  - Add a field named score
- Iterates over the cursor to print the documents that match the query.
package main

import (
    "context"
    "fmt"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    // connect to your Atlas cluster
    client, err := mongo.Connect(context.TODO(), options.Client().ApplyURI("<connection-string>"))
    if err != nil {
        panic(err)
    }
    defer client.Disconnect(context.TODO())

    // set namespace
    collection := client.Database("sample_mflix").Collection("movies")

    // define pipeline stages
    searchStage := bson.D{{"$search", bson.M{
        "index": "diacritic-insensitive-tutorial",
        "compound": bson.M{
            "must": bson.M{
                "wildcard": bson.M{
                    "path":               "title",
                    "query":              "allè*",
                    "allowAnalyzedField": true,
                },
            },
            "should": bson.D{
                {"text", bson.M{
                    "path":  "genres",
                    "query": "Drama"}}},
        },
    }}}
    projectStage := bson.D{{"$project", bson.D{{"title", 1}, {"genres", 1}, {"_id", 0}, {"score", bson.D{{"$meta", "searchScore"}}}}}}

    // run pipeline
    cursor, err := collection.Aggregate(context.TODO(), mongo.Pipeline{searchStage, projectStage})
    if err != nil {
        panic(err)
    }

    // print results
    var results []bson.D
    if err = cursor.All(context.TODO(), &results); err != nil {
        panic(err)
    }
    for _, result := range results {
        fmt.Println(result)
    }
}
Before you run the sample, replace <connection-string> with your Atlas connection string. Ensure that your connection string includes your database user's credentials. To learn more, see Connect via Drivers.
Run the following command to query your collection:
go run diacritic-insensitive.go
[{genres [Drama Family Sport]} {title Alley Cats Strike} {score 1.2084882259368896}]
[{genres [Drama Romance Sci-Fi]} {title Allegro} {score 1.179288625717163}]
[{genres [Animation Comedy Fantasy]} {title Allegro non troppo} {score 1}]
[{genres [Comedy]} {title Allez, Eddy!} {score 1}]
Run an Atlas Search diacritic-insensitive query.
Create a file named DiacriticInsensitive.java.
Copy and paste the following code into the DiacriticInsensitive.java file.
The code example performs the following tasks:
- Imports mongodb packages and dependencies.
- Establishes a connection to your Atlas cluster.
- Uses the following compound operator clauses to query the collection:
  - must clause to search for movie titles that begin with the term allè using the wildcard operator
  - should clause to specify preference for the Drama genre using the text operator
- Uses the $project stage to:
  - Exclude all fields except title and genres
  - Add a field named score
- Iterates over the cursor to print the documents that match the query.
import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Projections.*;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import java.util.Arrays;
import java.util.List;

// the public class name must match the file name, DiacriticInsensitive.java
public class DiacriticInsensitive {
    public static void main(String[] args) {
        // define clauses
        List<Document> mustClauses =
            List.of( new Document("wildcard",
                new Document("path", "title")
                    .append("query", "allè*")
                    .append("allowAnalyzedField", true)));
        List<Document> shouldClauses =
            List.of( new Document("text",
                new Document("query", "Drama")
                    .append("path", "genres")));
        // define pipeline
        Document agg = new Document( "$search",
            new Document("index", "diacritic-insensitive-tutorial")
                .append("compound",
                    new Document("must", mustClauses)
                        .append("should", shouldClauses)));

        // connect to your Atlas cluster
        String uri = "<connection-string>";

        try (MongoClient mongoClient = MongoClients.create(uri)) {
            // set namespace
            MongoDatabase database = mongoClient.getDatabase("sample_mflix");
            MongoCollection<Document> collection = database.getCollection("movies");

            // run pipeline and print results
            collection.aggregate(Arrays.asList(agg,
                project(fields(
                    excludeId(),
                    include("title"),
                    include("genres"),
                    computed("score", new Document("$meta", "searchScore"))))))
                .forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
Note
To run the sample code in your Maven environment, add the following code above the import statements in your file.
package com.mongodb.drivers;
Before you run the sample, replace <connection-string> with your Atlas connection string. Ensure that your connection string includes your database user's credentials. To learn more, see Connect via Drivers.
Compile and run the DiacriticInsensitive.java file.
javac DiacriticInsensitive.java
java DiacriticInsensitive
{"genres": ["Drama", "Family", "Sport"], "title": "Alley Cats Strike", "score": 1.2084882259368896}
{"genres": ["Drama", "Romance", "Sci-Fi"], "title": "Allegro", "score": 1.179288625717163}
{"genres": ["Animation", "Comedy", "Fantasy"], "title": "Allegro non troppo", "score": 1.0}
{"genres": ["Comedy"], "title": "Allez, Eddy!", "score": 1.0}
Run an Atlas Search diacritic-insensitive query.
Create a file named DiacriticInsensitive.kt.
Copy and paste the following code into the DiacriticInsensitive.kt file.
The code example performs the following tasks:
- Imports mongodb packages and dependencies.
- Establishes a connection to your Atlas cluster.
- Uses the following compound operator clauses to query the collection:
  - must clause to search for movie titles that begin with the term allè using the wildcard operator
  - should clause to specify preference for the Drama genre using the text operator
- Uses the $project stage to:
  - Exclude all fields except title and genres
  - Add a field named score
- Prints the documents that match the query from the AggregateFlow instance.
import com.mongodb.client.model.Aggregates.project
import com.mongodb.client.model.Projections.*
import com.mongodb.kotlin.client.coroutine.MongoClient
import kotlinx.coroutines.runBlocking
import org.bson.Document

fun main() {
    // connect to your Atlas cluster
    val uri = "<connection-string>"
    val mongoClient = MongoClient.create(uri)

    // set namespace
    val database = mongoClient.getDatabase("sample_mflix")
    val collection = database.getCollection<Document>("movies")

    runBlocking {
        // define clauses
        val mustClauses = listOf(
            Document(
                "wildcard",
                Document("path", "title")
                    .append("query", "allè*")
                    .append("allowAnalyzedField", true)
            )
        )

        val shouldClauses = listOf(
            Document(
                "text",
                Document("query", "Drama")
                    .append("path", "genres")
            )
        )

        // define pipeline
        val agg = Document( "\$search",
            Document("index", "diacritic-insensitive-tutorial")
                .append("compound", Document("must", mustClauses)
                    .append("should", shouldClauses)
                )
        )

        // run pipeline and print results
        val resultsFlow = collection.aggregate<Document>(
            listOf(
                agg,
                project(fields(
                    excludeId(),
                    include("title", "genres"),
                    computed("score", Document("\$meta", "searchScore"))))
            )
        )
        resultsFlow.collect { println(it) }
    }

    mongoClient.close()
}
Before you run the sample, replace <connection-string> with your Atlas connection string. Ensure that your connection string includes your database user's credentials. To learn more, see Connect via Drivers.
Run the DiacriticInsensitive.kt file.
When you run the DiacriticInsensitive.kt program in your IDE, it prints the following documents:
Document{{genres=[Drama, Family, Sport], title=Alley Cats Strike, score=1.2084882259368896}}
Document{{genres=[Drama, Romance, Sci-Fi], title=Allegro, score=1.179288625717163}}
Document{{genres=[Animation, Comedy, Fantasy], title=Allegro non troppo, score=1.0}}
Document{{genres=[Comedy], title=Allez, Eddy!, score=1.0}}
Run an Atlas Search diacritic-insensitive query.
Create a file named diacritic-insensitive.js.
Copy and paste the following code into the diacritic-insensitive.js file.
The code example performs the following tasks:
- Imports mongodb, MongoDB's Node.js driver.
- Creates an instance of the MongoClient class to establish a connection to your Atlas cluster.
- Uses the following compound operator clauses to query the collection:
  - must clause to search for movie titles that begin with the term allè using the wildcard operator
  - should clause to specify preference for the Drama genre using the text operator
- Uses the $project stage to:
  - Exclude all fields except title and genres
  - Add a field named score
- Iterates over the cursor to print the documents that match the query.
const { MongoClient } = require("mongodb");

// Replace the uri string with your MongoDB deployment's connection string.
const uri =
  "<connection-string>";

const client = new MongoClient(uri);

async function run() {
  try {
    await client.connect();

    // set namespace
    const database = client.db("sample_mflix");
    const coll = database.collection("movies");

    // define pipeline
    const agg = [{
      '$search': {
        'index': 'diacritic-insensitive-tutorial',
        'compound': {
          'must': [{
            'wildcard': {
              'query': "allè*",
              'path': "title",
              'allowAnalyzedField': true
            }
          }],
          'should': [{'text': {'query': 'Drama', 'path': 'genres'}}]
        }}},
      { '$project': { '_id': 0, 'title': 1 , 'genres': 1, 'score': {'$meta': 'searchScore'}}}];

    // run pipeline
    const result = await coll.aggregate(agg);

    // print results
    await result.forEach((doc) => console.log(doc));

  } finally {
    await client.close();
  }
}
run().catch(console.dir);
Before you run the sample, replace <connection-string> with your Atlas connection string. Ensure that your connection string includes your database user's credentials. To learn more, see Connect via Drivers.
Run the following command to query your collection:
node diacritic-insensitive.js
{ genres: [ 'Drama', 'Family', 'Sport' ], title: 'Alley Cats Strike', score: 1.2084882259368896 }
{ genres: [ 'Drama', 'Romance', 'Sci-Fi' ], title: 'Allegro', score: 1.179288625717163 }
{ genres: [ 'Animation', 'Comedy', 'Fantasy' ], title: 'Allegro non troppo', score: 1 }
{ genres: [ 'Comedy' ], title: 'Allez, Eddy!', score: 1 }
Run an Atlas Search diacritic-insensitive query.
Create a file named diacritic-insensitive.py.
Copy and paste the following code into the diacritic-insensitive.py file.
The following code example:
- Imports pymongo, MongoDB's Python driver. pymongo relies on the dns module to connect to Atlas using a DNS seed list connection string, so the code doesn't import that module directly.
- Creates an instance of the MongoClient class to establish a connection to your Atlas cluster.
- Uses the following compound operator clauses to query the collection:
  - must clause to search for movie titles that begin with the term allè using the wildcard operator
  - should clause to specify preference for the Drama genre using the text operator
- Uses the $project stage to:
  - Exclude all fields except title and genres
  - Add a field named score
- Iterates over the cursor to print the documents that match the query.
import pymongo

# connect to your Atlas cluster
client = pymongo.MongoClient('<connection-string>')

# define pipeline
pipeline = [
    {'$search': {
        'index': 'diacritic-insensitive-tutorial',
        'compound': {
            'must': [{'wildcard': {'path': 'title', 'query': 'allè*', 'allowAnalyzedField': True}}],
            'should': [{'text': {'query': 'Drama', 'path': 'genres'}}]}}},
    {'$project': {'_id': 0, 'title': 1, 'genres': 1, 'score': {'$meta': 'searchScore'}}}
]

# run pipeline
result = client['sample_mflix']['movies'].aggregate(pipeline)

# print results
for i in result:
    print(i)
Before you run the sample, replace <connection-string> with your Atlas connection string. Ensure that your connection string includes your database user's credentials. To learn more, see Connect via Drivers.
Run the following command to query your collection:
python diacritic-insensitive.py
{'genres': ['Drama', 'Family', 'Sport'], 'title': 'Alley Cats Strike', 'score': 1.2084882259368896}
{'genres': ['Drama', 'Romance', 'Sci-Fi'], 'title': 'Allegro', 'score': 1.179288625717163}
{'genres': ['Animation', 'Comedy', 'Fantasy'], 'title': 'Allegro non troppo', 'score': 1.0}
{'genres': ['Comedy'], 'title': 'Allez, Eddy!', 'score': 1.0}