Query for field containing a string, case in-sensitive and diacritic in-sensitive

Michael_Sudnik · January 25, 2021, 2:36am

Hi,

I have a document that has lots of text fields. I also have a text index that includes all of the fields, which allows me to perform a case and diacritic insensitive search across all fields with ease.

However, I also want to perform an equivalent case and diacritic insensitive search across a single field, but am stuck. I cannot add a new text index, and I cannot use regex to be diacritic insensitive.

It is similar to this, but I need the diacritic insensitivity as well.

I thought that this must be a common scenario, but maybe I am missing an obvious solution?

If there is not an easy way to support this out of the box, then what are the recommended ways to do this?

Thanks for any help!

Pavel_Duchovny · January 25, 2021, 5:35am

Hi @Michael_Sudnik,

Welcome to MongoDB community.

Have you considered using a pipeline aggregation with first pipeline doing the text match and second/third pipeline doing a specific field filtering.

There is an example for similar behaviour here:

In your case you can run a. $regex on a specific filed match or add the field with $regexFind and match on only true values.

Thanks
Pavel

Prasad_Saya · January 25, 2021, 5:46am

Hello @Michael_Sudnik, welcome to the MongoDB Community forum.

You can use Collation to perform case and diacritic insensitive search on a single field. For example,

db.test.find()
{ "_id" : 1, "a" : "café" }
{ "_id" : 2, "a" : "Café" }
{ "_id" : 3, "a" : "Cafe" }
{ "_id" : 4, "a" : "cafe" }

db.test.find({ a: "cafe" }).collation( { locale: "en_US", strength: 1 } )
// returns all the four documents

Note that in the above query, the document field is not indexed.

You can also specify collation as an option when creating an index on a field (see Create Index - Collation as an Option.

If you do specify a collation when creating the index, MongoDB creates the index with the specified collation.

By specifying a collation strength of 1 or 2 , you can create a case-insensitive index. Index with a collation strength of 1 is both diacritic- and case-insensitive.

IMPORTANT: To use an index for string comparisons, an operation must also specify the same collation.

I think you cannot use case and diacritic insensitive search using a Regex search. You can only do a case-insensitive search. In case you need to do a search with both options, you can specify the query like this, for example:

db.test.find({ a: { $regex: "caf[eé]", $options: "i" }})

This matches documents like the following:

{ "_id" : 4, "a" : "cafe" }
{ "_id" : 3, "a" : "Cafe caffe" }
{ "_id" : 1, "a" : "café x" }
{ "_id" : 2, "a" : "y Café" }

Michael_Sudnik · January 25, 2021, 9:32am

Hi, thanks for your reply. Performing a find on a single field by specifying the collation (with or without an index) does not seem to perform a “containing” search. In your example, I would want to be able to search for “caf” and all 4 results should be found. Ideally, I would want to use an index as the data will get pretty big even after doing any additional initial match stages to reduce the data size.

A regex solution is an option, but I do not like the idea of the only option being to not use the index. It’s also not ideal to have to adjust my expressions to support different diacritic characters in this way.

It just feels like there should be a way to do a “contains” for your first option…

Michael_Sudnik · January 25, 2021, 9:32am

Have you considered using a pipeline aggregation with first pipeline doing the text match and second/third pipeline doing a specific field filtering.

So, in this case, results would be returned where the single field does not match the search at all. To filter the results additionally I would need to construct a case and diacritic insensitive search as described by @Prasad_Saya. I guess this is the best solution so far, in terms of performance and getting the functionality that I require.