Query Analytics Part 2: Tuning the System
In Part 1: Know Your Queries, we demonstrated the importance of monitoring and tuning your search system and the dramatic effect it can have on your business. In this second part, we are going to delve into the techniques available to tune and adjust your search system based on your query result analysis.
Query Analytics is available in public preview for all MongoDB Atlas clusters on an M10 or higher tier running MongoDB v5.0 or higher. On those clusters, you can view the analytics information for the tracked search terms in the Atlas UI. Atlas Search doesn't track search terms or display analytics for queries on free and shared-tier clusters.
Atlas Search Query Analytics focuses entirely on the frequency and number of results returned from each $search call. There are also a number of search metrics available for operational monitoring including CPU, memory, index size, and other useful data points.
There are a few big categories of actions we can take based on search query analysis insights, which are not mutually exclusive and often work in synergy with one another.
Let’s start with the user experience itself, from the search box down to the results presentation. You’re following up on a zero-results query: What did the user experience when this occurred? Are you only showing something like “Sorry, nothing found. Try again!”? Consider showing documents the user has previously engaged with, providing links, or automatically searching for looser queries, perhaps removing some of the user's query terms and asking, “Did you mean this?” While the user was typing the query, are you providing autosuggest/typeahead so that typos get corrected in the full search query?
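The "looser query" fallback mentioned above can be sketched in a few lines. This is only an illustration under simple assumptions: terms are split on whitespace, and looser_queries is a hypothetical helper in your application tier, not an Atlas Search feature.

```python
def looser_queries(query):
    """Return variants of the query, each with exactly one term removed.

    Intended as "Did you mean this?" candidates to try after a
    zero-results search. Single-term queries have no looser variant.
    """
    terms = query.split()
    if len(terms) < 2:
        return []
    # Drop the i-th term and keep the remaining terms in their original order.
    return [" ".join(terms[:i] + terms[i + 1:]) for i in range(len(terms))]


candidates = looser_queries("the purple rain")
# candidates == ["purple rain", "the rain", "the purple"]
```

Each candidate could then be run as its own search, and the first one that returns results offered back to the user.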
For queries that return results, is there enough information provided in the user interface to allow the user to refine the results?
Consider these improvements:
- Add suggestions as the user is typing, which can be facilitated by leveraging ngrams via the autocomplete operator or building a specialized autocomplete collection and index for this purpose.
- Add faceted navigation, allowing the user to drill into specific categories and narrow the results shown.
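As a rough sketch of the two improvements above, the application tier might build stages like the following. The field names (title, genres), the index name "default", and the facet name genresFacet are assumptions for illustration; the title field would need an autocomplete mapping and the genres field a facet-capable mapping in your index.

```python
def autocomplete_stage(prefix, path="title", index="default"):
    """Build a $search stage that suggests matches for a partially typed query."""
    return {
        "$search": {
            "index": index,
            "autocomplete": {
                "query": prefix,
                "path": path,
                # Tolerate a single typo while the user is still typing.
                "fuzzy": {"maxEdits": 1},
            },
        }
    }


def facet_stage(query, text_path="title", facet_path="genres", index="default"):
    """Build a $searchMeta stage that counts matching documents per facet value."""
    return {
        "$searchMeta": {
            "index": index,
            "facet": {
                "operator": {"text": {"query": query, "path": text_path}},
                "facets": {
                    "genresFacet": {"type": "string", "path": facet_path, "numBuckets": 10}
                },
            },
        }
    }


# Example pipeline: top five title suggestions for the characters typed so far.
suggest_pipeline = [
    autocomplete_stage("purp"),
    {"$limit": 5},
    {"$project": {"_id": 0, "title": 1}},
]
```

Either pipeline can be passed to your driver's aggregate call on the collection being searched.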
How the queries are constructed is half the trick to getting great search results. (The other half is how your content is indexed.) The search terms the user entered are the key to the Query Analytics tracking, but behind the scenes, there’s much more to the full search request.
Your user interface provides the incoming search terms and likely, additional parameters. It’s up to the application tier to construct the $search-using aggregation pipeline from those parameters.
Here are some querying techniques that can influence the quality of the search results:
- Incorporate synonyms, perhaps in a relevancy-weighted fashion where non-synonymed clauses are boosted higher than clauses with synonyms added.
- Leverage compound.should clauses to allow the underlying relevancy computations to work their magic. Spreading query terms across multiple fields, with independent scoring boosts representing the importance, or weight, of each field, allows the best documents to bubble up higher in the results while still returning all matching documents. For example, a query of “the matrix” in the movies collection would benefit from boosting title higher than plot.
- Use multi-analysis querying. Take advantage of a field being analyzed in multiple ways. Boost exact matches highest, and have less exact and fuzzier matches weighted lower. See the “Index configuration” section below.
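The relevancy-weighted synonym idea from the list above could look roughly like this. It is a sketch under assumptions: the synonym mapping name my_synonyms, the example synonym documents, and the boost values are all hypothetical, and the mapping would have to be defined in your search index and backed by a synonyms source collection.

```python
# Hypothetical documents for a synonyms source collection.
# For "explicit" mappings, the input term is repeated in "synonyms" here so
# that the original term itself can still match.
SYNONYM_DOCS = [
    {"mappingType": "equivalent", "synonyms": ["car", "automobile", "vehicle"]},
    {"mappingType": "explicit", "input": ["animal"], "synonyms": ["animal", "dog", "cat"]},
]


def synonym_weighted_query(query, path, synonyms="my_synonyms"):
    """Score literal matches above matches that needed synonym expansion."""
    return {
        "compound": {
            "should": [
                # Plain clause: matches only what the user typed, boosted higher.
                {"text": {"query": query, "path": path,
                          "score": {"boost": {"value": 2.0}}}},
                # Synonym clause: widens recall, boosted lower.
                {"text": {"query": query, "path": path, "synonyms": synonyms,
                          "score": {"boost": {"value": 1.0}}}},
            ]
        }
    }


operator = synonym_weighted_query("car chase", "plot")
```

A document containing the literal terms satisfies both clauses and so accumulates both scores, while a document matched only through a synonym gets the lower-boosted score alone.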
Index configuration is the other half of great search results and relies on how the underlying search indexes are built from your documents. Here are some index configuration techniques to consider:
- Multi-analysis: Configure your main content fields to be analyzed/tokenized in various ways, ranging from exact (token type) to near-exact (lowercased token, diacritics normalized) to standard tokenized (whitespace and special characters ignored) to language-specific analysis, down to fuzzy.
- Language considerations: If you know the language of the content, use that to your advantage by using the appropriate language analyzer. Consider doing this in a multi-analysis way so that at query time, you can incorporate language-specific considerations into the relevancy computations.
We’re going to highlight a few common Atlas Search-specific adjustments to consider.
Why didn’t “Jacky Chan” match any of the numerous movies that should have matched? First of all, his name is spelled “Jackie Chan,” so the user made a spelling mistake and we have no exact match of the misspelled name. (This is where $match will always fail, and a fuzzier search option is needed.) It turns out our app was doing phrase queries. We loosened this by adding in some additional compound.should clauses using a fuzzy text operator, and also went ahead and added a “jacky”/“jackie” synonym equivalency for good measure. By making these changes, over time, we will see that the number of occurrences for “Jacky Chan” in the “Tracked Queries with No Results” will go down.

The text operator provides query-time synonym expansion. Synonyms can be bi-directional or uni-directional. Bi-directional synonyms are called equivalent in Atlas Search synonym mappings. For example, with “car,” “automobile,” and “vehicle” mapped as equivalent, a query containing any one of those terms would match documents containing any of the other terms, as well. These words are “equivalent” because they can all be used interchangeably. Uni-directional synonyms are explicit mappings (say “animal” -> “dog” and “animal” -> “cat”), such that a query for “animal” will match documents with “cat” or “dog,” but a query for “dog” will match only “dog.”

Using a single operator, like text over a wildcard path, facilitates findability (“recall” in information retrieval speak) but does not help with relevancy, where the best matching documents bubble to the top of the results. An effective way to improve relevancy is to add variously boosted clauses to weight some fields higher than others.

It’s generally a good idea to include a text operator within a compound.should to allow for synonyms to come into play (the phrase operator currently does not support synonym expansion) along with additional phrase clauses that more precisely match what the user typed. Add fuzzy to the text operator to match in spite of slight typos/variations of words. Note that fuzzy and synonyms cannot be combined in the same text operator, so put them in separate clauses.

You may note that Search Tester currently goes really wild with a wildcard * path to match across all textually analyzed fields; consider the field(s) that really make the most sense to be searched, and whether separate boosts should be assigned to them for fine-tuning relevancy. Using a * wildcard is not going to give you the best relevancy because each field has the same boost weight. It can cause objectively bad results to get higher relevancy than they should. Further, a wildcard’s performance is impacted by how many fields you have across your entire collection, which may increase as you add documents.

As an example, let’s suppose our search box powers movie search. Here’s what a relevancy-educated first pass looks like for a query of “purple rain,” generated from our application, first in prose: Consider query term (OR’d) matches in title, cast, and plot fields, boosting matches in those fields in that order, and let’s boost it all the way up to 11 when the query matches a phrase (the query terms in sequential order) of any of those fields.

Now, in Atlas $search syntax, the main query operator becomes a compound of a handful of shoulds with varying boosts:

    "compound": {
      "should": [
        {
          "text": {
            "query": "purple rain",
            "path": "title",
            "score": { "boost": { "value": 3.0 } }
          }
        },
        {
          "text": {
            "query": "purple rain",
            "path": "cast",
            "score": { "boost": { "value": 2.0 } }
          }
        },
        {
          "text": {
            "query": "purple rain",
            "path": "plot",
            "score": { "boost": { "value": 1.0 } }
          }
        },
        {
          "phrase": {
            "query": "purple rain",
            "path": [ "title", "cast", "plot" ],
            "score": { "boost": { "value": 11.0 } }
          }
        }
      ]
    }
Note the duplication of the user’s query in numerous places in that $search stage. This deserves a little bit of coding on your part: parameterize the values and provide easy, top-of-the-code or config-file adjustments for the boosting values, field names, and so on, so that creating these richer query clauses is straightforward in your environment.
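One way to do that parameterization is sketched below. The constant and function names (FIELD_BOOSTS, PHRASE_BOOST, boosted, build_compound) are made up for this illustration; the field names and boost values mirror the “purple rain” example above.

```python
# Top-of-the-code tuning knobs: field name -> boost, highest priority first.
FIELD_BOOSTS = {"title": 3.0, "cast": 2.0, "plot": 1.0}
PHRASE_BOOST = 11.0


def boosted(operator, query, path, boost):
    """Build one operator clause ("text" or "phrase") with a constant boost."""
    return {operator: {"query": query, "path": path,
                       "score": {"boost": {"value": boost}}}}


def build_compound(query, field_boosts=FIELD_BOOSTS, phrase_boost=PHRASE_BOOST):
    """Spread the query across boosted fields, plus one phrase clause over all of them."""
    should = [boosted("text", query, field, boost)
              for field, boost in field_boosts.items()]
    should.append(boosted("phrase", query, list(field_boosts), phrase_boost))
    return {"compound": {"should": should}}


search_stage = {"$search": build_compound("purple rain")}
```

With this shape, adjusting relevancy becomes a matter of editing FIELD_BOOSTS or PHRASE_BOOST rather than touching the query-building code.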
This kind of spreading a query across independently boosted fields is the first key to unlocking better relevancy in your searches. The next key is to query with different analyses, allowing various levels of exactness to fuzziness to have independent boosts, and again, these could be spread across differently weighted paths of fields.
The next section details creating multiple analyzers for fields; imagine plugging those into the paths of another bunch of should clauses! Yes, you can get carried away with this technique, though you should start simple. Often, boosting fields independently and appropriately for your domain is all one needs for Pretty Good Findability and Relevancy.

How your data is indexed determines whether, and how, it can be matched with queries, and thus affects the results your users experience. Adjusting field index configuration could change a search request from finding no documents to matching as expected (or vice versa!). Your index configuration is always a work in progress, and Query Analytics can help track that progress. It will evolve as your querying needs change.
If you’ve set up your index entirely with dynamic mappings, you’re off to a great start! You’ll be able to query your fields in data type-specific ways: numerically, by date ranges, filtering and matching, even regexing on string values. Most interesting is the query-ability of analyzed text. String field values are analyzed. By default, in dynamic mapping settings, each string field is analyzed using the lucene.standard analyzer. This analyzer does a generally decent job of splitting full-text strings into searchable terms (i.e., the “words” of the text). This analyzer doesn’t do any language-specific handling. So, for example, the words “find,” “finding,” and “finds” are all indexed as unique terms with standard/default analysis but would be indexed as the same stemmed term when using lucene.english.

Applying some domain- and data-specific knowledge, we can fine-tune how terms are indexed and thus how easily findable and relevant they are to the documents. Knowing that our movie plot is in English, we can switch the analyzer to lucene.english, opening up the findability of movies with queries that come close to the English words in the actual plot. Atlas Search has over 40 language-specific analyzers available.

Query Analytics will point you to underperforming queries, but it’s up to you to make adjustments. To emphasize an important point that is being reiterated here in several ways, how your content is indexed affects how it can be queried, and the combination of both how content is indexed and how it is queried controls the order in which results are returned (also referred to as relevancy). One really useful technique available with Atlas Search is called Multi Analyzer, empowering each field to be indexed using any number of analyzer configurations. Each of these configurations is indexed independently (its own inverted index, term dictionary, and all that).
For example, we could index the title field for autocomplete purposes, and we could also index it as English text, then phonetically. We could also use our custom defined analyzer (see below) for term shingling, as well as our index-wide analyzer, defaulting to lucene.standard if not specified.

    "title": [
      {
        "foldDiacritics": false,
        "maxGrams": 7,
        "minGrams": 3,
        "tokenization": "nGram",
        "type": "autocomplete"
      },
      {
        "multi": {
          "english": {
            "analyzer": "lucene.english",
            "type": "string"
          },
          "phonetic": {
            "analyzer": "custom.phonetic",
            "type": "string"
          },
          "shingles": {
            "analyzer": "custom.shingles",
            "type": "string"
          }
        },
        "type": "string"
      }
    ]
As they are indexed independently, they are also queryable independently. With this configuration, titles can be queried phonetically (“kat in the hat”), using English-aware stemming (“find nemo”), or with shingles (such that “the purple rain” queries can create “purple rain” phrase queries).
Explore the available built-in analyzers and give multi-indexing and querying a try. Sometimes, a little bit of custom analysis can really do the trick, so keep that technique in mind for a potent way to improve findability and relevancy. Here are our custom.shingles and custom.phonetic analyzer definitions, but please don’t blindly copy them. Make sure you’re testing and understanding these adjustments as they relate to your data and types of queries:

    "analyzers": [
      {
        "charFilters": [],
        "name": "custom.shingles",
        "tokenFilters": [
          {
            "type": "lowercase"
          },
          {
            "maxShingleSize": 3,
            "minShingleSize": 2,
            "type": "shingle"
          }
        ],
        "tokenizer": {
          "type": "standard"
        }
      },
      {
        "name": "custom.phonetic",
        "tokenFilters": [
          {
            "originalTokens": "include",
            "type": "daitchMokotoffSoundex"
          }
        ],
        "tokenizer": {
          "type": "standard"
        }
      }
    ]
Querying will naturally still query the inverted index set up as the default for a field, unless the path specifies a “multi”.
A straightforward example to query specifically the phonetic multi as we have defined it here (the one analyzed with custom.phonetic) looks like this:

    $search: {
      "text": {
        "query": "kat in the hat",
        "path": { "value": "title", "multi": "phonetic" }
      }
    }
Now, imagine combining this “multi” analysis with variously boosted compound.should clauses to achieve fine-grained findability and relevancy controls that are as nuanced as your domain deserves.

Relevancy tuning pro-tip: Use a few clauses, one per multi-analyzed field independently, to boost from most exact (best!) to less exact, down to as fuzzy matching as needed.
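The pro-tip above might translate into something like the following sketch. The multi names (english, phonetic) follow the title mapping shown earlier; the boost values and helper names (TITLE_ANALYSES, multi_path, multi_analysis_query) are illustrative assumptions, not recommended settings.

```python
# (multi name, boost) pairs, ordered from most exact to least exact.
# None means the field's default analyzer rather than a named multi.
TITLE_ANALYSES = [(None, 5.0), ("english", 3.0), ("phonetic", 1.0)]


def multi_path(field, multi=None):
    """Return a plain path, or a path object targeting one multi analysis."""
    return field if multi is None else {"value": field, "multi": multi}


def multi_analysis_query(query, field="title", analyses=TITLE_ANALYSES,
                         phrase_boost=10.0):
    """One should clause per analysis, with the exact phrase boosted highest."""
    should = [{"phrase": {"query": query, "path": field,
                          "score": {"boost": {"value": phrase_boost}}}}]
    for multi, boost_value in analyses:
        should.append({"text": {"query": query,
                                "path": multi_path(field, multi),
                                "score": {"boost": {"value": boost_value}}}})
    return {"compound": {"should": should}}


operator = multi_analysis_query("kat in the hat")
```

A title that matches the exact phrase collects the scores of every clause it satisfies, while a title that only sounds similar still matches through the phonetic clause, just with a much lower score.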
All of these various tricks (language analysis, stemming words, a fuzzy parameter to match words that are close but not quite right, and broadcasting query terms across multiple fields) are useful tools.
How do you go about incorporating Atlas Search Query Analytics into your application? It’s a fairly straightforward process of adding a small “tracking” section to your $search stage.
Queries containing the tracking.searchTerms structure are tracked (provided the cluster is on a qualifying tier):

    {
      $search: {
        "tracking": {
          "searchTerms": "<query to track>"
        }
      }
    }
In Java, the tracking SearchOptions are constructed like this:
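    import com.mongodb.client.model.search.SearchOptions;
    import org.bson.BsonBoolean;
    import org.bson.Document;

    SearchOptions opts = SearchOptions.searchOptions()
        .option("scoreDetails", BsonBoolean.TRUE)
        .option("tracking", new Document("searchTerms", query_string));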
If you’ve got a straightforward search box and that’s the only input provided for a search query, that query string is the best fit for the searchTerms value. In some cases, the query to track is more complicated or deserves more context. In doing some homework for this article, we met with one of our early adopters of the Query Analytics feature who was using tracking codes for the searchTerms value, corresponding to another collection containing the full query context, such as a list of IP addresses being used for network intrusion detection.

A simple addition of this tracking information opens the door to a greater understanding of the queries happening in your search system.
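As a small sketch of how an application tier might attach tracking to any $search stage it builds, consider the helper below. The name with_tracking and the example tracking code are hypothetical; the only part taken from the feature itself is the tracking.searchTerms structure shown earlier.

```python
import copy


def with_tracking(search_stage, search_terms):
    """Return a copy of a $search stage with tracking.searchTerms added.

    search_terms is usually the raw text from the search box, but it can
    also be a tracking code that your application maps to fuller context.
    """
    tracked = copy.deepcopy(search_stage)  # leave the caller's stage untouched
    tracked["$search"]["tracking"] = {"searchTerms": search_terms}
    return tracked


base_stage = {"$search": {"text": {"query": "jacky chan", "path": "cast"}}}

# Track the user's raw query string.
tracked_stage = with_tracking(base_stage, "jacky chan")

# Or track a (hypothetical) code that refers to context stored elsewhere.
coded_stage = with_tracking(base_stage, "TRACK-0001")
```

Copying the stage keeps the tracking decision separate from query construction, so the same base query can be sent with or without tracking.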
The specific adjustments that work best for your particular query challenges are where the art of this craft comes into play. There are many ways to improve a particular query’s results. We’ve shown several techniques to consider here. The main takeaways:
- Search is the gateway that drives revenue, supports research, and engages users.
- Know what your users are experiencing, and use that insight to iterate improvements.
- Matching fuzzily and relevancy ranking results is both an art and science, and there are many options.
Atlas Search Query Analytics is a good first step in the virtuous search query management process.