Smart Filtering: A Guide to Generating Pre-filters for Semantic Search

Vipul Bhardwaj, Fabian Valle • 20 min read • Published Sep 03, 2024 • Updated Oct 02, 2024

AI Python


Ever searched for "old black and white comedies" only to be bombarded with a mix of modern action flicks? Frustrating, right? That’s the challenge with traditional search engines — they often struggle to understand the nuances of our queries, leaving us wading through irrelevant results.

This is where smart filtering comes in. It's a game-changer that uses metadata and vector search to deliver search results that truly match your intent. Imagine finding exactly the classic comedies you crave, without the hassle.

In this blog, we'll dive into what smart filtering is, how it works, and why it's essential for building better search experiences. Let's uncover the magic behind this technology and explore how it can revolutionize the way you search.

Understanding vector search

Vector search is a powerful tool that helps computers understand the meaning behind data, not just the words themselves. Instead of matching keywords, it focuses on the underlying concepts and relationships. Imagine searching for "dog" and getting results that include "puppy," "canine," and even images of dogs. That's the magic of vector search!

How does it work? Well, it transforms data into mathematical representations called vectors. These vectors are like coordinates on a map, and similar data points are closer together in this vector space. When you search for something, the system finds the vectors closest to your query, giving you results that are semantically similar.
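To make the idea of "closest vectors" concrete, here is a minimal sketch that ranks items by cosine similarity. The three-dimensional vectors and the `nearest` helper are illustrative assumptions; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (illustrative values, not model output).
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def nearest(query_vector, vectors, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda label: cosine_similarity(query_vector, vectors[label]),
        reverse=True,
    )
    return ranked[:k]


print(nearest([1.0, 0.0, 0.0], toy_vectors))  # ['dog', 'puppy']
```

The query vector points in roughly the same direction as "dog" and "puppy", so those two rank ahead of "car".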

While vector search is fantastic at understanding context, it sometimes falls short when it comes to simple filtering tasks. For instance, finding all movies released before 2000 requires precise filtering, not just semantic understanding. This is where smart filtering comes in to complement vector search.

The challenge of semantic search

While vector search brings us closer to understanding the true meaning of queries, there's still a gap between what users want and what search engines deliver. Complex search queries like "earliest comedy movies before 2000" can still be a challenge. Semantic search might understand the concepts of "comedy" and "movies," but it might struggle with the specifics of "earliest" and "before 2000."

This is where the results start to get messy. We might get a mix of old and new comedies, or even dramas that were mistakenly included. To truly satisfy users, we need a way to refine these search results and make them more precise. That's where pre-filters come into play.

What is smart filtering?

Smart filtering is the solution to this challenge. It's a technique that uses a dataset's metadata to create specific filters, refining search results and making them more accurate and efficient. By analyzing the information about your data, like its structure, content, and attributes, smart filtering can identify relevant criteria to filter your search.

Imagine searching for "comedy movies released before 2000." Smart filtering would use metadata like genre, release date, and potentially even plot keywords to create a filter that only includes movies matching those criteria. This way, you get a list of exactly what you want, without the irrelevant noise.

How smart filtering works

Smart filtering is a multi-step process that involves extracting information from your data, analyzing it, and creating specific filters based on your needs. Let's break it down:

Metadata extraction: The first step is to gather relevant information about your data. This includes details like:
- Data structure: How is the data organized (e.g., tables, documents)?
- Attributes: What kind of information is included (e.g., title, description, release date)?
- Data types: What format is the data in (e.g., text, numbers, dates)?
Pre-filter generation: Once you have the metadata, you can start creating pre-filters. These are specific conditions that data must meet to be included in the search results. For example, if you're searching for comedy movies released before 2000, you might create pre-filters for:
- Genre: comedy
- Release date: before 2000
Integration with vector search: The final step is to combine these pre-filters with your vector search. This ensures that the vector search only considers data points that match your pre-defined criteria.

By following these steps, smart filtering significantly improves the accuracy and efficiency of your search results.
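The three steps above can be sketched in plain Python. This is a simplified in-memory illustration, not how Atlas executes a filtered vector search: the movie titles, the two-dimensional embeddings, and the helper names are all assumptions made for the example.

```python
def dot(a, b):
    """Dot product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


# Toy records: metadata plus a made-up 2-dimensional embedding.
movies = [
    {"title": "old comedy", "genre": ["comedy"], "release_date": "1994-06-01", "embedding": [0.9, 0.1]},
    {"title": "new comedy", "genre": ["comedy"], "release_date": "2015-03-10", "embedding": [1.0, 0.0]},
    {"title": "old drama", "genre": ["drama"], "release_date": "1990-09-20", "embedding": [0.95, 0.05]},
    {"title": "older comedy", "genre": ["comedy"], "release_date": "1985-01-15", "embedding": [0.2, 0.8]},
]


def passes_pre_filter(movie):
    """Pre-filter: genre is comedy and release date is before 2000."""
    return "comedy" in movie["genre"] and movie["release_date"] < "2000-01-01"


def filtered_search(query_embedding, records, k=2):
    """Apply the pre-filter first, then rank only the survivors by similarity."""
    candidates = [m for m in records if passes_pre_filter(m)]
    candidates.sort(key=lambda m: dot(query_embedding, m["embedding"]), reverse=True)
    return [m["title"] for m in candidates[:k]]


print(filtered_search([1.0, 0.0], movies))  # ['old comedy', 'older comedy']
```

Note that "new comedy" has the most similar embedding but is never considered, because the pre-filter removes it before the similarity ranking runs.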

Implementation

Notebook
GitHub repository
This tutorial uses MongoDBAtlasVectorSearch from the LangChain framework and GPT-4o to implement smart filtering.

Pre-requisites

To be successful with this tutorial, you will need:

The IDE of your choosing. This tutorial uses a Jupyter notebook. Please feel free to run your commands directly from a notebook.
A MongoDB Atlas account.
A MongoDB Atlas cluster. The free tier will work perfectly.
An OpenAI API key. We will use OpenAI models to embed our data and generate filters. You will need access to:
- text-embedding-ada-002 embedding model.
- gpt-4o for text generation.
Python <4.0, >=3.8.1.

The following instructions are written for a notebook but can be adapted to run in your IDE with minor changes. For example, the leading ! in the pip command is notebook syntax and should be dropped in a terminal.

Install dependencies

Install the required dependencies.

!pip install pymongo==4.7.2 langchain-core==0.2.6 langchain-openai==0.1.7 langchain==0.2.1 langchain-community==0.2.4 lark==1.1.9

Sample data setup

For the purpose of this tutorial, we will use some sample movie data. We will define a list of LangChain documents. Each document represents one movie: its page_content holds a short description of the movie, and its metadata holds release_date, genre, and, for some movies, rating.

from langchain_core.documents import Document

# Sample movies data. The metadata has release_date, genre, and (for some movies) rating
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"release_date": "1994-04-15", "rating": 7.7, "genre": ["action", "scifi", "adventure"]},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"release_date": "2010-07-16", "rating": 8.2, "genre": ["action", "thriller"]},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"release_date": "2006-11-25", "rating": 8.6, "genre": ["anime", "thriller", "scifi"]},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"release_date": "2019-12-25", "rating": 8.3, "genre": ["romance", "drama", "comedy"]},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    ),
    Document(
        page_content="The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.",
        metadata={"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
    ),
    Document(
        page_content="The toys face an uncertain future as they are accidentally donated to a daycare center, leading to a thrilling escape plan.",
        metadata={"release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
    )
]

MongoDB PyMongo client setup

Let us set up our MongoDB connection that we will use to insert our data.

import getpass
from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(prompt="Enter connection string WITH USER + PASS here")

client = MongoClient(connection_string)

# name your database and collection anything you want since it will be created when you enter your data
database_name = "smart_filtering"
collection_name = "movies"
collection = client[database_name][collection_name]

VectorDB embedding object setup

For this tutorial, we are using OpenAI's text-embedding-ada-002 embedding model.

import json

from langchain_openai import OpenAIEmbeddings

# openAI API credentials
openai_api_base = getpass.getpass(prompt="Put in OpenAI URL here")
openai_api_key = getpass.getpass(prompt="Put in OpenAI API Key here")

# default_headers is optional
default_headers = getpass.getpass(prompt="Put in OpenAI API headers if applicable here")
default_headers = json.loads(default_headers) if default_headers else None

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key,
                              openai_api_base=openai_api_base,
                              default_headers=default_headers,
                              model="text-embedding-ada-002")

Initialize the MongoDB collection

We will use the MongoDBAtlasVectorSearch vector store from LangChain to embed and ingest our data. It accepts a list of docs, the embedding object we created earlier, and the MongoDB collection. This step will initialize our MongoDB collection with the movie data and its embeddings.

# Import from langchain_community; the langchain.vectorstores path is a deprecated alias
from langchain_community.vectorstores import MongoDBAtlasVectorSearch

# This step will generate the embeddings and insert the documents
vectorStore = MongoDBAtlasVectorSearch.from_documents(docs, embeddings, collection=collection)

After initialization, the inserted data will look like the following:

![Flowchart showing the process of generating pre-filters for a user query. The flow begins with metadata filtering, where the user query and metadata are passed to a query constructor that creates a pre-filter query. This is then translated into a MongoDB query. The flow continues with a time range query constructor, implemented as an LLM agent, which uses a QueryExecutorMongoDB tool to query data from a vector database. Finally, the metadata filter and time-based query filter are merged to create the final pre-filter.][2]

The page_content is stored in the text field, which is the default for the MongoDBAtlasVectorSearch vector store. The embedding is an array of floats stored under embedding. The metadata fields are present as release_date, rating, and genre.

Create search index

Before we can perform a search on our data, we need to create a search index. Follow these steps to create a search index.

In the Atlas UI, go to your collection smart_filtering.movies.
Click on Search Indexes.
Click on Create Index.
Under Atlas Search, select JSON Editor.
Name your index default and copy/paste the index definition below:

{
  "analyzer": "lucene.standard",
  "searchAnalyzer": "lucene.standard",
  "mappings": {
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 1536,
        "similarity": "cosine"
      },
      "rating": {
        "type": "number"
      },
      "release_date": {
        "type": "token"
      },
      "genre": {
        "type": "token"
      }
    }
  }
}

The index will take a couple of seconds to build. After the index builds successfully, we will be ready to query our data.

Query the data

Let’s say a user wants to find documents for the latest movie released before some date in the animation genre with this query: I want to watch a movie released before year 2000 in the animation genre with the latest release date.

We will try semantic search for this query and see what results we get.

vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

docs = vectorStore.similarity_search(query)
for doc in docs:
    print(doc.page_content)

Output: We received four movies in the output, three of which are not relevant to the user’s query. This is the problem with semantic search that we will solve using smart filtering.

A bunch of scientists bring back dinosaurs and mayhem breaks loose
A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea
Toys come alive and have a blast doing so
Leo DiCaprio gets lost in a dream within a dream within a dream within a ...

Let us analyze the filtering requirements from the user query: I want to watch a movie released before year 2000 in the animation genre with the latest release date.

There are two types of filter requirements in the user query, and we will handle them in two stages.
These are two different tasks, and LLMs tend to give better results when asked to perform one task at a time.

Stage 1 — metadata filter: A pre-filter can be generated based on the query and metadata.

Release before year 2000 can be a potential filter.
Animation genre can be a potential filter.

Stage 2 — time-based filter: We will also need to account for the latest release date. We will need to query the data to find the latest release date movie.

Stage 1: Metadata filter

We will use LangChain’s load_query_constructor_runnable to generate our filter query, and then MongoDBAtlasTranslator to convert it into a valid MongoDB query. We need to pass the following to load_query_constructor_runnable:

A description of the content of our data that will be passed in document_content
Metadata attributes of our data passed in attribute_info
Prompt with some examples that the LLM will use to generate the query

Metadata definition

For the purpose of this tutorial, we will define by hand the metadata used for filtering. We need a document_content_description for the content as a whole, plus a name, description, and type for each metadata field. This requires some basic understanding of the data.
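If you do not know the field types up front, a rough first pass can be derived from a few sample records. The sketch below is only an aid for writing the definitions by hand; `type_label` and `infer_field_types` are hypothetical helpers, not part of LangChain, and the descriptions still have to be written manually.

```python
def type_label(value):
    """Return a short type label in the style used by the attribute definitions."""
    if isinstance(value, list):
        inner = sorted({type_label(item) for item in value})
        return "[" + "|".join(inner) + "]"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    # Anything else is treated as a string in this sketch.
    return "string"


def infer_field_types(metadata_samples):
    """Collect the type labels observed for each metadata field."""
    observed = {}
    for metadata in metadata_samples:
        for name, value in metadata.items():
            observed.setdefault(name, set()).add(type_label(value))
    return {name: "|".join(sorted(labels)) for name, labels in observed.items()}


# Two metadata records taken from the sample movies (one has no rating).
samples = [
    {"release_date": "1994-04-15", "rating": 7.7, "genre": ["action", "scifi", "adventure"]},
    {"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
]
print(infer_field_types(samples))
```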

Description of the content of our data:

document_content_description = "Brief summary of a movie"

Metadata attributes of our data:

from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="Keywords for filtering: ['animation', 'action', 'comedy', 'romance', 'thriller']",
        type="[string]",
    ),
    AttributeInfo(
        name="release_date",
        description="The date the movie was released on",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]

Define prompt for metadata filtering

Our goal is to extract meaningful information from the user query that we can use for the metadata filtering. We will pass the metadata of our data in the context such that the LLM gets an idea of the information that can be used as a filter. We will use a few-shot prompting technique to generate our results.

Few-shot prompting is a technique used with large language models, where the model is given a few examples of a task within the prompt to help guide it to produce the desired output.
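As a minimal illustration of the technique, the sketch below assembles a few-shot prompt by hand. The example queries, the instruction sentence, and `build_few_shot_prompt` are assumptions made for this illustration; the tutorial itself relies on LangChain's query constructor to assemble the real prompt.

```python
# Worked examples that show the model the expected input/output format.
EXAMPLES = [
    {
        "query": "comedies released before 2000",
        "filter": 'and(in("genre", ["comedy"]), lt("release_date", "2000-01-01"))',
    },
    {"query": "any good movie", "filter": "NO_FILTER"},
]


def build_few_shot_prompt(examples, user_query):
    """Place the worked examples before the real query so the model can imitate them."""
    parts = ["Convert each user query into a filter expression."]
    for i, example in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nUser query: {example['query']}\nFilter: {example['filter']}"
        )
    # The final block is left open for the model to complete.
    parts.append(f"User query: {user_query}\nFilter:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(EXAMPLES, "animation movies before 2000")
print(prompt)
```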

Note: Please update the prompt as per your use case.

The below prompt will be passed as schema_prompt in LangChain’s load_query_constructor_runnable that will be used to generate the query.

The prompt instructs the LLM on how to generate the query. We start from the prompt defined in LangChain’s query constructor and adapt it to our use case.
Let’s break down the prompt and understand it:

We are using PromptTemplate to define our prompt.
In the beginning, we instruct the LLM to output the result in JSON format with the rewritten query and filter as keys.
We instruct the LLM to not include any information in the new query that is already accounted for in the filter.
Variables wrapped in {} in the prompt, such as {allowed_comparators}, are filled in later.

from langchain_core.prompts import PromptTemplate


DEFAULT_SCHEMA = """\
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ rewritten user's query after removing the information handled by the filter
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should be re-written. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ({allowed_comparators}): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ({allowed_operators}): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure you understand the user's intent while generating a date filter. Use a range comparators such as gt | gte | lt | lte  for partial dates.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.\
"""
DEFAULT_SCHEMA_PROMPT = PromptTemplate.from_template(DEFAULT_SCHEMA)
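The quadruple braces around the JSON schema are there because braces have to be escaped once per formatting pass. The sketch below uses plain str.format to show the effect; that the schema text is formatted twice (once to fill in the comparators and operators, and once more when it is embedded in the full prompt) is our reading of how the query constructor uses it, so treat that as an assumption.

```python
# A shortened stand-in for the schema prompt: escaped braces plus one variable.
template = '{{{{ "query": string }}}} uses {allowed_comparators}'

# First pass fills the variable and halves the escaped braces.
once = template.format(allowed_comparators="eq | lt")
print(once)   # {{ "query": string }} uses eq | lt

# Second pass (for example, when embedded in a larger template) halves them again.
twice = once.format()
print(twice)  # { "query": string } uses eq | lt
```

With only double braces, the second pass would treat the JSON body as a template variable and fail.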

Now, we will define some examples that we will use in our prompt. The examples will help the LLM to generate better results. We will need to define metadata for our example, a user query, and an expected answer.

Let’s define three data sources:

A songs data source that has a content description and attributes definition.

SONG_DATA_SOURCE = """\
```json
{{
    "content": "Lyrics of a song",
    "attributes": {{
        "artist": {{
            "type": "string",
            "description": "Name of the song artist"
        }},
        "length": {{
            "type": "integer",
            "description": "Length of the song in seconds"
        }},
        "genre": {{
            "type": "[string]",
            "description": "The song genre, one or many of [\"pop\", \"rock\" or \"rap\"]"
        }},
        "release_dt": {{
            "type": "string",
            "description": "Release date of the song."
        }}
    }}
}}
```\
"""

A movies data source. This is similar to our sample data that we are trying to solve. Adding it in the few-shot examples can improve our results.

MOVIES_DATA_SOURCE = """\
```json
{{
    "content": "Brief summary of a movie",
    "attributes": {{
        "release_date": {{
            "type": "string",
            "description": "The release date of the movie"
        }},
        "genre": {{
            "type": "[string]",
            "description": "Keywords for filtering: ['anime', 'action', 'comedy', 'romance', 'thriller']"
        }}
    }}
}}
```\
"""

A generic keyword data source. The LLM was struggling to generate the correct query format for keyword/array filtering, so we added this to improve our results.

KEYWORDS_DATA_SOURCE = """\
```json
{{
    "content": "Documents store",
    "attributes": {{
        "tags": {{
            "type": "[string]",
            "description": "Keywords for filtering: ['rag', 'genai', 'gpt', 'langchain', 'llamaindex']"
        }}
    }}
}}
```\
"""

Note: Please add some examples as per your use case to enhance the results.

Now, let’s define some example user queries and expected answers:

KEYWORDS_DATA_SOURCE_ANSWER = """\
```json
{{
    "query": "Give me updates",
    "filter": "in(\\"tags\\", [\\"rag\\", \\"langchain\\"])"
}}
```\
"""

KEYWORDS_DATE_DATA_SOURCE_ANSWER = """\
```json
{{
    "query": "Tell me updates on connectors based on the latest documentation",
    "filter": "in(\\"tags\\", [\\"langchain\\", \\"llamaindex\\"])"
}}
```\
"""

FULL_ANSWER = """\
```json
{{
    "query": "songs about teenage romance",
    "filter": "and(or(eq(\\"artist\\", \\"Taylor Swift\\"), eq(\\"artist\\", \\"Katy Perry\\")), lt(\\"length\\", 180), in(\\"genre\\", [\\"pop\\"]), and(gt(\\"release_dt\\", \\"2010-12-31\\"), lt(\\"release_dt\\", \\"2020-01-01\\")))"
}}
```\
"""

# The outer and(...) is closed here; the original example was missing its final parenthesis
DATE_ANSWER = """\
```json
{{
    "query": "Recommend a movie with latest release date",
    "filter": "and(lt(\\"release_date\\", \\"2010-01-01\\"), in(\\"genre\\", [\\"action\\", \\"thriller\\"]))"
}}
```\
"""

NO_FILTER_ANSWER = """\
```json
{{
    "query": "",
    "filter": "NO_FILTER"
}}
```\
"""

Putting the above together, we can define our examples with the data source definition, user query, and expected answer:

DEFAULT_EXAMPLES = [
    {
        "i": 1,
        "data_source": MOVIES_DATA_SOURCE,
        "user_query": "Recommend an action or thriller genre movie release before 2010 and latest release date",
        "structured_request": DATE_ANSWER,
    },
    {
        "i": 2,
        "data_source": MOVIES_DATA_SOURCE,
        "user_query": "Recommend a latest movie",
        "structured_request": NO_FILTER_ANSWER
    },
    {
        "i": 3,
        "data_source": SONG_DATA_SOURCE,
        "user_query": "What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre released before 1 January 2020 and after 31 December, 2010",
        "structured_request": FULL_ANSWER,
    },
    {
        "i": 4,
        "data_source": SONG_DATA_SOURCE,
        "user_query": "What are songs that were not published on Spotify",
        "structured_request": NO_FILTER_ANSWER,
    },
    {
        "i": 5,
        "data_source": KEYWORDS_DATA_SOURCE,
        "user_query": "Give me updates on rag with langchain",
        "structured_request": KEYWORDS_DATA_SOURCE_ANSWER
    },
    {
        "i": 6,
        "data_source": KEYWORDS_DATA_SOURCE,
        "user_query": "Tell me updates on langchain and llamaindex connectors based on the latest documentation",
        "structured_request": KEYWORDS_DATE_DATA_SOURCE_ANSWER
    }
]

Let us define a utility function that we will use to process our filters before returning.

def enforce_constraints(input_json):
    """Normalize the translated filter so that every operator value is a plain scalar or list of strings."""
    def process_value(value):
        # float is accepted as well so that numeric fields such as rating can be filtered
        if isinstance(value, (str, int, float)):
            return value
        elif isinstance(value, list) and all(isinstance(item, str) for item in value):
            return value
        elif isinstance(value, dict) and 'date' in value and isinstance(value['date'], str):
            # Unwrap {"date": "YYYY-MM-DD", ...} into the plain date string
            return value['date']
        else:
            raise ValueError("Invalid value type")

    def process_dict(d):
        if not isinstance(d, dict):
            return d
        processed_dict = {}
        for k, v in d.items():
            if k.startswith("$") and isinstance(v, list):
                # Handling $and and $or conditions
                processed_dict[k] = [process_dict(item) for item in v]
            elif k.startswith("$"):
                processed_dict[k] = process_value(v)
            else:
                processed_dict[k] = process_dict(v)
        return processed_dict

    return process_dict(input_json)

Now, let's go ahead and define our generate_metadata_filter function that we will be using to generate our metadata filters.

from langchain.chains.query_constructor.base import load_query_constructor_runnable
from langchain_community.query_constructors.mongodb_atlas import MongoDBAtlasTranslator

translator = MongoDBAtlasTranslator()
allowed_operators = translator.allowed_operators
allowed_comparators = translator.allowed_comparators

def generate_metadata_filter(query, llm, document_content_description, metadata_field_info, schema_prompt, examples):
    """
    This method will use the query constructor and generate the pre-filter for the dataset.
    :param query: query submitted by the user
    :param llm: llm instance
    :param document_content_description: Data description
    :param metadata_field_info: metadata fields information
    :param schema_prompt: prompt instructions that will be passed to the llm
    :param examples: list of examples
    :return (tuple): the pre-filter (dict) and the rewritten query (str)
    """
    query_constructor = load_query_constructor_runnable(
        llm=llm,
        document_contents=document_content_description,
        attribute_info=metadata_field_info,
        schema_prompt=schema_prompt,
        examples=examples,
        allowed_comparators=allowed_comparators,
        allowed_operators=allowed_operators
    )
    query = f"""Answer the below question:\n
            Question: {query}
            """
    structured_query = query_constructor.invoke(query)
    # Translate the structured query into the rewritten query plus MongoDB search kwargs
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    pre_filter = enforce_constraints(new_kwargs)
    # The translator nests the filter under "pre_filter"; fall back to an empty filter
    pre_filter = pre_filter.get("pre_filter", {}) if pre_filter else {}
    return pre_filter, new_query

Let us define our LLM object for text generation. We will use LangChain’s ChatOpenAI for this purpose. We will use the gpt-4o model for our filter generation.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(openai_api_key=openai_api_key,
                 openai_api_base=openai_api_base,
                 default_headers=default_headers,
                 model="gpt-4o")

Now that we have everything we need to generate the metadata filter, let’s give it a try.

query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

pre_filter, new_query = generate_metadata_filter(
    query=query,
    llm=llm,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info,
    schema_prompt=DEFAULT_SCHEMA_PROMPT,
    examples=DEFAULT_EXAMPLES)

print(pre_filter)
print(new_query)

vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

docs = vectorStore.similarity_search(query=query, pre_filter=pre_filter)
for doc in docs:
    print(doc.page_content)

Output:
Generated filter and new query:

{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
I want to watch a movie with the latest release date

As you can see in the generated filter, we were able to extract the information from the user’s query and the metadata, such as “release before year 2000” and “animation genre,” that can be used to pre-filter the data before running the semantic search.
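To see which sample movies this filter keeps, the sketch below evaluates it in memory against the metadata. It is a rough emulation that supports only the operators appearing in this filter ($and, $or, $lt, $in), and the `matches` helper is hypothetical; Atlas evaluates the filter inside the vector search stage. Because release_date is stored as a YYYY-MM-DD string, the $lt comparison is lexicographic, which for this format agrees with chronological order.

```python
def matches(metadata, mongo_filter):
    """Return True if the metadata satisfies the (small subset of) MongoDB filter."""
    for key, condition in mongo_filter.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            value = metadata.get(key)
            for op, operand in condition.items():
                if op == "$lt":
                    if value is None or not value < operand:
                        return False
                elif op == "$in":
                    # For array fields, one element inside the operand list is enough.
                    values = value if isinstance(value, list) else [value]
                    if not any(v in operand for v in values):
                        return False
                else:
                    raise ValueError(f"Unsupported operator in this sketch: {op}")
    return True


generated_filter = {
    "$and": [
        {"release_date": {"$lt": "2000-01-01"}},
        {"genre": {"$in": ["animation"]}},
    ]
}

sample_metadata = [
    {"release_date": "1994-04-15", "genre": ["action", "scifi", "adventure"]},
    {"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    {"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
    {"release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
]

kept = [m["release_date"] for m in sample_metadata if matches(m, generated_filter)]
print(kept)  # ['1995-11-22', '1999-11-24']
```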

Note that we are also returning a new query after removing the filters that we generated. This will be helpful in the next stage of filter generation.

Output documents:

Toys come alive and have a blast doing so
The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.

We received two documents in the output, which is a better result than before. But we still have not narrowed the results down to the movie with the latest release date.

To find the latest movie, we need to query our data, so we will move on to Stage 2 of filter generation.

Stage 2: Time-based filtering

The purpose of this stage is to generate the filters that can be used to find the movies with the latest release date.

Note: We will reuse the filter generated in Stage 1 in this stage because we want the latest release date among movies released before 2000 in the animation genre.

We will need to query our MongoDB collection via the LLM. For this purpose, we will define some tools that the LLM can use to query our data.
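Conceptually, this stage looks up the latest release date among the documents that pass the Stage 1 filter and turns it into an extra date condition. The in-memory sketch below illustrates that idea only. The helper names, the use of $gte for the date condition, and merging with $and are assumptions made for illustration; in the tutorial the lookup is performed by the LLM through a MongoDB tool.

```python
def latest_date_filter(candidates, date_field="release_date"):
    """Build a date condition that keeps only the most recent candidate(s)."""
    if not candidates:
        return {}
    # ISO-formatted date strings sort chronologically, so max() finds the latest.
    latest = max(doc[date_field] for doc in candidates)
    return {date_field: {"$gte": latest}}


def merge_filters(*filters):
    """Combine the non-empty filters under a single $and."""
    active = [f for f in filters if f]
    if not active:
        return {}
    if len(active) == 1:
        return active[0]
    return {"$and": active}


stage1_filter = {
    "$and": [
        {"release_date": {"$lt": "2000-01-01"}},
        {"genre": {"$in": ["animation"]}},
    ]
}

# Metadata of the two sample movies that pass the Stage 1 filter.
stage1_matches = [
    {"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    {"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
]

time_filter = latest_date_filter(stage1_matches)
final_filter = merge_filters(stage1_filter, time_filter)
print(time_filter)  # {'release_date': {'$gte': '1999-11-24'}}
```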

Tools

Let us define the tools that allow the LLM to query our MongoDB collection.

import json
import traceback
from typing import Dict, Optional, Type, Union, List

from langchain_core.callbacks import CallbackManagerForToolRun
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.tools import BaseTool

class MongoDBClient:
    """Data helper for querying MongoDB Vector Indexes."""

    def __init__(self, collection):
        self.collection = collection

    def run_aggregate_pipeline(self, pipeline: List[Dict]) -> List[Dict]:
        documents = list(self.collection.aggregate(pipeline))
        return documents

class BaseMongoDBTool(BaseModel):
    """Base tool for interacting with MongoDB."""

    client: MongoDBClient = Field(exclude=True)
    # Stage 1 filter; it is prepended to every pipeline as a $match stage
    match_filter: dict = Field(exclude=True)

    class Config(BaseTool.Config):
        pass

class _QueryExecutorMongoDBToolInput(BaseModel):
    pipeline: str = Field(..., description="A valid MongoDB pipeline in JSON string format")

class QueryExecutorMongoDBTool(BaseMongoDBTool, BaseTool):
    name: str = "mongo_db_executor"
    description: str = """
    Input to this tool is a mongodb pipeline, output is a list of documents.
    If the pipeline is not correct, an error message will be returned.
    If an error is returned, report back to the user the issue and stop.
    """
    args_schema: Type[BaseModel] = _QueryExecutorMongoDBToolInput

    def _run(
            self,
            pipeline: str,
            run_manager: Optional[CallbackManagerForToolRun] = None,
    ) -> Union[List[Dict], str]:
        """Get the result for the mongodb pipeline."""
        try:
            pipeline = json.loads(pipeline)
            if self.match_filter:
                # Restrict the LLM-generated pipeline to documents that pass the Stage 1 filter
                pipeline = [{"$match": self.match_filter}] + pipeline
            print(f"Updated pipeline: {pipeline}")
            documents = self.client.run_aggregate_pipeline(pipeline)
            return documents
        except Exception as e:
            # Return the formatted error so the LLM can report it
            return f"Error: {e}\n{traceback.format_exc()}"
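The step in `_run` that prepends the Stage 1 filter can be tried in isolation, without a database. In the sketch below, the pipeline string is only an example of what the LLM might produce, and `prepend_match` is a hypothetical standalone version of that step.

```python
import json


def prepend_match(pipeline_json, match_filter):
    """Parse the pipeline string and put a $match stage for the filter in front of it."""
    pipeline = json.loads(pipeline_json)
    if match_filter:
        pipeline = [{"$match": match_filter}] + pipeline
    return pipeline


# Example of a pipeline the LLM might generate to find the latest release date.
llm_pipeline = (
    '[{"$sort": {"release_date": -1}}, {"$limit": 1}, '
    '{"$project": {"_id": 0, "release_date": 1}}]'
)

stage1_filter = {
    "$and": [
        {"release_date": {"$lt": "2000-01-01"}},
        {"genre": {"$in": ["animation"]}},
    ]
}

full_pipeline = prepend_match(llm_pipeline, stage1_filter)
for stage in full_pipeline:
    print(stage)
```

Placing $match first means the sort and limit only see documents that already satisfy the metadata filter.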

Prompt definition

In the second stage, we only account for queries that ask for the "latest," "recent," "first," or "last" item. We will instruct our LLM to generate an aggregation pipeline, and a filter from its results, only for these types of queries.

SYSTEM_PROMPT_TEMPLATE = """
Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ rewritten user's query after removing the information handled by the filter
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should be re-written. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ('eq | ne | gt | gte | lt | lte | in | nin'): comparator
- `attr` (string): name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ('and | or'): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

First step is to think about whether the user question mentions anything about date or time related that require a lookup in the MongoDB database. Words like "latest", "recent", "earliest", "first", "last" etc. in the query means a look up could be required. If no lookup is required, return "NO_FILTER" for the filter value.

If required, create a syntactically correct MongoDB aggregation pipeline using '$sort' and '$limit' operator to run. Use projection to only fetch the relevant date columns. Then look at the results of the aggregation pipeline and generate a date range query that can be used to filter relevant documents from the collection.

Make sure to only generate date-based filters.
Make sure to only generate the query if a user asks about a time based question such as latest, most recent and not mention a specific date time.
Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to date/time attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format YYYY-MM-DD when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.
Make sure the column names in the filter query are in double quotes.

<< Data Source >>
```json
{{{{
    "content": {content_description},
    "attributes": {attribute_info}
}}}}
```
"""

Now, let's define the `generate_time_based_filter` function that we will use to generate our time-based filters.

The function also covers the cases where only one stage of filter generation is required or where no filter is needed: it then returns an empty filter.

```python
from typing import Dict, List, Tuple

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.chains.query_constructor.base import (
    AttributeInfo,
    StructuredQueryOutputParser,
    _format_attribute_info,
)
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
    PromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI
from pymongo.collection import Collection


# `translator` and `enforce_constraints` are the objects defined for stage 1.
def generate_time_based_filter(llm: ChatOpenAI, collection: Collection, pre_filter: Dict, query: str, document_content_description: str, metadata_field_info: List[AttributeInfo]) -> Tuple[Dict, str]:
    """
    This function is responsible for generating filter query for "most recent", "latest", "earliest" type of user
    questions.
    :param llm: (ChatOpenAI) llm instance
    :param collection: pymongo collection instance
    :param pre_filter: (Dict) metadata pre-filter query
    :param query: (str) user query
    :param document_content_description: (str) description of data
    :param metadata_field_info: (List[AttributeInfo]) list of metadata attributes information
    :return: (Tuple[Dict, str]) time-based filter query and re-written user query
    """
    # Give the agent a tool that runs pipelines restricted by the stage 1 pre-filter
    client = MongoDBClient(collection=collection)
    executor_tool = QueryExecutorMongoDBTool(client=client, match_filter=pre_filter)
    tools = [executor_tool]
    attribute_str = _format_attribute_info(metadata_field_info)
    system_prompt_template = SYSTEM_PROMPT_TEMPLATE.format(attribute_info=attribute_str,
                                                           content_description=document_content_description)

    prompt = ChatPromptTemplate(input_variables=["agent_scratchpad", "input"],
                                messages=[SystemMessagePromptTemplate(
                                    prompt=PromptTemplate(input_variables=[], template=system_prompt_template)),
                                    MessagesPlaceholder(variable_name="chat_history", optional=True),
                                    HumanMessagePromptTemplate(
                                        prompt=PromptTemplate(input_variables=["input"],
                                                              template="{input}")),
                                    MessagesPlaceholder(variable_name="agent_scratchpad")])

    agent = create_tool_calling_agent(llm, tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
    structured_query = agent_executor.invoke({"input": query})

    # Only attributes declared in the metadata definition may appear in the filter
    allowed_attributes = []
    for ainfo in metadata_field_info:
        allowed_attributes.append(
            ainfo.name if isinstance(ainfo, AttributeInfo) else ainfo["name"]
        )

    output_parser = StructuredQueryOutputParser.from_components(
        allowed_comparators=translator.allowed_comparators,
        allowed_operators=translator.allowed_operators,
        allowed_attributes=allowed_attributes
    )
    # Parse the agent's answer and translate it into a MongoDB filter
    structured_query = output_parser.parse(structured_query["output"])
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    time_based_pre_filter = enforce_constraints(new_kwargs)
    time_based_pre_filter = time_based_pre_filter.get('pre_filter', {}) if time_based_pre_filter else {}
    return time_based_pre_filter, new_query
```

#### Metadata and time-based filter generation

Let’s run both filter stages now and check the result.

```python
query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

pre_filter, new_query = generate_metadata_filter(
    query=query,
    llm=llm,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info,
    schema_prompt=DEFAULT_SCHEMA_PROMPT,
    examples=DEFAULT_EXAMPLES)

print(pre_filter)
print(new_query)

time_based_pre_filter, final_query = generate_time_based_filter(
    llm=llm,
    collection=collection,
    pre_filter=pre_filter,
    query=new_query,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info
)
print(time_based_pre_filter)
print(final_query)
```

Output:

```python
{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
I want to watch a movie with the latest release date

{'release_date': {'$eq': '1999-11-24'}}
I want to watch a movie
```

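To see where the `1999-11-24` value comes from, the lookup can be reproduced in plain Python on a few of the sample movies. This is only an illustration of the logic, assuming an in-memory list instead of the MongoDB collection; in the tutorial, the agent obtains the same date by running a `$sort`/`$limit` pipeline through the tool. Dates stored as `YYYY-MM-DD` strings compare correctly as plain strings, which is why the prompts insist on that format.

```python
# A subset of the tutorial's sample movies (metadata only)
movies = [
    {"release_date": "1994-04-15", "genre": ["action", "scifi", "adventure"]},
    {"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    {"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
    {"release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
]


def matches_stage_one(doc: dict) -> bool:
    """In-memory equivalent of the stage 1 filter: released before 2000 and in the animation genre."""
    return doc["release_date"] < "2000-01-01" and "animation" in doc["genre"]


# Stage 1: keep only the documents that pass the metadata filter
candidates = [doc for doc in movies if matches_stage_one(doc)]

# Stage 2: "latest release date" means the maximum date among the remaining documents
latest = max(candidates, key=lambda doc: doc["release_date"])
time_filter = {"release_date": {"$eq": latest["release_date"]}}

print(time_filter)
```
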
Now that we have generated both stages’ filters, we can combine them using the `$and` operator to generate our final filter.

```python
# initialize final filter with stage 1 filter
final_pre_filter = pre_filter
if time_based_pre_filter:
    # add time_based_filter if applicable
    final_pre_filter = {"$and": [pre_filter, time_based_pre_filter]}

print(final_pre_filter)
```

Output:

```python
{'$and': [{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}, {'release_date': {'$eq': '1999-11-24'}}]}
```

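The nested `$and` above works as is, but the combined filter can also be flattened, which additionally avoids an empty `{}` clause when stage 1 produces no filter. The helper below is a sketch of that idea; `combine_filters` is a hypothetical name and not part of the tutorial's code.

```python
from typing import Any, Dict, List


def combine_filters(*filters: Dict[str, Any]) -> Dict[str, Any]:
    """Merge filters under a single $and, skipping empty ones and flattening nested $and clauses."""
    clauses: List[Dict[str, Any]] = []
    for current in filters:
        if not current:
            continue  # an empty filter adds no constraint
        if set(current) == {"$and"}:
            clauses.extend(current["$and"])  # lift nested $and clauses to the top level
        else:
            clauses.append(current)
    if not clauses:
        return {}
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


stage_one = {"$and": [{"release_date": {"$lt": "2000-01-01"}}, {"genre": {"$in": ["animation"]}}]}
stage_two = {"release_date": {"$eq": "1999-11-24"}}

print(combine_filters(stage_one, stage_two))
```

Flattening does not change which documents match, since `$and` of an `$and` is equivalent to a single `$and` over all clauses.
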
Now, let’s run a semantic search using the `final_pre_filter`.

```python
vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

docs = vectorStore.similarity_search(query=query, pre_filter=final_pre_filter)
for doc in docs:
    print(doc.page_content)
```

Output:

```bash
The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.
```

With smart filtering, we combined the metadata and time-based stages into a single filter that pre-filters the data before the semantic search runs. As a result, the search returned only the one document that satisfies every condition in the query.

## Benefits of smart filtering

Smart filtering brings several advantages that make it a valuable tool for enhancing search experiences:

* **Improved search accuracy:** By restricting the search to the data that matches the conditions in your query, smart filtering increases the likelihood of finding relevant results. No more wading through irrelevant information.
* **Faster search results:** Since smart filtering narrows down the search scope, the system has fewer documents to process, which can lead to quicker results.
* **Enhanced user experience:** When users find what they're looking for quickly and easily, it leads to higher satisfaction and a better overall experience.
* **Versatility:** Smart filtering can be applied to various domains, from e-commerce product searches to content recommendations.

By leveraging metadata and creating targeted pre-filters, smart filtering empowers you to deliver search results that truly meet user expectations.

## Conclusion

Smart filtering transforms search experiences by bridging the gap between user intent and search results. By combining metadata filters with vector search, it delivers more accurate, relevant, and efficient search outcomes.

Whether you're building an e-commerce platform, a content recommendation system, or any application that relies on effective search, incorporating smart filtering can significantly enhance user satisfaction and drive better results.

By understanding the fundamentals of smart filtering, you're equipped to explore its potential and implement it in your projects. So why wait? Start leveraging the power of smart filtering today and revolutionize your search game!

View the [full source code](https://github.com/bhardwaj-vipul/SmartFilteringRAG) for smart filtering using MongoDB Atlas.

Check out additional resources: [Unlock the Power of Semantic Search With MongoDB Atlas Vector Search](https://www.mongodb.com/basics/semantic-search) and [Interactive RAG With MongoDB Atlas + Function Calling API](https://www.mongodb.com/developer/products/atlas/interactive-rag-mongodb-atlas-function-calling-api/). If you have any questions or want to show us what you are building, join us in the [MongoDB Community Forums](https://www.mongodb.com/community/forums/).


  [1]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt07d47760e8c04d15/66d6dd2326adee3c5af681ed/image2.png
  [2]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt6ce1fa4c1376e507/66d6dd233bf41e1189a41469/image1.png


1	from langchain_core.documents import Document
2
3	# Sample movies data. The metadata has release_date, rating, genre, director
4	docs = [
5	Document(
6	page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
7	metadata={"release_date": "1994-04-15", "rating": 7.7, "genre": ["action", "scifi", "adventure"]},
8	),
9	Document(
10	page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
11	metadata={"release_date": "2010-07-16", "rating": 8.2, "genre": ["action", "thriller"]},
12	),
13	Document(
14	page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
15	metadata={"release_date": "2006-11-25", "rating": 8.6, "genre": ["anime", "thriller", "scifi"]},
16	),
17	Document(
18	page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
19	metadata={"release_date": "2019-12-25", "rating": 8.3, "genre": ["romance", "drama", "comedy"]},
20	),
21	Document(
22	page_content="Toys come alive and have a blast doing so",
23	metadata={"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
24	),
25	Document(
26	page_content="The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.",
27	metadata={"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
28	),
29	Document(
30	page_content="The toys face an uncertain future as they are accidentally donated to a daycare center, leading to a thrilling escape plan.",
31	metadata={"release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
32	)
33	]

1	import getpass
2	from pymongo import MongoClient
3
4	# set up your MongoDB connection
5	connection_string = getpass.getpass(prompt= "Enter connection string WITH USER + PASS here")
6
7	client = MongoClient(connection_string)
8
9	# name your database and collection anything you want since it will be created when you enter your data
10	database_name = "smart_filtering"
11	collection_name = "movies"
12	collection = client[database_name][collection_name]

1	import json
2
3	from langchain_openai import OpenAIEmbeddings
4
5	# openAI API credentials
6	openai_api_base = getpass.getpass(prompt= "Put in OpenAI URL here")
7	openai_api_key = getpass.getpass(prompt= "Put in OpenAI API Key here")
8
9	# default_headers is optional
10	default_headers = getpass.getpass(prompt= "Put in OpenAI API headers if applicable here")
11	default_headers = json.loads(default_headers) if default_headers else None
12
13	embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key,
14	openai_api_base=openai_api_base,
15	default_headers=default_headers,
16	model="text-embedding-ada-002")

1	from langchain.vectorstores import MongoDBAtlasVectorSearch
2
3	# This step will generate the embeddings and insert the documents
4	vectorStore = MongoDBAtlasVectorSearch.from_documents(docs, embeddings, collection=collection)

1	{
2	"analyzer": "lucene.standard",
3	"searchAnalyzer": "lucene.standard",
4	"mappings": {
5	"fields": {
6	"embedding": {
7	"type": "knnVector",
8	"dimensions": 1536,
9	"similarity": "cosine"
10	},
11	"rating": {
12	"type": "number"
13	},
14	"release_date": {
15	"type": "token"
16	},
17	"genre": {
18	"type": "token"
19	}
20	}
21	}
22	}

1	vectorStore = MongoDBAtlasVectorSearch( collection, embeddings )
2
3	query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"
4
5	docs = vectorStore.similarity_search(query)
6	for doc in docs:
7	print(doc.page_content)

1	A bunch of scientists bring back dinosaurs and mayhem breaks loose
2	A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea
3	Toys come alive and have a blast doing so
4	Leo DiCaprio gets lost in a dream within a dream within a dream within a ...

1	from langchain.chains.query_constructor.base import AttributeInfo
2
3	metadata_field_info = [
4	AttributeInfo(
5	name="genre",
6	description="Keywords for filtering: ['animation', 'action', 'comedy', 'romance', 'thriller']",
7	type="[string]",
8	),
9	AttributeInfo(
10	name="release_date",
11	description="The date the movie was released on",
12	type="string",
13	),
14	AttributeInfo(
15	name="rating", description="A 1-10 rating for the movie", type="float"
16	),
17	]

1	from langchain_core.prompts import PromptTemplate
2
3
4	DEFAULT_SCHEMA = """\
5	<< Structured Request Schema >>
6	When responding use a markdown code snippet with a JSON object formatted in the following schema:
7
8	```json
9	{{{{
10	"query": string \\ rewritten user's query after removing the information handled by the filter
11	"filter": string \\ logical condition statement for filtering documents
12	}}}}
13	```
14
15	The query string should be re-written. Any conditions in the filter should not be mentioned in the query as well.
16
17	A logical condition statement is composed of one or more comparison and logical operation statements.
18
19	A comparison statement takes the form: `comp(attr, val)`:
20	- `comp` ({allowed_comparators}): comparator
21	- `attr` (string): name of attribute to apply the comparison to
22	- `val` (string): is the comparison value
23
24	A logical operation statement takes the form `op(statement1, statement2, ...)`:
25	- `op` ({allowed_operators}): logical operator
26	- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to
27
28	Make sure that you only use the comparators and logical operators listed above and no others.
29	Make sure that filters only refer to attributes that exist in the data source.
30	Make sure that filters only use the attributed names with its function names if there are functions applied on them.
31	Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
32	Make sure you understand the user's intent while generating a date filter. Use a range comparators such as gt \| gte \| lt \| lte for partial dates.
33	Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
34	Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.\
35	"""
36	DEFAULT_SCHEMA_PROMPT = PromptTemplate.from_template(DEFAULT_SCHEMA)

1	SONG_DATA_SOURCE = """\
2	```json
3	{{
4	"content": "Lyrics of a song",
5	"attributes": {{
6	"artist": {{
7	"type": "string",
8	"description": "Name of the song artist"
9	}},
10	"length": {{
11	"type": "integer",
12	"description": "Length of the song in seconds"
13	}},
14	"genre": {{
15	"type": "[string]",
16	"description": "The song genre, one or many of [\"pop\", \"rock\" or \"rap\"]"
17	}},
18	"release_dt": {{
19	"type": "string",
20	"description": "Release date of the song."
21	}}
22	}}
23	}}
24	```\
25	"""

1	MOVIES_DATA_SOURCE = """\
2	```json
3	{{
4	"content": "Brief summary of a movie",
5	"attributes": {{
6	"release_date": {{
7	"type": "string",
8	"description": "The release date of the movie"
9	}},
10	"genre": {{
11	"type": "[string]",
12	"description": "Keywords for filtering: ['anime', 'action', 'comedy', 'romance', 'thriller']"
13	}}
14	}}
15	}}
16	```\
17	"""

1	KEYWORDS_DATA_SOURCE = """\
2	```json
3	{{
4	"content": "Documents store",
5	"attributes": {{
6	"tags": {{
7	"type": "[string]",
8	"description": "Keywords for filtering: ['rag', 'genai', 'gpt', 'langchain', 'llamaindex']"
9	}}
10	}}
11	}}
12	````\
13	"""

1	KEYWORDS_DATA_SOURCE_ANSWER = """\
2	```json
3	{{
4	"query": "Give me updates",
5	"filter": "in(\\"tags\\", [\\"rag\\", \\"langchain\\"])"
6	}}
7	````\
8	"""
9
10	KEYWORDS_DATE_DATA_SOURCE_ANSWER = """\
11	```json
12	{{
13	"query": "Tell me updates on connectors based on the latest documentation",
14	"filter": "in(\\"tags\\", [\\"langchain\\", \\"llamaindex\\"])"
15	}}
16	````\
17	"""
18
19	FULL_ANSWER = """\
20	```json
21	{{
22	"query": "songs about teenage romance",
23	"filter": "and(or(eq(\\"artist\\", \\"Taylor Swift\\"), eq(\\"artist\\", \\"Katy Perry\\")), lt(\\"length\\", 180), in(\\"genre\\", [\\"pop\\"]), and(gt(\\"release_dt\\", \\"2010-12-31\\"), lt(\\"release_dt\\", \\"2020-01-01\\")))"
24	}}
25	```\
26	"""
27
28	DATE_ANSWER = """\
29	```json
30	{{
31	"query": "Recommend a movie with latest release date",
32	"filter": "and(lt(\\"release_date\\", \\"2010-01-01\\"), in(\\"genre\\", [\\"action\\", \\"thriller\\"])"
33	}}
34	```\
35	"""
36
37	NO_FILTER_ANSWER = """\
38	```json
39	{{
40	"query": "",
41	"filter": "NO_FILTER"
42	}}
43	```\
44	"""

1	DEFAULT_EXAMPLES = [
2	{
3	"i": 1,
4	"data_source": MOVIES_DATA_SOURCE,
5	"user_query": "Recommend an action or thriller genre movie release before 2010 and latest release date",
6	"structured_request": DATE_ANSWER,
7	},
8	{
9	"i": 2,
10	"data_source": MOVIES_DATA_SOURCE,
11	"user_query": "Recommend a latest movie",
12	"structured_request": NO_FILTER_ANSWER
13	},
14	{
15	"i": 3,
16	"data_source": SONG_DATA_SOURCE,
17	"user_query": "What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre released before 1 January 2020 and after 31 December, 2010",
18	"structured_request": FULL_ANSWER,
19	},
20	{
21	"i": 4,
22	"data_source": SONG_DATA_SOURCE,
23	"user_query": "What are songs that were not published on Spotify",
24	"structured_request": NO_FILTER_ANSWER,
25	},
26	{
27	"i": 5,
28	"data_source": KEYWORDS_DATA_SOURCE,
29	"user_query": "Give me updates on rag with langchain",
30	"structured_request": KEYWORDS_DATA_SOURCE_ANSWER
31	},
32	{
33	"i": 6,
34	"data_source": KEYWORDS_DATA_SOURCE,
35	"user_query": "Tell me updates on langchain and llamaindex connectors based on the latest documentation",
36	"structured_request": KEYWORDS_DATE_DATA_SOURCE_ANSWER
37	}
38	]

1	def enforce_constraints(input_json):
2	def process_value(value):
3	if isinstance(value, (str, int)):
4	return value
5	elif isinstance(value, list) and all(isinstance(item, str) for item in value):
6	return value
7	elif isinstance(value, dict) and 'date' in value and isinstance(value['date'], str):
8	return value['date']
9	else:
10	raise ValueError("Invalid value type")
11
12	def process_dict(d):
13	if not isinstance(d, dict):
14	return d
15	processed_dict = {}
16	for k, v in d.items():
17	if k.startswith("$") and isinstance(v, list):
18	# Handling $and and $or conditions
19	processed_dict[k] = [process_dict(item) for item in v]
20	elif k.startswith("$"):
21	processed_dict[k] = process_value(v)
22	else:
23	processed_dict[k] = process_dict(v)
24	return processed_dict
25
26	return process_dict(input_json)

1	from langchain.chains.query_constructor.base import load_query_constructor_runnable
2	from langchain_community.query_constructors.mongodb_atlas import MongoDBAtlasTranslator
3
4	translator = MongoDBAtlasTranslator()
5	allowed_operators = translator.allowed_operators
6	allowed_comparators = translator.allowed_comparators
7
8	def generate_metadata_filter(query, llm, document_content_description, metadata_field_info, schema_prompt, examples):
9	"""
10	This method will use the query constructor and generate the pre-filters for a list of datasets.
11	:param query: query submitted by the user
12	:param llm: llm instance
13	:param document_content_description: Data description
14	:param metadata_field_info: metadata fields information
15	:param schema_prompt: prompt instructions that will be passed to the llm
16	:param examples: list of examples
17	:return (dict): Returns pre-filter and new query for each dataset.
18	"""
19	query_constructor = load_query_constructor_runnable(
20	llm=llm,
21	document_contents=document_content_description,
22	attribute_info=metadata_field_info,
23	schema_prompt=schema_prompt,
24	examples=examples,
25	allowed_comparators=allowed_comparators,
26	allowed_operators=allowed_operators
27	)
28	query = f"""Answer the below question:\n
29	Question: {query}
30	"""
31	structured_query = query_constructor.invoke(query)
32	new_query, new_kwargs = translator.visit_structured_query(structured_query)
33	pre_filter = enforce_constraints(new_kwargs)
34	pre_filter = pre_filter = pre_filter.get("pre_filter", {}) if pre_filter else {}
35	return pre_filter, new_query

1	from langchain_openai import ChatOpenAI
2
3	llm = ChatOpenAI(openai_api_key=openai_api_key,
4	openai_api_base=openai_api_base,
5	default_headers=default_headers,
6	model="gpt-4o")

1	query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"
2
3	pre_filter, new_query = generate_metadata_filter(
4	query=query,
5	llm=llm,
6	document_content_description=document_content_description,
7	metadata_field_info=metadata_field_info,
8	schema_prompt=DEFAULT_SCHEMA_PROMPT,
9	examples=DEFAULT_EXAMPLES)
10
11	print(pre_filter)
12	print(new_query)
13
14	vectorStore = MongoDBAtlasVectorSearch( collection, embeddings )
15
16	docs = vectorStore.similarity_search(query=query, pre_filter=pre_filter)
17	for doc in docs:
18	print(doc.page_content)

1	{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
2	I want to watch a movie with the latest release date

1	Toys come alive and have a blast doing so
2	The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.

1	import json
2	import logging
3	import os
4	import traceback
5	from typing import Dict, Optional, Type, Union, List
6
7	from pymongo import MongoClient
8	from langchain_core.callbacks import CallbackManagerForToolRun
9	from langchain_core.pydantic_v1 import BaseModel, Field
10	from langchain_core.tools import BaseTool
11
12	class MongoDBClient:
13	"""Data helper for querying MongoDB Vector Indexes."""
14
15	def __init__(self, collection):
16	self.collection = collection
17
18	def run_aggregate_pipeline(self, pipeline: List[Dict]) -> List[Dict]:
19	documents = list(self.collection.aggregate(pipeline))
20	return documents
21
22	class BaseMongoDBTool(BaseModel):
23	"""Base tool for interacting with MongoDB."""
24
25	client: MongoDBClient = Field(exclude=True)
26	match_filter: dict = Field(exclude=True)
27
28	class Config(BaseTool.Config):
29	pass
30
31	class _QueryExecutorMongoDBToolInput(BaseModel):
32	pipeline: str = Field(..., description="A valid MongoDB pipeline in JSON string format")
33
34	class QueryExecutorMongoDBTool(BaseMongoDBTool, BaseTool):
35	name: str = "mongo_db_executor"
36	description: str = """
37	Input to this tool is a mongodb pipeline, output is a list of documents.
38	If the pipeline is not correct, an error message will be returned.
39	If an error is returned, report back to the user the issue and stop.
40	"""
41	args_schema: Type[BaseModel] = _QueryExecutorMongoDBToolInput
42
43	def _run(
44	self,
45	pipeline: str,
46	run_manager: Optional[CallbackManagerForToolRun] = None,
47	) -> Union[List[Dict], str]:
48	"""Get the result for the mongodb pipeline."""
49	try:
50	pipeline = json.loads(pipeline)
51	if self.match_filter:
52	pipeline = [{"$match": self.match_filter}] + pipeline
53	print(f"Updated pipeline: {pipeline}/")
54	documents = self.client.run_aggregate_pipeline(pipeline)
55	return documents
56	except Exception as e:
57	"""Format the error message"""
58	return f"Error: {e}\n{traceback.format_exc()}"

1	SYSTEM_PROMPT_TEMPLATE = """
2	Your goal is to structure the user's query to match the request schema provided below.
3
4	<< Structured Request Schema >>
5	When responding use a markdown code snippet with a JSON object formatted in the following schema:
6
7	```json
8	{{{{
9	"query": string \\ rewritten user's query after removing the information handled by the filter
10	"filter": string \\ logical condition statement for filtering documents
11	}}}}

1	{{{{
2	"content": {content_description},
3	"attributes": {attribute_info}
4	}}}}

1	####
2
3	Now, let's go ahead and define our `generate_time_based_filter` function that we will be using to generate our time-based filters.
4
5	We have handled the cases where only one stage filter generation is required or no filter generation is required.
6
7	```python
8	from typing import Tuple
9
10	from langchain.agents import create_tool_calling_agent, AgentExecutor
11	from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder, PromptTemplate, HumanMessagePromptTemplate, \
12	SystemMessagePromptTemplate
13	from pymongo.collection import Collection
14	from langchain.chains.query_constructor.base import AttributeInfo, _format_attribute_info
15
16
17	from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder, PromptTemplate, HumanMessagePromptTemplate, \
18	SystemMessagePromptTemplate
19	from pymongo.collection import Collection
20	from langchain.chains.query_constructor.base import AttributeInfo
21
22
23	def generate_time_based_filter(llm: ChatOpenAI, collection: Collection, pre_filter: Dict, query: str, document_content_description: str, metadata_field_info: List[AttributeInfo]) -> Tuple[Dict, str]:
24	"""
25	This function is responsible for generating filter query for "most recent", "latest", "earliest" type of user
26	questions.
27	:param llm: (ChatOpenAI) llm instance
28	:param collection: pymongo collection instance
    :param pre_filter: (Dict) metadata pre-filter query
    :param query: (str) user query
    :param document_content_description: (str) description of data
    :param metadata_field_info: (List[AttributeInfo]) list of metadata attributes information
    :return: (Tuple[Dict, str]) time-based filter query and re-written user query
    """
    # Give the agent a tool that queries the collection, scoped by the stage 1 filter
    client = MongoDBClient(collection=collection)
    executor_tool = QueryExecutorMongoDBTool(client=client, match_filter=pre_filter)
    tools = [executor_tool]
    attribute_str = _format_attribute_info(metadata_field_info)
    system_prompt_template = SYSTEM_PROMPT_TEMPLATE.format(
        attribute_info=attribute_str,
        content_description=document_content_description)

    # Build the chat prompt: system instructions, optional history, user input, scratchpad
    prompt = ChatPromptTemplate(
        input_variables=["agent_scratchpad", "input"],
        messages=[
            SystemMessagePromptTemplate(
                prompt=PromptTemplate(input_variables=[], template=system_prompt_template)),
            MessagesPlaceholder(variable_name="chat_history", optional=True),
            HumanMessagePromptTemplate(
                prompt=PromptTemplate(input_variables=["input"], template="{input}")),
            MessagesPlaceholder(variable_name="agent_scratchpad")])

    # Run the tool-calling agent on the user query
    agent = create_tool_calling_agent(llm, tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
    structured_query = agent_executor.invoke({"input": query})

    # Collect the attribute names the parser is allowed to reference
    allowed_attributes = []
    for ainfo in metadata_field_info:
        allowed_attributes.append(
            ainfo.name if isinstance(ainfo, AttributeInfo) else ainfo["name"]
        )

    # Parse the agent output into a structured query and translate it to a MongoDB filter
    output_parser = StructuredQueryOutputParser.from_components(
        allowed_comparators=translator.allowed_comparators,
        allowed_operators=translator.allowed_operators,
        allowed_attributes=allowed_attributes
    )
    structured_query = output_parser.parse(structured_query["output"])
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    time_based_pre_filter = enforce_constraints(new_kwargs)
    time_based_pre_filter = time_based_pre_filter.get('pre_filter', {}) if time_based_pre_filter else {}
    return time_based_pre_filter, new_query
```

#### Metadata and time-based filter generation

Let’s run both filter stages and check the results.

```python
query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

# Stage 1: metadata filter
pre_filter, new_query = generate_metadata_filter(
    query=query,
    llm=llm,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info,
    schema_prompt=DEFAULT_SCHEMA_PROMPT,
    examples=DEFAULT_EXAMPLES)

print(pre_filter)
print(new_query)

# Stage 2: time-based filter, built on top of the stage 1 filter and the re-written query
time_based_pre_filter, final_query = generate_time_based_filter(
    llm=llm,
    collection=collection,
    pre_filter=pre_filter,
    query=new_query,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info
)
print(time_based_pre_filter)
print(final_query)
```

Output:

```python
{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
I want to watch a movie with the latest release date

{'release_date': {'$eq': '1999-11-24'}}
I want to watch a movie
```
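
The second filter in the output above comes from resolving the relative phrase "latest release date" against the documents that already pass the stage 1 filter. The sketch below reproduces that idea in memory so the logic is easy to follow. It is not the agent-based implementation shown earlier, which lets the LLM query MongoDB through a tool. The helper names `matches_stage_one` and `resolve_latest_release_date` and the sample records are hypothetical, and the sketch assumes release dates are stored as ISO-8601 strings, which compare chronologically as plain strings.

```python
from typing import Dict, List

# Hypothetical sample records; ISO-8601 date strings sort chronologically.
movies: List[Dict] = [
    {"title": "Movie A", "genre": ["animation"], "release_date": "1995-11-22"},
    {"title": "Movie B", "genre": ["animation"], "release_date": "1999-11-24"},
    {"title": "Movie C", "genre": ["animation"], "release_date": "2001-05-18"},
    {"title": "Movie D", "genre": ["drama"], "release_date": "1999-12-10"},
]


def matches_stage_one(doc: Dict) -> bool:
    """In-memory stand-in for the stage 1 filter: animation released before 2000."""
    return doc["release_date"] < "2000-01-01" and "animation" in doc["genre"]


def resolve_latest_release_date(docs: List[Dict]) -> Dict:
    """Turn 'latest release date' into an exact-match filter, or {} if nothing matches."""
    candidates = [d["release_date"] for d in docs if matches_stage_one(d)]
    if not candidates:
        return {}
    return {"release_date": {"$eq": max(candidates)}}


time_based_filter = resolve_latest_release_date(movies)
print(time_based_filter)  # {'release_date': {'$eq': '1999-11-24'}}
```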

Now that we have generated the filters for both stages, we can combine them using the `$and` operator to produce our final filter.

```python
# initialize final filter with stage 1 filter
final_pre_filter = pre_filter
if time_based_pre_filter:
    # add time_based_filter if applicable
    final_pre_filter = {"$and": [pre_filter, time_based_pre_filter]}

print(final_pre_filter)
```

Output:

```python
{'$and': [{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}, {'release_date': {'$eq': '1999-11-24'}}]}
```
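
The nested `$and` above works, but it wraps the stage 1 filter unconditionally, even when that filter is empty. As an optional refinement, here is a minimal sketch of a combiner that skips empty filters and lifts inner `$and` clauses one level up. `combine_filters` is a hypothetical helper, not part of the tutorial's source code, and it assumes each filter is a plain dictionary in MongoDB query syntax.

```python
from typing import Any, Dict, List, Optional


def combine_filters(*filters: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Join filters with $and, skipping empty ones and flattening one level of $and."""
    clauses: List[Dict[str, Any]] = []
    for f in filters:
        if not f:
            continue  # ignore None or {}
        if set(f.keys()) == {"$and"}:
            clauses.extend(f["$and"])  # lift the inner clauses one level up
        else:
            clauses.append(f)
    if not clauses:
        return {}
    if len(clauses) == 1:
        return clauses[0]  # a single clause needs no $and wrapper
    return {"$and": clauses}


pre_filter = {"$and": [{"release_date": {"$lt": "2000-01-01"}},
                       {"genre": {"$in": ["animation"]}}]}
time_based_pre_filter = {"release_date": {"$eq": "1999-11-24"}}

# Prints a single flat $and with three clauses
print(combine_filters(pre_filter, time_based_pre_filter))
```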

Now, let’s run a semantic search using the `final_pre_filter`.

```python
vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

# Only documents that match final_pre_filter are considered for similarity ranking
docs = vectorStore.similarity_search(query=query, pre_filter=final_pre_filter)
for doc in docs:
    print(doc.page_content)
```

Output:

```bash
The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.
```
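
Conceptually, a pre-filter shrinks the candidate set before any similarity ranking happens, so a document that fails the filter can never appear in the results, however close its embedding is to the query. The brute-force sketch below illustrates that ordering on made-up two-dimensional vectors. It is only an illustration: the `filtered_search` helper and the sample data are hypothetical, and Atlas Vector Search itself relies on an index rather than scanning every document.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(docs: List[Dict], query_vec: Sequence[float],
                    predicate: Callable[[Dict], bool], k: int = 2) -> List[Dict]:
    """Apply the metadata predicate first, then rank the survivors by similarity."""
    candidates = [d for d in docs if predicate(d)]
    ranked = sorted(candidates,
                    key=lambda d: cosine(d["embedding"], query_vec),
                    reverse=True)
    return ranked[:k]


# Hypothetical documents with toy embeddings
docs = [
    {"id": 1, "year": 1995, "embedding": [0.9, 0.1]},
    {"id": 2, "year": 1999, "embedding": [0.7, 0.3]},
    {"id": 3, "year": 2010, "embedding": [1.0, 0.0]},  # closest vector, wrong year
]
query_vec = [1.0, 0.0]

results = filtered_search(docs, query_vec, lambda d: d["year"] < 2000)
print([d["id"] for d in results])  # [1, 2]
```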

By combining the metadata and time-based filtering stages, smart filtering produced a single filter that narrows the data before the semantic search runs. As a result, the search returned only the documents that satisfy every constraint in the query.

## Benefits of smart filtering

Smart filtering offers several advantages for search experiences:

* Improved search accuracy: Because the pre-filter keeps only the data that matches the constraints in your query, the results are far more likely to be relevant. No more wading through irrelevant information.
* Faster search results: Since smart filtering narrows down the search scope, the system has fewer documents to process, which can lead to quicker results.
* Enhanced user experience: When users find what they're looking for quickly and easily, it leads to higher satisfaction and a better overall experience.
* Versatility: Smart filtering can be applied to various domains, from e-commerce product searches to content recommendations.

By leveraging metadata and creating targeted pre-filters, smart filtering empowers you to deliver search results that truly meet user expectations.

## Conclusion

Smart filtering transforms search experiences by bridging the gap between user intent and search results. By combining metadata filters with vector search, it delivers more accurate, relevant, and efficient search outcomes.

Whether you're building an e-commerce platform, a content recommendation system, or any application that relies on effective search, incorporating smart filtering can significantly enhance user satisfaction and drive better results.

By understanding the fundamentals of smart filtering, you're equipped to explore its potential and implement it in your projects. So why wait? Start leveraging the power of smart filtering today and revolutionize your search game!

View the [full source code](https://github.com/bhardwaj-vipul/SmartFilteringRAG) for smart filtering using MongoDB Atlas.

Check out additional resources: [Unlock the Power of Semantic Search With MongoDB Atlas Vector Search](https://www.mongodb.com/basics/semantic-search) and [Interactive RAG With MongoDB Atlas + Function Calling API](https://www.mongodb.com/developer/products/atlas/interactive-rag-mongodb-atlas-function-calling-api/). If you have any questions or want to show us what you are building, join us in the [MongoDB Community Forums](https://www.mongodb.com/community/forums/).

[1]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt07d47760e8c04d15/66d6dd2326adee3c5af681ed/image2.png
[2]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt6ce1fa4c1376e507/66d6dd233bf41e1189a41469/image1.png