Smart Filtering: A Guide to Generating Pre-filters for Semantic Search

Vipul Bhardwaj, Fabian Valle • 20 min read • Published Sep 03, 2024 • Updated Oct 02, 2024
Ever searched for "old black and white comedies" only to be bombarded with a mix of modern action flicks? Frustrating, right? That’s the challenge with traditional search engines — they often struggle to understand the nuances of our queries, leaving us wading through irrelevant results.
This is where smart filtering comes in. It's a game-changer that uses metadata and vector search to deliver search results that truly match your intent. Imagine finding exactly the classic comedies you crave, without the hassle.
In this blog, we'll dive into what smart filtering is, how it works, and why it's essential for building better search experiences. Let's uncover the magic behind this technology and explore how it can revolutionize the way you search.
Vector search is a powerful tool that helps computers understand the meaning behind data, not just the words themselves. Instead of matching keywords, it focuses on the underlying concepts and relationships. Imagine searching for "dog" and getting results that include "puppy," "canine," and even images of dogs. That's the magic of vector search!
How does it work? Well, it transforms data into mathematical representations called vectors. These vectors are like coordinates on a map, and similar data points are closer together in this vector space. When you search for something, the system finds the vectors closest to your query, giving you results that are semantically similar.
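To make "closest vectors" concrete, here is a minimal sketch that ranks a few words by cosine similarity to a query vector. The three-dimensional vectors are invented for illustration; a real embedding model produces far longer vectors, and the similarity measure used in practice depends on how the index is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the numbers are made up for this example.
toy_vectors = {
    "puppy": [0.9, 0.8, 0.1],
    "canine": [0.8, 0.9, 0.2],
    "spreadsheet": [0.1, 0.0, 0.9],
}
query_vector = [0.85, 0.85, 0.1]  # stand-in for the embedding of "dog"

# Rank the words from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(query_vector, toy_vectors[word]),
    reverse=True,
)
print(ranked)
```

The two dog-related words rank ahead of the unrelated one, even though none of them shares a keyword with "dog".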
While vector search is fantastic at understanding context, it sometimes falls short when it comes to simple filtering tasks. For instance, finding all movies released before 2000 requires precise filtering, not just semantic understanding. This is where smart filtering comes in to complement vector search.
While vector search brings us closer to understanding the true meaning of queries, there's still a gap between what users want and what search engines deliver. Complex search queries like "earliest comedy movies before 2000" can still be a challenge. Semantic search might understand the concepts of "comedy" and "movies," but it might struggle with the specifics of "earliest" and "before 2000."
This is where the results start to get messy. We might get a mix of old and new comedies, or even dramas that were mistakenly included. To truly satisfy users, we need a way to refine these search results and make them more precise. That's where pre-filters come into play.

What is smart filtering?

![Flowchart showing the process of generating pre-filters for a user query. The flow begins with metadata filtering, where the user query and metadata are passed to a query constructor that creates a pre-filter query. This is then translated into a MongoDB query. The flow continues with a time range query constructor, implemented as an LLM agent, which uses a QueryExecutorMongoDB tool to query data from a vector database. Finally, the metadata filter and time-based query filter are merged to create the final pre-filter.][1]
Smart filtering is the solution to this challenge. It's a technique that uses a dataset's metadata to create specific filters, refining search results and making them more accurate and efficient. By analyzing the information about your data, like its structure, content, and attributes, smart filtering can identify relevant criteria to filter your search.
Imagine searching for "comedy movies released before 2000." Smart filtering would use metadata like genre, release date, and potentially even plot keywords to create a filter that only includes movies matching those criteria. This way, you get a list of exactly what you want, without the irrelevant noise.
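As a rough illustration of the idea, the sketch below applies a genre and release-date condition to a handful of records in plain Python. The movie titles and field values are invented for this example, and `matches` is a hypothetical helper, not part of any library.

```python
# Hypothetical movie records; titles and values are made up for illustration.
movies = [
    {"title": "Silent Laughs", "genre": ["comedy"], "release_date": "1925-06-26"},
    {"title": "Space Chase", "genre": ["action", "scifi"], "release_date": "2015-05-01"},
    {"title": "Office Pranks", "genre": ["comedy"], "release_date": "2008-03-14"},
]

def matches(movie, genre, released_before):
    """True if the movie has the genre and was released before the given ISO date."""
    return genre in movie["genre"] and movie["release_date"] < released_before

# Only records that satisfy both conditions would be passed on to the semantic search.
candidates = [m["title"] for m in movies if matches(m, "comedy", "2000-01-01")]
print(candidates)
```

Only the records that pass this check would then be ranked by semantic similarity.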

How smart filtering works

Smart filtering is a multi-step process that involves extracting information from your data, analyzing it, and creating specific filters based on your needs. Let's break it down:
  1. Metadata extraction: The first step is to gather relevant information about your data. This includes details like:
    • Data structure: How is the data organized (e.g., tables, documents)?
    • Attributes: What kind of information is included (e.g., title, description, release date)?
    • Data types: What format is the data in (e.g., text, numbers, dates)?
  2. Pre-filter generation: Once you have the metadata, you can start creating pre-filters. These are specific conditions that data must meet to be included in the search results. For example, if you're searching for comedy movies released before 2000, you might create pre-filters for:
    • Genre: comedy
    • Release date: before 2000
  3. Integration with vector search: The final step is to combine these pre-filters with your vector search. This ensures that the vector search only considers data points that match your pre-defined criteria.
By following these steps, smart filtering significantly improves the accuracy and efficiency of your search results.
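The metadata extraction step can be sketched in a few lines of Python. The helper below, `infer_metadata_schema`, is a hypothetical illustration that lists each field and the value types seen for it; the sample values are borrowed from the movie data used later in this tutorial. In the tutorial itself, the metadata is described by hand rather than inferred.

```python
def infer_metadata_schema(records):
    """Map each metadata field to the sorted list of type names seen for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            if isinstance(value, list):
                # Describe lists by the types of their elements, e.g. "[str]".
                inner = sorted({type(item).__name__ for item in value})
                type_name = f"[{' | '.join(inner)}]"
            else:
                type_name = type(value).__name__
            schema.setdefault(field, set()).add(type_name)
    return {field: sorted(types) for field, types in schema.items()}

sample_metadata = [
    {"release_date": "1994-04-15", "rating": 7.7, "genre": ["action", "scifi"]},
    {"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
]
schema = infer_metadata_schema(sample_metadata)
print(schema)
```

A summary like this tells you which attributes are available for pre-filters and what kind of comparison each one supports.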

Implementation

  • Notebook
  • We use MongoDBAtlasVectorSearch from the LangChain framework and GPT-4o to implement smart filtering.

Prerequisites

To be successful with this tutorial, you will need:
  • The IDE of your choosing. This tutorial uses a Jupyter notebook. Please feel free to run your commands directly from a notebook.
  • A MongoDB Atlas account.
  • A MongoDB Atlas cluster. The free tier will work perfectly.
  • An OpenAI API key. We will use OpenAI models to embed our data and generate filters. You will need access to:
    • text-embedding-ada-002 embedding model.
    • gpt-4o for text generation.
  • Python >=3.8.1, <4.0.
The following instructions are written for a notebook but can be adapted to run in your IDE with minor changes. For example, the !pip command below is notebook syntax; in a terminal, run pip without the leading !.

Install dependencies

Install the required dependencies.
!pip install pymongo==4.7.2 langchain-core==0.2.6 langchain-openai==0.1.7 langchain==0.2.1 langchain-community==0.2.4 lark==1.1.9

Sample data setup

For this tutorial, we will use some sample movie data, defined as a list of LangChain documents. Each document represents one movie: its page_content holds a description of the movie, and its metadata holds release_date, genre, and, for some movies, rating.
from langchain_core.documents import Document

# Sample movie data. The metadata has release_date, genre, and (for some movies) rating
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"release_date": "1994-04-15", "rating": 7.7, "genre": ["action", "scifi", "adventure"]},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"release_date": "2010-07-16", "rating": 8.2, "genre": ["action", "thriller"]},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"release_date": "2006-11-25", "rating": 8.6, "genre": ["anime", "thriller", "scifi"]},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"release_date": "2019-12-25", "rating": 8.3, "genre": ["romance", "drama", "comedy"]},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    ),
    Document(
        page_content="The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.",
        metadata={"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
    ),
    Document(
        page_content="The toys face an uncertain future as they are accidentally donated to a daycare center, leading to a thrilling escape plan.",
        metadata={"release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
    ),
]
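The release dates in the sample data are stored as strings in YYYY-MM-DD format. One reason this format is convenient for filtering is that comparing such strings alphabetically gives the same result as comparing the dates themselves. The sketch below checks this with a few of the dates from the sample data, using only the standard library.

```python
from datetime import date

release_dates = ["2010-07-16", "1994-04-15", "1999-11-24", "1995-11-22"]

# Sorting the raw strings gives the same order as sorting the parsed dates.
sorted_as_strings = sorted(release_dates)
sorted_as_dates = [d.isoformat() for d in sorted(date.fromisoformat(s) for s in release_dates)]
same_order = sorted_as_strings == sorted_as_dates

# A plain string comparison is therefore enough for "released before 2000".
released_before_2000 = [s for s in release_dates if s < "2000-01-01"]
print(same_order, released_before_2000)
```

This only holds when every date uses the same zero-padded format, so it is worth keeping the stored values consistent.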

MongoDB PyMongo client setup

Let us set up our MongoDB connection that we will use to insert our data.
import getpass
from pymongo import MongoClient

# Set up your MongoDB connection
connection_string = getpass.getpass(prompt="Enter connection string WITH USER + PASS here")

client = MongoClient(connection_string)

# Name your database and collection anything you want; they are created when you insert your data
database_name = "smart_filtering"
collection_name = "movies"
collection = client[database_name][collection_name]

VectorDB embedding object setup

For this tutorial, we are using OpenAI's text-embedding-ada-002 embedding model.
import json

from langchain_openai import OpenAIEmbeddings

# OpenAI API credentials
openai_api_base = getpass.getpass(prompt="Put in OpenAI URL here")
openai_api_key = getpass.getpass(prompt="Put in OpenAI API Key here")

# default_headers is optional; if provided, it is parsed as a JSON object
default_headers = getpass.getpass(prompt="Put in OpenAI API headers if applicable here")
default_headers = json.loads(default_headers) if default_headers else None

embeddings = OpenAIEmbeddings(
    openai_api_key=openai_api_key,
    openai_api_base=openai_api_base,
    default_headers=default_headers,
    model="text-embedding-ada-002",
)

Initialize the MongoDB collection

We will use the MongoDBAtlasVectorSearch vector store from LangChain to embed and ingest our data. Its from_documents method accepts a list of documents, the embedding object we created earlier, and the MongoDB collection. This step initializes our MongoDB collection with the movie data and its embeddings.
# Imported from langchain_community; the langchain.vectorstores path is deprecated
from langchain_community.vectorstores import MongoDBAtlasVectorSearch

# This step will generate the embeddings and insert the documents
vectorStore = MongoDBAtlasVectorSearch.from_documents(docs, embeddings, collection=collection)
After initialization, the inserted data will look like the following:
![A document inserted into the smart_filtering.movies collection, showing the text, embedding, and metadata fields.][2]
The page_content is stored in the text field, which is the default for the MongoDBAtlasVectorSearch vector store. The embedding is an array of floats saved under embedding. The metadata fields are stored as release_date, rating, and genre.
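The stored layout described above can be sketched as a plain dictionary. The helper `to_stored_document` is hypothetical (it is not a LangChain or PyMongo function), the embedding is a short placeholder, and the `_id` field that MongoDB adds on insert is omitted.

```python
def to_stored_document(page_content, metadata, embedding, text_key="text", embedding_key="embedding"):
    """Hypothetical helper mirroring the stored layout: text, embedding, then metadata fields."""
    return {text_key: page_content, embedding_key: embedding, **metadata}

stored = to_stored_document(
    page_content="Toys come alive and have a blast doing so",
    metadata={"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    # Placeholder values; a real text-embedding-ada-002 vector has 1536 floats.
    embedding=[0.01, -0.02, 0.03],
)
print(sorted(stored))
```

Because the metadata fields sit at the top level of the document, they can be referenced directly by name in the index definition and in pre-filters.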

Create search index

Before we can perform a search on our data, we need to create a search index. Follow these steps:
  • In the Atlas UI, go to your collection smart_filtering.movies.
  • Click on Search Indexes.
  • Click on Create Index.
  • Under Atlas Search, select JSON Editor.
  • Name your index default and copy/paste the index definition below:
{
  "analyzer": "lucene.standard",
  "searchAnalyzer": "lucene.standard",
  "mappings": {
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 1536,
        "similarity": "cosine"
      },
      "rating": {
        "type": "number"
      },
      "release_date": {
        "type": "token"
      },
      "genre": {
        "type": "token"
      }
    }
  }
}
The index will take a couple of seconds to build. After the index builds successfully, we will be ready to query our data.

Query the data

Let’s say a user wants to find documents for the latest movie released before some date in the animation genre with this query: I want to watch a movie released before year 2000 in the animation genre with the latest release date.
We will try semantic search for this query and see what results we get.
vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

docs = vectorStore.similarity_search(query)
for doc in docs:
    print(doc.page_content)
Output: We received four movies in the output, three of which are not relevant to the user’s query. This is the problem with semantic search that we will solve using smart filtering.
A bunch of scientists bring back dinosaurs and mayhem breaks loose
A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea
Toys come alive and have a blast doing so
Leo DiCaprio gets lost in a dream within a dream within a dream within a ...
Let us analyze the filtering requirements in the user query: I want to watch a movie released before year 2000 in the animation genre with the latest release date.
The query contains two types of filter requirements, and we will handle them in two stages.
We split the filtering into two stages because these are two different tasks, and LLMs tend to give better results when asked to perform one task at a time.
Stage 1 — metadata filter: A pre-filter can be generated based on the query and metadata.
  • Release before year 2000 can be a potential filter.
  • Animation genre can be a potential filter.
Stage 2 — time-based filter: We will also need to account for the latest release date. We will need to query the data to find the latest release date movie.

Stage 1: Metadata filter

We will be using LangChain’s load_query_constructor_runnable to generate our filter query and then we will be using MongoDBAtlasTranslator to convert the query to a valid MongoDB query. We will need the below to pass to load_query_constructor_runnable:
  • A description of the content of our data that will be passed in document_content
  • Metadata attributes of our data passed in attribute_info
  • Prompt with some examples that the LLM will use to generate the query

Metadata definition

For this tutorial, we will define the metadata that will be used for filtering. We need to provide a document_content_description for the content, plus the name, description, and type of each metadata field. This requires some basic understanding of the data.
Description of the content of our data:
document_content_description = "Brief summary of a movie"
Metadata attributes of our data:
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="Keywords for filtering: ['animation', 'action', 'comedy', 'romance', 'thriller']",
        type="[string]",
    ),
    AttributeInfo(
        name="release_date",
        description="The date the movie was released on",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
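It can help to picture how attribute definitions like these might appear to the LLM: as a JSON description of the data source, similar in shape to the example data sources shown later in this tutorial. The sketch below uses a hypothetical `Attribute` dataclass as a stand-in for AttributeInfo so that it runs without LangChain; the exact rendering LangChain applies internally may differ.

```python
import json
from dataclasses import dataclass

@dataclass
class Attribute:
    """Hypothetical stand-in for an attribute definition (name, description, type)."""
    name: str
    description: str
    type: str

def format_attributes(attributes):
    """Render attributes as a JSON object keyed by field name."""
    as_dict = {a.name: {"description": a.description, "type": a.type} for a in attributes}
    return json.dumps(as_dict, indent=4)

attributes = [
    Attribute("release_date", "The date the movie was released on", "string"),
    Attribute("rating", "A 1-10 rating for the movie", "float"),
]
rendered = format_attributes(attributes)
print(rendered)
```

Clear descriptions and accurate types matter here, since they are the only information the LLM has about which comparisons make sense for each field.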

Define prompt for metadata filtering

Our goal is to extract meaningful information from the user query that we can use for the metadata filtering. We will pass the metadata of our data in the context such that the LLM gets an idea of the information that can be used as a filter. We will use a few-shot prompting technique to generate our results.
Few-shot prompting is a technique used with large language models, where the model is given a few examples of a task within the prompt to help guide it to produce the desired output.
Note: Please update the prompt as per your use case.
The prompt below will be passed as schema_prompt to LangChain’s load_query_constructor_runnable, and it instructs the LLM on how to generate the query. It is based on the default schema prompt in LangChain’s query constructor, modified for our use case.
Let’s break down the prompt and understand it:
  • We are using PromptTemplate to define our prompt.
  • In the beginning, we instruct the LLM to output the result in JSON format with the rewritten query and filter as keys.
  • We instruct the LLM to not include any information in the new query that is already accounted for in the filter.
  • Variables wrapped in {} in the prompt will be filled in later.
from langchain_core.prompts import PromptTemplate


DEFAULT_SCHEMA = """\
<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ rewritten user's query after removing the information handled by the filter
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should be re-written. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ({allowed_comparators}): comparator
- `attr` (string): name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ({allowed_operators}): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure you understand the user's intent while generating a date filter. Use range comparators such as gt | gte | lt | lte for partial dates.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.\
"""
DEFAULT_SCHEMA_PROMPT = PromptTemplate.from_template(DEFAULT_SCHEMA)
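The quadruple braces in DEFAULT_SCHEMA can look odd at first. A likely reason, stated here as an assumption, is that the text passes through template formatting twice before it reaches the LLM, and each pass turns a doubled brace into a single one. The sketch below shows the effect using Python's built-in str.format as a stand-in for the prompt template machinery.

```python
template = '{{{{ "filter": string }}}} allowed: {allowed_comparators}'

# First formatting pass: fills in the variable and halves each run of braces.
first_pass = template.format(allowed_comparators="eq | lt | in")

# Second formatting pass, for example when the text is embedded in a larger template.
second_pass = first_pass.format()

print(first_pass)
print(second_pass)
```

After the second pass, the braces are single again, which is what a JSON example in the final prompt needs. The data source examples, which are formatted only once, use double braces for the same reason.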
Now, we will define some examples that we will use in our prompt. The examples will help the LLM to generate better results. We will need to define metadata for our example, a user query, and an expected answer.
Let’s define three data sources:
  1. A songs data source that has a content description and attributes definition.
SONG_DATA_SOURCE = """\
```json
{{
    "content": "Lyrics of a song",
    "attributes": {{
        "artist": {{
            "type": "string",
            "description": "Name of the song artist"
        }},
        "length": {{
            "type": "integer",
            "description": "Length of the song in seconds"
        }},
        "genre": {{
            "type": "[string]",
            "description": "The song genre, one or many of [\"pop\", \"rock\" or \"rap\"]"
        }},
        "release_dt": {{
            "type": "string",
            "description": "Release date of the song."
        }}
    }}
}}
```\
"""
  2. A movies data source. This is similar to the sample data we are working with. Adding it to the few-shot examples can improve our results.
MOVIES_DATA_SOURCE = """\
```json
{{
    "content": "Brief summary of a movie",
    "attributes": {{
        "release_date": {{
            "type": "string",
            "description": "The release date of the movie"
        }},
        "genre": {{
            "type": "[string]",
            "description": "Keywords for filtering: ['anime', 'action', 'comedy', 'romance', 'thriller']"
        }}
    }}
}}
```\
"""
  3. A generic keyword data source. The LLM was struggling to generate the correct query format for keyword/array filtering, so we added this to improve our results.
KEYWORDS_DATA_SOURCE = """\
```json
{{
    "content": "Documents store",
    "attributes": {{
        "tags": {{
            "type": "[string]",
            "description": "Keywords for filtering: ['rag', 'genai', 'gpt', 'langchain', 'llamaindex']"
        }}
    }}
}}
```\
"""
Note: Add examples that match your own use case to improve the results.
Now, let’s define some example user queries and expected answers:
KEYWORDS_DATA_SOURCE_ANSWER = """\
```json
{{
    "query": "Give me updates",
    "filter": "in(\\"tags\\", [\\"rag\\", \\"langchain\\"])"
}}
```\
"""

KEYWORDS_DATE_DATA_SOURCE_ANSWER = """\
```json
{{
    "query": "Tell me updates on connectors based on the latest documentation",
    "filter": "in(\\"tags\\", [\\"langchain\\", \\"llamaindex\\"])"
}}
```\
"""

FULL_ANSWER = """\
```json
{{
    "query": "songs about teenage romance",
    "filter": "and(or(eq(\\"artist\\", \\"Taylor Swift\\"), eq(\\"artist\\", \\"Katy Perry\\")), lt(\\"length\\", 180), in(\\"genre\\", [\\"pop\\"]), and(gt(\\"release_dt\\", \\"2010-12-31\\"), lt(\\"release_dt\\", \\"2020-01-01\\")))"
}}
```\
"""

# The closing parenthesis of the outer and(...) is required for a well-formed filter
DATE_ANSWER = """\
```json
{{
    "query": "Recommend a movie with latest release date",
    "filter": "and(lt(\\"release_date\\", \\"2010-01-01\\"), in(\\"genre\\", [\\"action\\", \\"thriller\\"]))"
}}
```\
"""

NO_FILTER_ANSWER = """\
```json
{{
    "query": "",
    "filter": "NO_FILTER"
}}
```\
"""
Putting the above together, we can define our examples with the data source definition, user query, and expected answer:
DEFAULT_EXAMPLES = [
    {
        "i": 1,
        "data_source": MOVIES_DATA_SOURCE,
        "user_query": "Recommend an action or thriller genre movie released before 2010 and latest release date",
        "structured_request": DATE_ANSWER,
    },
    {
        "i": 2,
        "data_source": MOVIES_DATA_SOURCE,
        "user_query": "Recommend a latest movie",
        "structured_request": NO_FILTER_ANSWER,
    },
    {
        "i": 3,
        "data_source": SONG_DATA_SOURCE,
        "user_query": "What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre released before 1 January 2020 and after 31 December, 2010",
        "structured_request": FULL_ANSWER,
    },
    {
        "i": 4,
        "data_source": SONG_DATA_SOURCE,
        "user_query": "What are songs that were not published on Spotify",
        "structured_request": NO_FILTER_ANSWER,
    },
    {
        "i": 5,
        "data_source": KEYWORDS_DATA_SOURCE,
        "user_query": "Give me updates on rag with langchain",
        "structured_request": KEYWORDS_DATA_SOURCE_ANSWER,
    },
    {
        "i": 6,
        "data_source": KEYWORDS_DATA_SOURCE,
        "user_query": "Tell me updates on langchain and llamaindex connectors based on the latest documentation",
        "structured_request": KEYWORDS_DATE_DATA_SOURCE_ANSWER,
    },
]
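To see how examples like these can guide the model, here is a minimal sketch of assembling a few-shot prompt by hand: the worked examples come first, and the new query is appended with its answer left open. The function `build_few_shot_prompt` and its text layout are hypothetical; LangChain's query constructor builds its own prompt internally, and its exact wording may differ.

```python
def build_few_shot_prompt(examples, data_source, user_query):
    """Hypothetical layout: worked examples first, then the new query for the model to complete."""
    parts = []
    for ex in examples:
        parts.append(
            f"<< Example {ex['i']}. >>\n"
            f"Data Source:\n{ex['data_source']}\n\n"
            f"User Query:\n{ex['user_query']}\n\n"
            f"Structured Request:\n{ex['structured_request']}\n"
        )
    # The final block repeats the pattern but leaves the structured request empty.
    next_i = len(examples) + 1
    parts.append(
        f"<< Example {next_i}. >>\n"
        f"Data Source:\n{data_source}\n\n"
        f"User Query:\n{user_query}\n\n"
        f"Structured Request:\n"
    )
    return "\n".join(parts)

# A shortened toy example, not the full definitions used in this tutorial.
toy_examples = [
    {
        "i": 1,
        "data_source": '{"content": "Documents store"}',
        "user_query": "Give me updates on rag with langchain",
        "structured_request": '{"query": "Give me updates", "filter": "in(\\"tags\\", [\\"rag\\", \\"langchain\\"])"}',
    },
]
prompt = build_few_shot_prompt(
    toy_examples,
    data_source='{"content": "Brief summary of a movie"}',
    user_query="Recommend a latest movie",
)
print(prompt)
```

Because the prompt ends right where an answer is expected, the model tends to continue in the same format as the examples before it.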
Let us define a utility function that we will use to process our filters before returning.
def enforce_constraints(input_json):
    def process_value(value):
        # Plain scalars pass through unchanged (float covers values such as rating)
        if isinstance(value, (str, int, float)):
            return value
        elif isinstance(value, list) and all(isinstance(item, str) for item in value):
            return value
        elif isinstance(value, dict) and 'date' in value and isinstance(value['date'], str):
            # Date values arrive wrapped in a dict; keep only the date string
            return value['date']
        else:
            raise ValueError("Invalid value type")

    def process_dict(d):
        if not isinstance(d, dict):
            return d
        processed_dict = {}
        for k, v in d.items():
            if k.startswith("$") and isinstance(v, list):
                # Handling $and and $or conditions
                processed_dict[k] = [process_dict(item) for item in v]
            elif k.startswith("$"):
                processed_dict[k] = process_value(v)
            else:
                processed_dict[k] = process_dict(v)
        return processed_dict

    return process_dict(input_json)
Now, let's go ahead and define our generate_metadata_filter function that we will be using to generate our metadata filters.
from langchain.chains.query_constructor.base import load_query_constructor_runnable
from langchain_community.query_constructors.mongodb_atlas import MongoDBAtlasTranslator

translator = MongoDBAtlasTranslator()
allowed_operators = translator.allowed_operators
allowed_comparators = translator.allowed_comparators

def generate_metadata_filter(query, llm, document_content_description, metadata_field_info, schema_prompt, examples):
    """
    Use the query constructor to generate the pre-filter and the rewritten query for a dataset.
    :param query: query submitted by the user
    :param llm: llm instance
    :param document_content_description: Data description
    :param metadata_field_info: metadata fields information
    :param schema_prompt: prompt instructions that will be passed to the llm
    :param examples: list of examples
    :return (tuple): The pre-filter (dict) and the rewritten query (str).
    """
    query_constructor = load_query_constructor_runnable(
        llm=llm,
        document_contents=document_content_description,
        attribute_info=metadata_field_info,
        schema_prompt=schema_prompt,
        examples=examples,
        allowed_comparators=allowed_comparators,
        allowed_operators=allowed_operators
    )
    query = f"""Answer the below question:\n
    Question: {query}
    """
    structured_query = query_constructor.invoke(query)
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    pre_filter = enforce_constraints(new_kwargs)
    # Fall back to an empty filter when no pre_filter was generated
    pre_filter = pre_filter.get("pre_filter", {}) if pre_filter else {}
    return pre_filter, new_query
Let us define our LLM object for text generation. We will use LangChain’s ChatOpenAI for this purpose. We will use the gpt-4o model for our filter generation.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    openai_api_key=openai_api_key,
    openai_api_base=openai_api_base,
    default_headers=default_headers,
    model="gpt-4o",
)
Now that we have everything we need to generate the metadata filter, let’s give it a try.
query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

pre_filter, new_query = generate_metadata_filter(
    query=query,
    llm=llm,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info,
    schema_prompt=DEFAULT_SCHEMA_PROMPT,
    examples=DEFAULT_EXAMPLES)

print(pre_filter)
print(new_query)

vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

docs = vectorStore.similarity_search(query=query, pre_filter=pre_filter)
for doc in docs:
    print(doc.page_content)
Output:
Generated filter and new query:
{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
I want to watch a movie with the latest release date
As you can see in the generated filter, we were able to extract the information from the user’s query and the metadata, such as “release before year 2000” and “animation genre,” that can be used to pre-filter the data before running the semantic search.
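To check what a filter like this keeps, the sketch below evaluates it against a few plain dictionaries. The `matches_filter` function is a hypothetical, simplified evaluator that supports only a handful of operators; it is meant for reasoning about the filter, not as a replacement for MongoDB's own query engine.

```python
def matches_filter(doc, flt):
    """Evaluate a small subset of MongoDB query syntax ($and, $or, $lt, $gt, $in) against a dict."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches_filter(doc, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches_filter(doc, sub) for sub in cond):
                return False
        else:
            value = doc.get(key)
            for op, operand in cond.items():
                if op == "$lt":
                    ok = value is not None and value < operand
                elif op == "$gt":
                    ok = value is not None and value > operand
                elif op == "$in":
                    # For array fields, a match on any element is enough.
                    values = value if isinstance(value, list) else [value]
                    ok = any(v in operand for v in values)
                else:
                    raise ValueError(f"Unsupported operator: {op}")
                if not ok:
                    return False
    return True

example_filter = {"$and": [{"release_date": {"$lt": "2000-01-01"}}, {"genre": {"$in": ["animation"]}}]}
sample_movies = [
    {"text": "A bunch of scientists bring back dinosaurs and mayhem breaks loose",
     "release_date": "1994-04-15", "genre": ["action", "scifi", "adventure"]},
    {"text": "Toys come alive and have a blast doing so",
     "release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    {"text": "The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.",
     "release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
    {"text": "The toys face an uncertain future as they are accidentally donated to a daycare center, leading to a thrilling escape plan.",
     "release_date": "2010-06-18", "genre": ["animation", "adventure", "comedy"]},
]
kept = [m["text"] for m in sample_movies if matches_filter(m, example_filter)]
print(kept)
```

Only the two animation movies released before 2000 pass the filter, so the semantic search has just those two candidates to rank.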
Note that we are also returning a new query after removing the filters that we generated. This will be helpful in the next stage of filter generation.
Output documents:
Toys come alive and have a blast doing so
The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.
We received two documents in the output, which is a better result than before. But we still have not narrowed the results down to the movie with the latest release date.
To find the latest movie, we need to query our data, so we move on to Stage 2 of filter generation.

Stage 2: Time-based filtering

The purpose of this stage is to generate the filters that can be used to find the movies with the latest release date.
Note: We will use the filter generated in Stage 1 in this stage, because we want to find the movie with the latest release date among the animation movies released before 2000.
We will need to query our MongoDB collection via the LLM. For this purpose, we will define some tools that the LLM can use to query our data.
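Before handing this task to an LLM, it may help to see the intended result computed by hand. The sketch below takes the documents that passed Stage 1, finds the newest release date, and combines a date condition with the Stage 1 filter. The helper `latest_release_filter` and the way the two filters are merged with $and are assumptions made for illustration.

```python
def latest_release_filter(documents, stage1_filter, date_field="release_date"):
    """Hypothetical helper: build a filter that keeps only the newest date among the documents."""
    if not documents:
        return stage1_filter
    # Similar in spirit to an aggregation with {"$sort": {date_field: -1}} and {"$limit": 1}.
    latest = max(doc[date_field] for doc in documents)
    time_filter = {date_field: {"$gte": latest}}
    # Keep the Stage 1 conditions and add the date condition on top.
    return {"$and": [stage1_filter, time_filter]} if stage1_filter else time_filter

stage1_filter = {"$and": [{"release_date": {"$lt": "2000-01-01"}}, {"genre": {"$in": ["animation"]}}]}
stage1_matches = [
    {"release_date": "1995-11-22", "genre": ["animation", "fantasy"]},
    {"release_date": "1999-11-24", "genre": ["animation", "adventure", "comedy"]},
]
final_filter = latest_release_filter(stage1_matches, stage1_filter)
print(final_filter)
```

With the sample data, the combined filter keeps only documents released on or after 1999-11-24 and before 2000-01-01 in the animation genre.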

Tools

Let us define the tools that allow the LLM to query our MongoDB collection.
import json
import logging
import os
import traceback
from typing import Dict, Optional, Type, Union, List

from pymongo import MongoClient
from langchain_core.callbacks import CallbackManagerForToolRun
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.tools import BaseTool

class MongoDBClient:
    """Data helper for querying MongoDB Vector Indexes."""

    def __init__(self, collection):
        self.collection = collection

    def run_aggregate_pipeline(self, pipeline: List[Dict]) -> List[Dict]:
        documents = list(self.collection.aggregate(pipeline))
        return documents

class BaseMongoDBTool(BaseModel):
    """Base tool for interacting with MongoDB."""

    client: MongoDBClient = Field(exclude=True)
    match_filter: dict = Field(exclude=True)

    class Config(BaseTool.Config):
        pass

class _QueryExecutorMongoDBToolInput(BaseModel):
    pipeline: str = Field(..., description="A valid MongoDB pipeline in JSON string format")

class QueryExecutorMongoDBTool(BaseMongoDBTool, BaseTool):
    name: str = "mongo_db_executor"
    description: str = """
    Input to this tool is a mongodb pipeline, output is a list of documents.
    If the pipeline is not correct, an error message will be returned.
    If an error is returned, report back to the user the issue and stop.
    """
    args_schema: Type[BaseModel] = _QueryExecutorMongoDBToolInput

    def _run(
        self,
        pipeline: str,
        run_manager: Optional[CallbackManagerForToolRun] = None,
    ) -> Union[List[Dict], str]:
        """Get the result for the mongodb pipeline."""
        try:
            pipeline = json.loads(pipeline)
            # Apply the Stage 1 filter first so the LLM's pipeline only sees matching documents
            if self.match_filter:
                pipeline = [{"$match": self.match_filter}] + pipeline
            print(f"Updated pipeline: {pipeline}")
            documents = self.client.run_aggregate_pipeline(pipeline)
            return documents
        except Exception as e:
            # Return the formatted error so the LLM can report it instead of raising
            return f"Error: {e}\n{traceback.format_exc()}"
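The key step in the tool is prepending a $match stage, so that whatever pipeline the LLM writes only runs over documents that already satisfy the Stage 1 filter. The sketch below isolates that step so it can be tried without a database; `prepend_match` is a hypothetical helper, and the pipeline string is an example of what an LLM might produce, not captured output.

```python
import json

def prepend_match(pipeline_json, match_filter):
    """Parse a JSON pipeline string and put a $match stage first when a filter is given."""
    pipeline = json.loads(pipeline_json)
    if match_filter:
        pipeline = [{"$match": match_filter}] + pipeline
    return pipeline

# Example pipeline of the kind the prompt asks for: sort by date, keep one, project the date.
llm_pipeline = '[{"$sort": {"release_date": -1}}, {"$limit": 1}, {"$project": {"_id": 0, "release_date": 1}}]'
stage1_filter = {"genre": {"$in": ["animation"]}}

full_pipeline = prepend_match(llm_pipeline, stage1_filter)
print(full_pipeline[0])
```

Keeping the filter outside the LLM's control in this way means the model cannot accidentally drop the Stage 1 conditions when it writes its pipeline.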

Prompt definition

In the second stage, we only account for queries where the user asks for a "latest," "recent," "first," or "last" kind of result. We will instruct our LLM to generate an aggregation pipeline, and from its results a filter, only for these types of queries.
SYSTEM_PROMPT_TEMPLATE = """
Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{{{{
    "query": string \\ rewritten user's query after removing the information handled by the filter
    "filter": string \\ logical condition statement for filtering documents
}}}}
```

The query string should be re-written. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` ('eq | ne | gt | gte | lt | lte | in | nin'): comparator
- `attr` (string): name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` ('and | or'): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

First step is to think about whether the user question mentions anything about date or time related that require a lookup in the MongoDB database. Words like "latest", "recent", "earliest", "first", "last" etc. in the query means a look up could be required. If no lookup is required, return "NO_FILTER" for the filter value.

If required, create a syntactically correct MongoDB aggregation pipeline using '$sort' and '$limit' operator to run. Use projection to only fetch the relevant date columns. Then look at the results of the aggregation pipeline and generate a date range query that can be used to filter relevant documents from the collection.

Make sure to only generate date-based filters.
Make sure to only generate the query if a user asks about a time based question such as latest, most recent and not mention a specific date time.
Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to date/time attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.
Make sure the column names in the filter query are in double quotes.

<< Data Source >>
```json
{{{{
    "content": {content_description},
    "attributes": {attribute_info}
}}}}
```
"""
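To make the statement grammar in the prompt more concrete, here is a minimal sketch that turns a nested statement such as `and(lt(...), in(...))` into a MongoDB-style filter. It is a simplified stand-in for the LangChain translator used in this tutorial, not its actual implementation; the tuple representation and the `to_mongo_filter` name are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple

# Comparators and logical operators named in the prompt above.
COMPARATORS = {"eq", "ne", "gt", "gte", "lt", "lte", "in", "nin"}
OPERATORS = {"and", "or"}


def to_mongo_filter(statement: Tuple[Any, ...]) -> Dict[str, Any]:
    """Translate a nested (name, *args) tuple into a MongoDB filter dict."""
    name, *args = statement
    if name in COMPARATORS:
        # comp(attr, val) becomes {attr: {"$comp": val}}
        attr, val = args
        return {attr: {f"${name}": val}}
    if name in OPERATORS:
        # op(statement1, statement2, ...) becomes {"$op": [translated statements]}
        return {f"${name}": [to_mongo_filter(arg) for arg in args]}
    raise ValueError(f"Unsupported comparator or operator: {name}")


# Hypothetical statement: and(lt("release_date", "2000-01-01"), in("genre", ["animation"]))
statement = (
    "and",
    ("lt", "release_date", "2000-01-01"),
    ("in", "genre", ["animation"]),
)
mongo_filter = to_mongo_filter(statement)
print(mongo_filter)
```

Restricting the translation to the comparators and operators listed in the prompt means anything else the LLM produces is rejected instead of being passed through to the database.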
Now, let's define the `generate_time_based_filter` function, which we will use to generate our time-based filters.

We have also handled the cases where only one stage of filter generation is required, or none at all.

```python
from typing import Dict, List, Tuple

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.chains.query_constructor.base import (
    AttributeInfo,
    StructuredQueryOutputParser,
    _format_attribute_info,
)
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    MessagesPlaceholder,
    PromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain_openai import ChatOpenAI
from pymongo.collection import Collection

# MongoDBClient, QueryExecutorMongoDBTool, translator, and enforce_constraints
# are not defined in this snippet; they are assumed to be available from the
# rest of the tutorial code.


def generate_time_based_filter(
    llm: ChatOpenAI,
    collection: Collection,
    pre_filter: Dict,
    query: str,
    document_content_description: str,
    metadata_field_info: List[AttributeInfo],
) -> Tuple[Dict, str]:
    """
    This function is responsible for generating filter query for "most recent", "latest", "earliest" type of user
    questions.
    :param llm: (ChatOpenAI) llm instance
    :param collection: pymongo collection instance
    :param pre_filter: (Dict) metadata pre-filter query
    :param query: (str) user query
    :param document_content_description: (str) description of data
    :param metadata_field_info: (List[AttributeInfo]) list of metadata attributes information
    :return: (Tuple[Dict, str]) time-based filter query and re-written user query
    """
    # The tool lets the agent query the collection; pre_filter is passed as its match filter.
    client = MongoDBClient(collection=collection)
    executor_tool = QueryExecutorMongoDBTool(client=client, match_filter=pre_filter)
    tools = [executor_tool]

    # Fill the data source section of the system prompt.
    attribute_str = _format_attribute_info(metadata_field_info)
    system_prompt_template = SYSTEM_PROMPT_TEMPLATE.format(
        attribute_info=attribute_str,
        content_description=document_content_description,
    )

    prompt = ChatPromptTemplate(
        input_variables=["agent_scratchpad", "input"],
        messages=[
            SystemMessagePromptTemplate(
                prompt=PromptTemplate(input_variables=[], template=system_prompt_template)
            ),
            MessagesPlaceholder(variable_name="chat_history", optional=True),
            HumanMessagePromptTemplate(
                prompt=PromptTemplate(input_variables=["input"], template="{input}")
            ),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ],
    )

    # Run the tool-calling agent on the user query.
    agent = create_tool_calling_agent(llm, tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
    agent_response = agent_executor.invoke({"input": query})

    # Only attributes described in the metadata may appear in the filter.
    allowed_attributes = [
        ainfo.name if isinstance(ainfo, AttributeInfo) else ainfo["name"]
        for ainfo in metadata_field_info
    ]

    output_parser = StructuredQueryOutputParser.from_components(
        allowed_comparators=translator.allowed_comparators,
        allowed_operators=translator.allowed_operators,
        allowed_attributes=allowed_attributes,
    )
    # Parse the agent's JSON answer and translate it into a MongoDB filter.
    structured_query = output_parser.parse(agent_response["output"])
    new_query, new_kwargs = translator.visit_structured_query(structured_query)
    time_based_pre_filter = enforce_constraints(new_kwargs)
    time_based_pre_filter = time_based_pre_filter.get("pre_filter", {}) if time_based_pre_filter else {}
    return time_based_pre_filter, new_query
```
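The quadruple braces in `SYSTEM_PROMPT_TEMPLATE` are easier to follow once you notice that the text is formatted twice: first by `str.format` in the function above, and then again when the `PromptTemplate` renders it (assuming LangChain's default f-string template format). The sketch below reproduces the two passes with plain `str.format` on a shortened, hypothetical template.

```python
# Shortened stand-in for SYSTEM_PROMPT_TEMPLATE with the same brace pattern.
template = (
    "{{{{\n"
    '  "content": {content_description}\n'
    "}}}}"
)

# First pass: fills the placeholder and turns each "{{{{" into "{{".
first_pass = template.format(content_description='"Brief summary of a movie"')

# Second pass: stands in for the prompt template rendering; "{{" becomes "{".
second_pass = first_pass.format()

print(first_pass)
print(second_pass)
```

With only two braces, the second pass would treat the JSON body as a placeholder and fail, which is why the literal braces are escaped twice.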

#### Metadata and time-based filter generation

Let's now run both filter stages and check the result.

```python
query = "I want to watch a movie released before year 2000 in the animation genre with the latest release date"

pre_filter, new_query = generate_metadata_filter(
    query=query,
    llm=llm,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info,
    schema_prompt=DEFAULT_SCHEMA_PROMPT,
    examples=DEFAULT_EXAMPLES)

print(pre_filter)
print(new_query)

time_based_pre_filter, final_query = generate_time_based_filter(
    llm=llm,
    collection=collection,
    pre_filter=pre_filter,
    query=new_query,
    document_content_description=document_content_description,
    metadata_field_info=metadata_field_info
)
print(time_based_pre_filter)
print(final_query)
```

Output:

```python
{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}
I want to watch a movie with the latest release date

{'release_date': {'$eq': '1999-11-24'}}
I want to watch a movie
```
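
To see where a value like `1999-11-24` could come from, here is a rough sketch of the kind of lookup the agent is prompted to perform: sort the documents that match the stage-one filter by date, keep the first one, and turn its date into a filter. The pipeline in the tutorial is generated by the LLM and may look different; `build_date_lookup_pipeline` and `date_filter_from_result` are hypothetical helpers, and the sample result is hard-coded instead of being read from a database.

```python
from typing import Any, Dict, List


def build_date_lookup_pipeline(
    pre_filter: Dict[str, Any], date_field: str, latest: bool = True
) -> List[Dict[str, Any]]:
    """Build a pipeline returning the latest (or earliest) date among the matches."""
    pipeline: List[Dict[str, Any]] = []
    if pre_filter:
        pipeline.append({"$match": pre_filter})
    pipeline += [
        {"$sort": {date_field: -1 if latest else 1}},  # newest first when latest=True
        {"$limit": 1},
        {"$project": {"_id": 0, date_field: 1}},  # fetch only the date column
    ]
    return pipeline


def date_filter_from_result(result: Dict[str, Any], date_field: str) -> Dict[str, Any]:
    """Turn the single document returned by the pipeline into a date filter."""
    return {date_field: {"$eq": result[date_field]}}


stage_one_filter = {
    "$and": [
        {"release_date": {"$lt": "2000-01-01"}},
        {"genre": {"$in": ["animation"]}},
    ]
}
pipeline = build_date_lookup_pipeline(stage_one_filter, "release_date")

# With pymongo this would be roughly: result = next(collection.aggregate(pipeline))
sample_result = {"release_date": "1999-11-24"}  # assumed result, for illustration only
time_filter = date_filter_from_result(sample_result, "release_date")
print(pipeline)
print(time_filter)
```

Sorting works on these values because dates stored as `YYYY-MM-DD` strings order the same way lexicographically as they do chronologically.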

Now that we have generated the filters for both stages, we can combine them using the `$and` operator to generate our final filter.

```python
# initialize final filter with stage 1 filter
final_pre_filter = pre_filter
if time_based_pre_filter:
    # add time_based_filter if applicable
    final_pre_filter = {"$and": [pre_filter, time_based_pre_filter]}

print(final_pre_filter)
```

Output:

```python
{'$and': [{'$and': [{'release_date': {'$lt': '2000-01-01'}}, {'genre': {'$in': ['animation']}}]}, {'release_date': {'$eq': '1999-11-24'}}]}
```
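
If either stage can return an empty filter, a slightly more defensive way to combine them is to skip empty filters and flatten nested `$and` clauses. The sketch below is one possible approach, not part of the tutorial's code; `combine_filters` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def combine_filters(*filters: Dict[str, Any]) -> Dict[str, Any]:
    """Combine filters with $and, skipping empty ones and flattening nested $and."""
    clauses: List[Dict[str, Any]] = []
    for current in filters:
        if not current:
            continue  # ignore empty or missing filters
        if set(current) == {"$and"}:
            clauses.extend(current["$and"])  # flatten a top-level $and
        else:
            clauses.append(current)
    if not clauses:
        return {}
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


metadata_filter = {
    "$and": [
        {"release_date": {"$lt": "2000-01-01"}},
        {"genre": {"$in": ["animation"]}},
    ]
}
time_filter = {"release_date": {"$eq": "1999-11-24"}}
combined = combine_filters(metadata_filter, time_filter)
print(combined)
```

The flattened filter expresses the same conditions as the nested one above; it is just easier to read and log.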

Now, let’s run a semantic search using the `final_pre_filter`.

```python
vectorStore = MongoDBAtlasVectorSearch(collection, embeddings)

docs = vectorStore.similarity_search(query=query, pre_filter=final_pre_filter)
for doc in docs:
    print(doc.page_content)
```

Output:

```bash
The toys embark on a rescue mission to save Woody after he is stolen by a toy collector.
```

By combining the metadata and time-based stages, smart filtering produced a filter that narrows down the data before the semantic search runs. As a result, the search returned only the documents that match the conditions in the query.

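For readers who want to see what a pre-filtered search looks like without the LangChain wrapper, the sketch below builds a comparable `$vectorSearch` aggregation pipeline by hand. The index name `vector_index`, the `embedding` and `text` field names, and the `build_vector_search_pipeline` helper are assumptions made for illustration; adjust them to match your own collection and index.

```python
from typing import Any, Dict, List


def build_vector_search_pipeline(
    query_vector: List[float],
    pre_filter: Dict[str, Any],
    index_name: str = "vector_index",  # assumed index name
    path: str = "embedding",  # assumed field holding the vectors
    limit: int = 4,
    num_candidates: int = 100,
) -> List[Dict[str, Any]]:
    """Build a $vectorSearch pipeline that applies pre_filter before the search."""
    stage: Dict[str, Any] = {
        "index": index_name,
        "path": path,
        "queryVector": query_vector,
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if pre_filter:
        stage["filter"] = pre_filter  # restricts which documents are considered
    return [
        {"$vectorSearch": stage},
        # "text" is an assumed name for the field that stores the document content.
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


final_filter = {
    "$and": [
        {"release_date": {"$lt": "2000-01-01"}},
        {"genre": {"$in": ["animation"]}},
        {"release_date": {"$eq": "1999-11-24"}},
    ]
}
search_pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], final_filter)
print(search_pipeline[0]["$vectorSearch"]["filter"])
```

Fields referenced in the filter, such as `release_date` and `genre` here, need to be declared as `filter` fields in the Atlas Vector Search index definition for the pre-filter to work.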
## Benefits of smart filtering

Smart filtering brings several advantages that make it a valuable tool for enhancing search experiences:

* **Improved search accuracy:** Because the pre-filter restricts the search to documents that match the conditions in the query, the results are more likely to be relevant. No more wading through irrelevant information.
* **Faster search results:** Since smart filtering narrows down the search scope, the system has less data to process, which can lead to quicker results.
* **Enhanced user experience:** When users find what they're looking for quickly and easily, it leads to higher satisfaction and a better overall experience.
* **Versatility:** Smart filtering can be applied to various domains, from e-commerce product searches to content recommendations.

By leveraging metadata and creating targeted pre-filters, smart filtering empowers you to deliver search results that truly meet user expectations.

## Conclusion

Smart filtering is a powerful tool that transforms search experiences by bridging the gap between user intent and search results. By harnessing the power of metadata and vector search, it delivers more accurate, relevant, and efficient search outcomes.

Whether you're building an e-commerce platform, a content recommendation system, or any application that relies on effective search, incorporating smart filtering can significantly enhance user satisfaction and drive better results.

By understanding the fundamentals of smart filtering, you're equipped to explore its potential and implement it in your projects. So why wait? Start leveraging the power of smart filtering today and revolutionize your search game!

View the [full source code](https://github.com/bhardwaj-vipul/SmartFilteringRAG) for smart filtering using MongoDB Atlas.

Check out additional resources: [Unlock the Power of Semantic Search With MongoDB Atlas Vector Search](https://www.mongodb.com/basics/semantic-search) and [Interactive RAG With MongoDB Atlas + Function Calling API](https://www.mongodb.com/developer/products/atlas/interactive-rag-mongodb-atlas-function-calling-api/). If you have any questions or want to show us what you are building, join us in the [MongoDB Community Forums](https://www.mongodb.com/community/forums/).


  [1]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt07d47760e8c04d15/66d6dd2326adee3c5af681ed/image2.png
  [2]: https://images.contentstack.io/v3/assets/blt39790b633ee0d5a7/blt6ce1fa4c1376e507/66d6dd233bf41e1189a41469/image1.png