
Building an Advanced RAG System With Self-Querying Retrieval

Apoorva Joshi, Maria Khalusova • 21 min read • Published Sep 12, 2024 • Updated Sep 12, 2024
AI • Python • Atlas
Imagine you are one of the developers responsible for building a product search chatbot for an e-commerce platform. You have seen all this talk about semantic search (vector) and Retrieval Augmented Generation (RAG), so you created a RAG chatbot that uses semantic search to help users search through your product catalog using natural language. A few days after you deploy this feature, your support team reports complaints from users not being able to find what they are looking for. One user reports searching for “Size 10 Nike running shoes priced under $150” and scrolling through pages of casual, out-of-budget, or non-Nike shoes before finding what they need.
This stems from a limitation of embeddings: they may not capture specific keywords or criteria from user queries. In this tutorial, we will look at some scenarios where vector search alone falls short and see how to address them using a technique called self-querying retrieval.
Specifically, in this tutorial, we will cover the following:
  • Extracting metadata filters from natural language
  • Combining metadata filtering with vector search
  • Getting structured outputs from LLMs
  • Building a RAG system with self-querying retrieval

What is metadata? Why is it important for RAG?

Metadata is information that describes your data. For products on an e-commerce marketplace, metadata might include the product ID, brand, category, size options, etc. Extracting the right metadata while processing raw data for your RAG applications can have several advantages downstream.
In MongoDB, the extracted metadata can be used as pre-filters to restrict the vector search based on criteria that might not be accurately captured via embeddings, such as numeric values, dates, categories, names, or unique identifiers such as product IDs. This leads to more relevant results being retrieved from semantic search and, consequently, more accurate responses from your RAG application.

Extracting metadata filters from natural language

Users typically interact with LLM applications using natural language. In the case of RAG, retrieving relevant documents from a knowledge base via vector search alone simply requires embedding the natural language user query. However, say you want to apply metadata filters to the vector search. In that case, this requires the additional step of extracting the right metadata from the user query and generating accurate filters to use along with vector search. This is where self-querying retrieval comes into play.
Self-querying retrieval, in the context of LLM applications, is a technique that uses LLMs to generate structured queries from natural language input and uses them to retrieve information from a knowledge base. In MongoDB, this involves the following steps:
  • Extracting metadata from natural language
  • Indexing the metadata fields
  • Generating accurate filter definitions containing only supported MongoDB Query API match expressions and operators
  • Executing the vector search query with the pre-filters

Building a RAG system with self-querying retrieval

In this tutorial, we will build a self-querying RAG system for an investment assistant. The goal of the assistant is to help investors make informed decisions by answering questions about financial standing, earnings, etc. for companies they are interested in. Investors might have questions about specific companies and fiscal years, but this information is difficult to capture using embeddings, especially after chunking. To overcome this, we will build self-querying retrieval into the assistant’s workflow as follows:
Self-querying RAG architecture
Given a user question, an LLM first tries to extract metadata from it according to a specified metadata schema. If the specified metadata is found, an LLM generates a filter definition that gets applied as a pre-filter during vector search. Otherwise, the user query alone is used to retrieve results using vector search. Finally, the retrieved context and user question, along with any other prompts, are passed to an LLM to generate an answer.

Before we begin

Data

We will use 10-K filings from 30 Fortune 500 companies — the raw PDFs are available in a Google Drive folder. These reports, required by the U.S. Securities and Exchange Commission (SEC), offer a deep dive into a company's financial performance, including financial history, statements, earnings per share, and other crucial data points. For investors, 10-Ks are invaluable tools for making informed decisions. You can easily access and download these reports by visiting the website of any publicly traded US company.

Tools

We will use Unstructured to chunk, embed, and extract metadata from PDF documents and prepare them for RAG. We will use MongoDB as the vector store for our application. Finally, we will use LangGraph to orchestrate our RAG system as a stateful, DAG-like workflow.

Where's the code?

The Jupyter Notebook for this tutorial can be found on GitHub.

Step 1: Decide what metadata to extract

The first step of self-querying retrieval is to decide what metadata to extract from your raw data. Doing this exercise before ingesting your data into a vector database ensures that all the required metadata is captured in the ingested documents, and the corresponding fields indexed. Not doing so beforehand can result in errors and inefficiencies downstream due to the LLM not extracting the right metadata from user queries, using wrong field names when generating the pre-filters, or querying for unindexed fields.
A good way to determine what metadata to extract is working backward from your best guess of the types of queries users might ask of your system. For our investment assistant, for example, we expect users to ask questions about specific companies and years. So this is the metadata we will start with. MongoDB’s flexible document model means you can always update existing documents with new metadata fields based on observations from real user interactions.

Step 2: Install the required libraries

We will need the following libraries for this tutorial (an example install command follows the list):
  • langgraph: Python package to build stateful, multi-actor applications with LLMs
  • openai: Python package to interact with OpenAI APIs
  • pymongo: Python package to interact with MongoDB databases and collections
  • sentence-transformers: Python package for open-source language models
  • unstructured-ingest: Python package for data processing using Unstructured
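The original install command is not reproduced here, but a typical way to install these in a notebook looks like the sketch below. Version pins are omitted, and some unstructured-ingest connectors (e.g., S3, MongoDB, Hugging Face embeddings) may require extra packages.

```python
# Install the required libraries; add connector-specific extras for
# unstructured-ingest (S3, MongoDB, embedding support) as needed
!pip install -qU langgraph openai pymongo sentence-transformers unstructured-ingest
```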

Step 3: Set up prerequisites

In this tutorial, we will use Unstructured’s Serverless API to prepare our dataset for RAG. The Serverless API is a production-grade API for processing 25+ different unstructured file types, including PDFs. To use the API, first sign up for a free 14-day trial. Once you sign up, you will get access to a personalized dashboard where you will find your API key and the API endpoint URL. Set these in your code:
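As a minimal sketch, you can read the key and endpoint URL into environment variables. The variable names used here are assumptions for this sketch, so align them with the rest of your code:

```python
import os
from getpass import getpass

# Unstructured Serverless API credentials (variable names assumed for this sketch)
os.environ["UNSTRUCTURED_API_KEY"] = getpass("Unstructured API key: ")
os.environ["UNSTRUCTURED_URL"] = input("Unstructured API endpoint URL: ")
```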
We will use MongoDB Atlas as the vector store for our RAG application. But first, you will need a MongoDB Atlas account with a database cluster. Once you do that, you will need to get the connection string to connect to your cluster. Follow these steps to get set up:
Once you have the connection string, set it in your code. Also, set the name of the database (MONGODB_DB_NAME) and collection (MONGODB_COLLECTION) you want to ingest data into. Finally, instantiate the MongoDB client to connect to your database:
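Here is a sketch of that setup, using placeholder database and collection names:

```python
from getpass import getpass
from pymongo import MongoClient

# Atlas connection string (from the cluster's "Connect" dialog)
MONGODB_URI = getpass("Enter your MongoDB connection string: ")

# Database and collection to ingest the processed documents into (placeholders)
MONGODB_DB_NAME = "your_database_name"
MONGODB_COLLECTION = "your_collection_name"

# Instantiate the MongoDB client and get a handle to the target collection
mongodb_client = MongoClient(MONGODB_URI)
collection = mongodb_client[MONGODB_DB_NAME][MONGODB_COLLECTION]
```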
Don't forget to add your host machine's IP address to your cluster's IP access list.
Unstructured supports loading documents from over 20 different sources, including AWS S3, Azure Storage, and Google Cloud Storage. For this tutorial, we have downloaded the raw PDFs into an S3 bucket, but you can also download them to a local directory and use them as your data source.
If you are using an S3 bucket as your data source, you will need to set the S3 URI and your AWS authentication credentials in the code as well. If using a local directory as the data source, skip this step:
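A minimal sketch of that configuration, with a placeholder S3 URI:

```python
import os
from getpass import getpass

# S3 source settings; skip this cell if you are using a local directory instead
AWS_S3_URI = "s3://your-bucket/path-to-10k-pdfs/"  # placeholder URI
os.environ["AWS_ACCESS_KEY_ID"] = getpass("AWS access key ID: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = getpass("AWS secret access key: ")
```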
Let’s also specify the embedding model and LLM to use:
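For example, as simple constants (the variable names are assumptions used throughout the sketches in this tutorial):

```python
# Open-source embedding model (768-dimensional embeddings) and the LLM used throughout
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"
LLM = "gpt-4o"
```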
We will use BAAI’s open-source bge-base-en-v1.5 embedding model from Hugging Face. The model produces embeddings with 768 dimensions. As for the LLM, we will use the latest version of GPT-4o throughout the tutorial.

Step 4: Partition, chunk, and embed PDF files

Now that we have set all our prerequisites, let’s define a PDF processing pipeline using the unstructured-ingest library.
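The original pipeline code is not reproduced here; the sketch below shows what such a pipeline can look like with unstructured-ingest's Pipeline API. Import paths and parameter names vary between unstructured-ingest versions (the v2 prefix may not be needed in newer releases), and the chunk sizes, WORK_DIR value, and output directory are assumptions. The commented-out local-source variant mentioned later is omitted from this sketch for brevity.

```python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.fsspec.s3 import (
    S3AccessConfig,
    S3ConnectionConfig,
    S3DownloaderConfig,
    S3IndexerConfig,
)
from unstructured_ingest.v2.processes.connectors.local import LocalUploaderConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig

# Cache directory for intermediate pipeline results (reused in Step 5); assumed path
WORK_DIR = "/content/temp"

pipeline = Pipeline.from_configs(
    # General pipeline behavior: cache location, logging, re-processing, etc.
    context=ProcessorConfig(work_dir=WORK_DIR),
    # S3 source connector: where the raw 10-K PDFs live
    indexer_config=S3IndexerConfig(remote_url=AWS_S3_URI),
    downloader_config=S3DownloaderConfig(),
    source_connection_config=S3ConnectionConfig(
        access_config=S3AccessConfig(
            key=os.environ["AWS_ACCESS_KEY_ID"],
            secret=os.environ["AWS_SECRET_ACCESS_KEY"],
        )
    ),
    # Partition via the Unstructured Serverless API using the hi_res strategy
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.environ["UNSTRUCTURED_API_KEY"],
        partition_endpoint=os.environ["UNSTRUCTURED_URL"],
        strategy="hi_res",
    ),
    # Chunk by title, keeping sections together up to the max chunk size
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        chunk_max_characters=1500,
        chunk_overlap=150,
    ),
    # Embed chunks with bge-base-en-v1.5 via the langchain-huggingface integration
    embedder_config=EmbedderConfig(
        embedding_provider="langchain-huggingface",
        embedding_model_name=EMBEDDING_MODEL_NAME,
    ),
    # Write the processed, embedded documents to a local directory
    uploader_config=LocalUploaderConfig(output_dir="/content/ingest-outputs"),
)
pipeline.run()
```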
The pipeline above is constructed from multiple configs that define different aspects of its behavior:
  • ProcessorConfig: This defines the general parameters of the pipeline's behavior such as logging, cache location (this one is important in this example), re-processing, etc.
  • S3IndexerConfig, S3DownloaderConfig, and S3ConnectionConfig: These are the configs for the AWS S3 source connector.
  • PartitionerConfig: This partitions the documents into standardized JSON containing document elements and metadata. The hi_res strategy means that the Unstructured API will employ OCR, document understanding models, and classical ML methods to extract and classify document elements. This strategy is recommended for complex PDFs that contain more than just text, e.g. tables and images. Learn more about partitioning strategies.
  • ChunkerConfig: Once all of the documents are partitioned, the next step is to chunk them. The parameters in this config control the chunking behavior. We use the by_title chunking strategy to keep document sections together until the chunk size (chunk_max_characters) is reached. We specify a chunk_overlap to maintain continuity at the chunk boundaries.
  • EmbedderConfig: The final processing step is to embed the chunks. We will use the bge-base-en-v1.5 model via the langchain-huggingface integration.
  • LocalUploaderConfig: This config allows us to store the final results locally in the specified directory.
Once the pipeline finishes running, we will have the final results in *.json files in the /content/ingest-outputs local directory, as well as cached results for each of the pipeline steps in the WORK_DIR directory.
If using a local directory as the data source, uncomment the local connector imports and configs in the code and comment out the S3 connector imports and configs instead.

Step 5: Add custom metadata to the processed documents

In this step, we want to add the company name and fiscal year as custom metadata to each chunk, to enable pre-filtering during vector search. Luckily, the Form-10K documents have a more or less standard page with this information, so we can use simple regular expressions (regex) to extract the required metadata from them.
Let’s first define a function to extract the year:
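A sketch of such a function, assuming each element is a dictionary with a text field as produced by Unstructured:

```python
import re
from typing import Optional


def get_fiscal_year(elements: list) -> Optional[int]:
    """Find the fiscal year a 10-K filing was filed for, e.g.,
    'For the fiscal year ended December 31, 2023' -> 2023."""
    pattern = re.compile(r"fiscal year ended.*?(\d{4})", re.IGNORECASE)
    for element in elements:
        match = pattern.search(element.get("text", ""))
        if match:
            # Return as int so numeric comparison operators ($gte, $lte) work later
            return int(match.group(1))
    return None
```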
The Form-10K documents are filed for a specific fiscal year, and there’s a standard sentence in each document that indicates what year the document was filed for, e.g., “For the fiscal year ended December 31, 2023.” The function above takes the elements of a document as input, uses a regex to find the line containing the filing date, and extracts the year from this line. We return the year as int to be able to leverage comparison operators later.
Let’s define another function to extract the company name:
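A sketch along the same lines; the regex heuristics here are illustrative rather than the original ones, so expect to tune them against your documents:

```python
import re
from typing import Optional


def get_company_name(elements: list) -> Optional[str]:
    """Best-effort extraction of the registrant's name from a 10-K filing."""
    patterns = [
        # Name on the line following "Exact name of registrant as specified ..."
        re.compile(r"Exact name of registrant as specified.*?\n(.+)", re.IGNORECASE),
        # Name on the line preceding that phrase
        re.compile(r"(.+)\n\(?Exact name of registrant", re.IGNORECASE),
        # Name following "Commission File Number <number>"
        re.compile(r"Commission File Number[:\s]*[\d.-]+\s*\n(.+)", re.IGNORECASE),
    ]
    for element in elements:
        text = element.get("text", "")
        for pattern in patterns:
            match = pattern.search(text)
            if match:
                return match.group(1).strip()
    return None
```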
Extracting the company name is similar to how we extracted the fiscal year, except there is a little more variation in where exactly the company name is mentioned in the document, so we employ multiple regex expressions. In most cases, you will find the company name right after the line, “Exact name of registrant as specified.” However, in some cases, the company name precedes this line or follows the “Commission File Number [Number]” instead.
Now, let’s walk through the directory with the embedding results, and for each document, find the company name and year and add them as custom metadata to all elements of the document:
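A sketch of that loop; the name of the embedding step's cache subdirectory under WORK_DIR is an assumption and may differ in your unstructured-ingest version:

```python
import json
from pathlib import Path

# Directory containing the cached output of the pipeline's embedding step (assumed name)
embedding_outputs = Path(WORK_DIR) / "embed"

for json_file in embedding_outputs.glob("*.json"):
    # Each file holds the processed elements of one original PDF
    elements = json.loads(json_file.read_text())

    company = get_company_name(elements)
    year = get_fiscal_year(elements)

    # Attach the extracted values as custom metadata on every element
    for element in elements:
        element.setdefault("metadata", {})["custom_metadata"] = {
            "company": company,
            "year": year,
        }

    # Write the enriched elements back to the original file
    json_file.write_text(json.dumps(elements, indent=2))
```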
The code above reads *.json files from the working directory that contains the results of the embedding step of the pipeline. Each file is the result of processing a single document, so you have one JSON file for each original PDF document. For each file, we’re using the previously defined functions to extract the fiscal date and company name and add these values as custom metadata fields to all of the document elements. We then write the results back to the original file.

Step 6: Write the processed documents to MongoDB

To write the final processed documents to MongoDB, we will need to re-run the pipeline we defined in Step 4, except we will now change the destination from local to MongoDB.
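The source, partitioning, chunking, and embedding configs stay exactly as in Step 4; only the destination configs change. A sketch of those destination configs follows (import paths and parameter names may differ across unstructured-ingest versions):

```python
from unstructured_ingest.v2.processes.connectors.mongodb import (
    MongoDBAccessConfig,
    MongoDBConnectionConfig,
    MongoDBUploadStagerConfig,
    MongoDBUploaderConfig,
)

# Destination: the MongoDB Atlas collection set up in Step 3
destination_connection_config = MongoDBConnectionConfig(
    access_config=MongoDBAccessConfig(uri=MONGODB_URI),
    database=MONGODB_DB_NAME,
    collection=MONGODB_COLLECTION,
)
stager_config = MongoDBUploadStagerConfig()
uploader_config = MongoDBUploaderConfig(batch_size=100)
```

These are then passed to Pipeline.from_configs in place of the local uploader config from Step 4, alongside the unchanged source and processing configs.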
The pipeline above is the same as the one in Step 4, the only difference being the destination for the data processing results, which is now the MongoDB collection specified in Step 3. The MongoDB destination is set up using three configurations: MongoDBConnectionConfig, MongoDBUploadStagerConfig, and MongoDBUploaderConfig. These configurations establish the connection to our MongoDB Atlas database, specify where the data needs to be uploaded, and set any additional upload parameters such as batch_size to bulk ingest the data into MongoDB.
An example document ingested into MongoDB is as follows:
Example MongoDB document
Note the custom metadata fields metadata.custom_metadata.company and metadata.custom_metadata.year containing the company name and the fiscal year the document was filed for.

Step 7: Define graph state

As mentioned previously, we will use LangGraph to orchestrate our RAG system. LangGraph allows you to build LLM systems as graphs. The nodes of the graph are functions or tools to perform specific tasks, while the edges define routes between nodes — these can be fixed, conditional, or even cyclic. Each graph has a state which is a shared data structure that all the nodes can access and make updates to. You can define custom attributes within the state depending on what parameters you want to track across the nodes of the graph.
Let’s go ahead and define the state of our graph:
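A sketch of the state definition, using the attribute names referenced throughout this tutorial:

```python
from typing import Annotated, Dict, List
from typing_extensions import TypedDict

from langchain_core.messages import AnyMessage
from langgraph.graph.message import add_messages


class GraphState(TypedDict):
    """Shared state that every node in the graph can read and update."""

    question: str   # Current user question
    metadata: Dict  # Metadata extracted from the question (empty if none found)
    filter: Dict    # MongoDB pre-filter generated from the metadata
    context: str    # Text retrieved via vector search
    # add_messages appends new messages to the list instead of overwriting it
    memory: Annotated[List[AnyMessage], add_messages]
```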
The code above:
  • Creates a GraphState class that represents the state of the graph for our assistant.
  • Defines the state schema using TypedDict to create type-safe dictionaries, helping static type checkers catch errors before runtime.
  • Annotates the memory key in the state with the add_messages reducer function, telling LangGraph to append new messages to the existing list, rather than overwriting it.

Step 8: Define graph nodes

Next, let’s define the nodes of our graph. Nodes contain the main logic of the system. They are essentially Python functions and may or may not use an LLM. Each node takes the graph state as input, performs some computation, and returns an updated state.
Our assistant has four main functionalities:
  • Extract metadata from a natural language query.
  • Generate a MongoDB Query API filter definition.
  • Retrieve documents from MongoDB using semantic search.
  • Generate an answer to the user's question.
Let’s define nodes for each of the above functionalities.

Metadata extraction

For our assistant, extracting metadata in a specific format is crucial to avoid errors and incorrect results downstream. OpenAI recently released Structured Outputs, a feature that ensures their models will always generate responses that adhere to a specified schema. Let’s try this out!
Let’s create a Pydantic model that defines the metadata schema:
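A minimal version of that model:

```python
from typing import List

from pydantic import BaseModel


class Metadata(BaseModel):
    """Schema for metadata extracted from a user question."""

    company: List[str]  # Company names mentioned in the question
    year: List[str]     # Years or year ranges mentioned in the question
```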
In the above code, we create a Pydantic model called Metadata with company and year as attributes. Each attribute is of type List[str], indicating a list of strings.
Now, let’s define a function to extract metadata from the user question:
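A sketch of that node; the list of companies is a placeholder and the prompt wording is illustrative:

```python
from openai import OpenAI

openai_client = OpenAI()

# Companies present in the knowledge base (placeholder; use your dataset's actual list)
COMPANIES = ["..."]


def extract_metadata(state: GraphState) -> dict:
    """Extract company names and years from the user question using Structured Outputs."""
    question = state["question"]

    system_prompt = (
        "Extract company names and years (or year ranges) mentioned in the user question. "
        f"Only use company names from this list: {COMPANIES}. "
        "If no company or year is mentioned, return empty lists."
    )

    completion = openai_client.beta.chat.completions.parse(
        model=LLM,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        # Structured Outputs: the response is parsed into the Metadata Pydantic model
        response_format=Metadata,
    )
    parsed = completion.choices[0].message.parsed

    # No metadata found: downstream routing will skip filter generation
    if not parsed.company and not parsed.year:
        return {"metadata": {}}

    # Keys mirror the metadata field paths in our MongoDB documents
    return {
        "metadata": {
            "metadata.custom_metadata.company": parsed.company,
            "metadata.custom_metadata.year": parsed.year,
        }
    }
```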
The code above:
  • Gets the user question from the graph state (state).
  • Defines a system prompt instructing the LLM to extract company names and date ranges from the question. We specifically ask the LLM to match company names to the list of companies in our dataset. Otherwise, the pre-filter won’t match any records in our knowledge base.
  • Passes the Metadata Pydantic model as the response_format argument to the parse() method of OpenAI’s Chat Completions API to get the LLM’s response back as a Metadata object.
  • Sets the metadata attribute of the graph state to {}, if no metadata is extracted by the LLM.
  • Sets the metadata attribute to a dictionary consisting of keys corresponding to metadata fields in our MongoDB documents — i.e., metadata.custom_metadata.company and metadata.custom_metadata.year — and values obtained from the LLM’s response, if metadata is found.

Filter generation

Next, let’s define a function to generate the filter definition:
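A sketch of that node; the prompt wording and the set of allowed operators shown here are illustrative:

```python
import json


def generate_filter(state: GraphState) -> dict:
    """Turn the extracted metadata into a MongoDB Query API pre-filter definition."""
    metadata = state["metadata"]

    system_prompt = (
        "Generate a MongoDB filter definition as a JSON object from the provided metadata. "
        "Use the metadata keys exactly as field names. "
        "Only use the operators $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, and $or. "
        "Use $in for lists of company names, comparison operators for year ranges, "
        "and represent years as integers."
    )

    completion = openai_client.chat.completions.create(
        model=LLM,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Metadata: {metadata}"},
        ],
        # JSON mode: the model is constrained to emit valid JSON
        response_format={"type": "json_object"},
    )
    return {"filter": json.loads(completion.choices[0].message.content)}
```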
The code above:
  • Extracts metadata from the graph state.
  • Defines a system prompt containing instructions on how to create the filter definition — match expressions and operators supported, fields to use, etc.
  • Gets the output of the specified LLM as a JSON object by passing json_object as the response_format argument to the Chat Completions API call as well as specifying the word JSON in the system prompt.
  • Sets the filter attribute of the graph state to the generated filter definition.
The next step is to define a graph node that retrieves documents from MongoDB using semantic search. But first, you will need to create a vector search index:
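A sketch of the index creation using PyMongo; the embeddings path and the index name (vector_index) are assumptions that must match the field names in your ingested documents:

```python
from pymongo.operations import SearchIndexModel

VS_INDEX_NAME = "vector_index"

# Index definition: the embeddings field plus the metadata fields used as pre-filters
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embeddings",
                "numDimensions": 768,
                "similarity": "cosine",
            },
            {"type": "filter", "path": "metadata.custom_metadata.company"},
            {"type": "filter", "path": "metadata.custom_metadata.year"},
        ]
    },
    name=VS_INDEX_NAME,
    type="vectorSearch",
)
collection.create_search_index(model=index_model)
```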
The code above:
  • Specifies the collection to perform semantic search against.
  • Specifies the name of the vector search index.
  • Specifies the vector search index definition which contains the path to the embeddings field in the documents (path), the number of embedding dimensions (numDimensions), the similarity metric to find nearest neighbors (similarity), and the path to the filter fields to index.
  • Creates the vector search index.
Now, let’s define a function that performs vector search:
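A sketch of that node; the 0.8 relevance threshold follows the description below, while the numCandidates and limit values are assumptions:

```python
from sentence_transformers import SentenceTransformer

# Load the same embedding model used during ingestion
embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)


def vector_search(state: GraphState) -> dict:
    """Retrieve relevant chunks from MongoDB using (optionally pre-filtered) vector search."""
    question = state["question"]
    # Fall back to an empty filter if metadata extraction found nothing
    filter_definition = state.get("filter") or {}

    # Embed the user question with bge-base-en-v1.5
    query_embedding = embedding_model.encode(question).tolist()

    vector_search_stage = {
        "index": VS_INDEX_NAME,
        "path": "embeddings",
        "queryVector": query_embedding,
        "numCandidates": 150,
        "limit": 5,
    }
    # Apply the pre-filter only when one was generated
    if filter_definition:
        vector_search_stage["filter"] = filter_definition

    pipeline = [
        {"$vectorSearch": vector_search_stage},
        # Return only the chunk text and its similarity score
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
    results = collection.aggregate(pipeline)

    # Keep only highly relevant chunks (score >= 0.8) and concatenate their text
    context = "\n\n".join(doc["text"] for doc in results if doc["score"] >= 0.8)
    return {"context": context}
```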
The code above:
  • Reads the user question and filter definition from the graph state.
  • Sets the filter to an empty object if none is found. Remember, we skip directly to vector search if no metadata is extracted.
  • Embeds the user query using the bge-base-en-v1.5 model.
  • Creates and runs an aggregation pipeline to perform semantic search. The $vectorSearch stage retrieves documents using nearest neighbor search — the numCandidates field indicates how many nearest neighbors to consider, the limit field indicates how many documents to return, and the filter field specifies the pre-filters. The $project stage that follows outputs only the specified fields — i.e., text and the vector search similarity score from the returned documents.
  • Filters out documents with a similarity score of < 0.8 to ensure only highly relevant documents are returned.
  • Concatenates the text field values from the final list of documents into a string and sets it as the context attribute of the graph state.

Answer generation

Let’s define the final node in our graph to generate answers to questions.
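A sketch of that node; the prompt wording is illustrative:

```python
from langchain_core.messages import AIMessage, HumanMessage


def generate_answer(state: GraphState) -> dict:
    """Answer the user question using only the retrieved context and chat history."""
    question = state["question"]
    context = state["context"]
    history = state["memory"]

    system_prompt = (
        "Answer the question based only on the provided context. "
        "If the context does not contain the answer, say that you don't know."
    )
    user_message = f"Context:\n{context}\n\nQuestion: {question}"

    # Convert prior turns from LangChain messages to OpenAI-style dicts
    role_map = {"human": "user", "ai": "assistant"}
    history_messages = [
        {"role": role_map.get(m.type, "user"), "content": m.content} for m in history
    ]

    completion = openai_client.chat.completions.create(
        model=LLM,
        messages=[{"role": "system", "content": system_prompt}]
        + history_messages
        + [{"role": "user", "content": user_message}],
    )
    answer = completion.choices[0].message.content

    # add_messages appends these to the existing conversation memory
    return {"memory": [HumanMessage(content=user_message), AIMessage(content=answer)]}
```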
The code above:
  • Reads the user question, context retrieved from vector search, and any conversational history from the graph state.
  • Defines a system prompt to instruct the LLM to only answer based on the provided context. This is a simple step to mitigate hallucinations.
  • Passes the system prompt, retrieved context, chat history, and user question to OpenAI’s Chat Completions API to get back a response from the specified LLM.
  • Updates the memory attribute of the graph state by adding the retrieved context and LLM response to the existing list of messages.

Step 9: Define conditional edges

As mentioned previously, edges in a LangGraph graph define interactions between nodes. Our graph mostly has fixed edges between nodes, except one conditional edge to skip filter generation and go directly to vector search based on whether or not metadata was extracted from the user query.
Let’s define the logic for this conditional edge:
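A minimal version of that routing function (the function name is an assumption):

```python
def route_to_filter_generation_or_search(state: GraphState) -> str:
    """Skip filter generation when no metadata was extracted from the question."""
    if not state.get("metadata"):
        return "vector_search"
    return "generate_filter"
```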
In the above code, we extract the metadata attribute of the graph state. If no metadata is found, the function returns the string "vector_search". Otherwise, it returns the string "generate_filter". We will see how to tie the outputs of this routing function to nodes in the next step.

Step 10: Build the graph

Finally, let’s build the graph by connecting the nodes using edges.
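A sketch of the graph assembly:

```python
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph

# Parameterize the graph with the state defined in Step 7
graph = StateGraph(GraphState)

# Nodes: name first, then the function implementing it
graph.add_node("extract_metadata", extract_metadata)
graph.add_node("generate_filter", generate_filter)
graph.add_node("vector_search", vector_search)
graph.add_node("generate_answer", generate_answer)

# Fixed edges
graph.add_edge(START, "extract_metadata")
graph.add_edge("generate_filter", "vector_search")
graph.add_edge("vector_search", "generate_answer")
graph.add_edge("generate_answer", END)

# Conditional edges: the routing function's return value picks the next node
graph.add_conditional_edges(
    "extract_metadata",
    route_to_filter_generation_or_search,
    {"generate_filter": "generate_filter", "vector_search": "vector_search"},
)

# In-memory checkpointer to persist the graph state (chat history) between runs
app = graph.compile(checkpointer=MemorySaver())
```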
The code above:
  • Instantiates the graph using the StateGraph class and parameterizes it with the graph state defined in Step 7.
  • Initializes an in-memory checkpointer using the MemorySaver() class. Checkpointers persist the graph state between interactions in LangGraph. This is how our RAG system persists chat history.
  • Adds all the nodes defined in Step 8 to the graph using the add_node method. The first argument to add_node is the name of the node and the second argument is the Python function corresponding to the node.
  • Adds fixed edges using the add_edge method. Notice the edges between START and extract_metadata, and between generate_answer and END. START and END are special nodes that indicate which node to call first and which nodes have no following actions.
  • Adds conditional edges from the extract_metadata node to the generate_filter and vector_search nodes based on the outputs of the routing function defined in Step 9.
  • Compiles the graph with the memory checkpointer.
We can visualize our graph as a Mermaid diagram. This helps to verify the nodes and edges, especially for complicated workflows.
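For example, in a notebook:

```python
from IPython.display import Image, display

# Render the compiled graph as a Mermaid diagram (PNG) for a quick sanity check
display(Image(app.get_graph().draw_mermaid_png()))
```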
In the above code, we use the get_graph() method to get the compiled graph and the draw_mermaid_png method to generate a Mermaid diagram of the graph as a PNG image.
Here’s what our graph looks like:
Mermaid diagram of a LangGraph graph

Step 11: Execute the graph

As the last step, let’s define a function that takes user input and streams the outputs of the graph execution back to the user.
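A sketch of that function; the stream mode and the way the final answer is pulled out of the last node's update are assumptions:

```python
from langchain_core.messages import HumanMessage


def execute_graph(question: str, thread_id: str) -> None:
    """Run the graph for a user question and stream node-by-node updates."""
    # Initial state updates: the question itself, also recorded in conversation memory
    inputs = {"question": question, "memory": [HumanMessage(content=question)]}

    # Thread IDs scope the checkpointer's saved state; reuse one to continue a conversation
    config = {"configurable": {"thread_id": thread_id}}

    final_update = None
    for output in app.stream(inputs, config, stream_mode="updates"):
        for node_name, node_update in output.items():
            print(f"Completed node: {node_name}")
            final_update = node_update

    # The last update comes from the generate_answer node; its final memory
    # message is the assistant's answer
    print("\n---FINAL ANSWER---")
    print(final_update["memory"][-1].content)
```

Reusing the same thread_id in a follow-up call picks up the same checkpoint, which is how the assistant recalls earlier turns in the example runs below.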
The code above:
  • Takes the user question (question) and thread ID (thread_id) as input.
  • Creates an inputs dictionary containing the initial updates to the graph state. In our case, we want to add the user question to the question and memory attributes of the graph state.
  • Creates a runtime config containing the input thread ID. Thread IDs identify conversation threads in LangGraph's checkpointer; each thread has its own persisted state.
  • Streams the outputs of the graph, specifically the node names and outputs.
  • Prints the final answer from the system, extracted from the return value of the last node in the graph — i.e. the generate_answer node.
An example run of the system looks as follows:
Let’s ask a follow-up question to ensure that the graph persists chat history:
The system was able to recall what we asked previously.
There you go! We have successfully built a RAG system with self-querying retrieval and memory.

Conclusion

In this tutorial, we learned how to build an advanced RAG system that leverages self-querying retrieval. These techniques are useful in scenarios where user queries might contain important keywords that embeddings might not capture. We built an investment assistant that extracts company names and date ranges from user queries, uses them to generate MongoDB filter definitions, and applies them as pre-filters during vector search to get the most accurate information for an LLM to answer questions about the financial performance of specific companies during certain time periods.
For more advanced content, such as building agentic systems, check out the following tutorials on our Developer Center: