Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain

Richmond Alake, Apoorva Joshi15 min read • Published Aug 13, 2024 • Updated Aug 13, 2024

Introduction

Retrieval-augmented generation (RAG) is an architectural design pattern prevalent in modern AI applications that provides generative AI functionality. RAG has gained adoption within generative AI applications because it grounds the responses and outputs of large language models (LLMs) in relevant, factual, and up-to-date information. The key contribution of RAG is supplementing the parametric knowledge of the LLM with non-parametric knowledge retrieved from external data sources, enabling adequate responses to user queries.
Modern AI applications that leverage LLMs and generative AI require more than effective response capabilities. AI engineers and developers should consider two other functionalities before moving RAG applications to production. Semantic caching and memory are two important capabilities for generative AI applications that extend the usefulness of modern AI applications by reducing infrastructure costs and response latency and by persisting conversation history.
Semantic caching is a process that utilizes a data store to keep a record of the queries and their results based on the semantics or context within the queries themselves.
This means that, as opposed to a traditional cache that caches data based on exact matches of data requests or specific identifiers, a semantic cache understands and leverages the meaning and relationships inherent in the data. Within an LLM or RAG application, user queries that are either exact matches of, or contextually similar to, previously cached queries benefit from an efficient information retrieval process.
Take, for example, an e-commerce platform's customer support chatbot; integrating semantic caching enables the system to respond to inquiries by understanding the context behind user queries. So, whether a customer asks about the "best smartphone for night photography" or "a phone for night photos," the chatbot can leverage its semantic cache to pull relevant, previously stored responses, improving both the efficiency and relevance of its answers.
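As a minimal, illustrative sketch of the idea, a semantic cache compares query embeddings rather than raw strings. The embed function and the 0.9 threshold below are hypothetical stand-ins for a real embedding model and a tuned similarity threshold:

import numpy as np

def semantic_lookup(query, cache, embed, threshold=0.9):
    """Return a cached answer whose stored query is semantically close to the new query."""
    query_vector = embed(query)
    for cached_query, cached_answer in cache.items():
        cached_vector = embed(cached_query)
        # Cosine similarity between the new query and the cached query
        score = np.dot(query_vector, cached_vector) / (
            np.linalg.norm(query_vector) * np.linalg.norm(cached_vector)
        )
        if score >= threshold:
            return cached_answer
    return None  # Cache miss: call the LLM and store the new result

# An exact-match cache only hits on the identical string; semantic_lookup can also
# hit on "a phone for night photos" when the cached key is the query below.
cache = {"best smartphone for night photography": "Cached LLM answer..."}

An exact-match lookup on the same dictionary would miss any rewording of the question.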
LLM-powered chatbot interfaces are now prevalent in generative AI applications. Still, the conversations held between LLM and application users must be stored and retrieved to create a coherent and contextually relevant interaction history. The benefits of having a reference of interaction history lie in providing additional context to LLMs, understanding previously held conversations, improving the personalization of GenAI applications, and enabling the chatbot to provide more accurate responses to queries.
MongoDB Atlas vector search capabilities enable the creation of a semantic cache, and the new LangChain-MongoDB integration makes integrating this cache in RAG applications easier. The LangChain-MongoDB integration also makes implementing a conversation store for interactions with RAG applications easier.
Here's what’s covered in this tutorial:
  • How to implement memory and storage of conversation history using LangChain and MongoDB
  • How to implement semantic cache using LangChain and MongoDB
  • Overview of semantic cache and memory utilization within RAG applications
The following GitHub repository contains all implementations presented in this tutorial, along with other use cases and examples of RAG implementations.

Step 1: Installing required libraries

This section guides you through the installation process of the essential libraries needed to implement the RAG application, complete with memory and history capabilities, within your current development environment. Here is the list of required libraries:
  • datasets: Python library to get access to datasets available on Hugging Face Hub
  • langchain: Python toolkit for LangChain
  • langchain-mongodb: Python package to use MongoDB as a vector store, semantic cache, chat history store, etc., in LangChain
  • langchain-openai: Python package to use OpenAI models with LangChain
  • pymongo: Python toolkit for MongoDB
  • pandas: Python library for data analysis, exploration, and manipulation
! pip install -qU datasets langchain langchain-mongodb langchain-openai pymongo pandas
Note that this tutorial utilizes OpenAI embedding and chat models. To access the models, ensure you have an OpenAI API key.
In your development environment, create a reference to the OpenAI API key.
import getpass
OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key:")

Step 2: Database setup

To handle the requirements for equipping the RAG application with the capabilities of storing interaction or conversation history and a semantic cache, two new collections must be created alongside the collection that will hold the main application data.
Creating a database and collection within MongoDB is made simple with MongoDB Atlas.
  1. Register a free Atlas account or sign in to your existing Atlas account.
  2. Follow the instructions (select Atlas UI as the procedure) to deploy your first cluster. 
  3. Create the database: `langchain_chatbot`.
  4. Within the `langchain_chatbot` database, create the following collections: 
    • data : Holds all data that acts as a knowledge source for the chatbot.
    • history : Holds all conversations held between the chatbot and the application user.
    • semantic_cache : Holds all queries made to the chatbot along with their LLM responses.
  5. Create a vector search index named vector_index for the data collection. This index enables the RAG application to retrieve records as additional context to supplement user queries via vector search. Below is the JSON definition of the data collection vector search index. 
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
6. Create a vector search index with a text filter named vector_index for the semantic_cache collection. This index enables the RAG application to retrieve responses to queries semantically similar to a current query asked by the application user. Below is the JSON definition of the semantic_cache collection vector search index.
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "llm_string",
      "type": "filter"
    }
  ]
}
By the end of this step, you should have a database with three collections and two defined vector search indexes. The final step of this section is to obtain the connection URI string to the created Atlas cluster to establish a connection between the databases and the current development environment. Follow the steps to get the connection string from the Atlas UI. 
In your development environment, create a reference to the MongoDB URI string.
MONGODB_URI = getpass.getpass("Enter your MongoDB connection string:")
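If you prefer to create the two vector search indexes defined above programmatically rather than through the Atlas UI, a sketch along the following lines should work, assuming a recent PyMongo version that supports create_search_index with Atlas Vector Search definitions:

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(MONGODB_URI)
db = client["langchain_chatbot"]

# Vector field shared by both index definitions
vector_field = {"numDimensions": 1536, "path": "embedding", "similarity": "cosine", "type": "vector"}

# Index for the data collection
db["data"].create_search_index(
    model=SearchIndexModel(
        definition={"fields": [vector_field]},
        name="vector_index",
        type="vectorSearch",
    )
)

# Index for the semantic_cache collection, with the llm_string filter field
db["semantic_cache"].create_search_index(
    model=SearchIndexModel(
        definition={"fields": [vector_field, {"path": "llm_string", "type": "filter"}]},
        name="vector_index",
        type="vectorSearch",
    )
)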

Step 3: Download and prepare the dataset

This tutorial uses MongoDB’s embedded_movies dataset. Each data point corresponds to a particular movie and captures fields such as the plot, genre, cast, and runtime. After loading the dataset into the development environment, it is converted into a pandas DataFrame, which enables data structure manipulation and analysis with relative ease.
from datasets import load_dataset
import pandas as pd

data = load_dataset("MongoDB/embedded_movies")
df = pd.DataFrame(data["train"])

# Only keep records where the fullplot field is not null
df = df[df["fullplot"].notna()]

# Renaming the embedding field to "embedding" -- required by LangChain
df.rename(columns={"plot_embedding": "embedding"}, inplace=True)
The code above executes the following operations:
  • Import the load_dataset function from the datasets library, which loads the appropriate dataset for this tutorial by specifying its path. The full dataset is loaded into the environment and referenced by the variable data.
  • Only the dataset's train partition is required; the variable df holds a reference to the dataset training partition as a pandas DataFrame.
  • The DataFrame is filtered to only keep records where the fullplot field is not null. This step ensures that any subsequent operations or analyses that rely on the fullplot field, such as the embedding process, will not be hindered by missing data. The filtering process uses pandas' notna() method to check for non-null entries in the fullplot column.
  • The column plot_embedding in the DataFrame is renamed to embedding. This step is necessary for compatibility with LangChain, which requires an input field named embedding.
By the end of the operations in this section, we have a full dataset that acts as a knowledge source for the chatbot and is ready to be ingested into the data collection in the langchain_chatbot database.
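As an optional sanity check before ingestion, you can confirm that the pre-computed embeddings have 1,536 dimensions, matching the numDimensions value used in the vector search index:

# Optional sanity check on the prepared DataFrame
print(df.shape)
print(len(df["embedding"].iloc[0]))  # Expected: 1536, matching the index definition
df.head()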

Step 4: Create a naive RAG chain with MongoDB Vector Store 

Before adding chat history and caching, let’s first see how to create a simple RAG chain using LangChain, with MongoDB as the vector store. Here’s what the workflow looks like:
Naive RAG workflow
The user question is embedded, and relevant documents are retrieved from the MongoDB vector store. The retrieved documents, along with the user query, are passed as a prompt to the LLM, which generates an answer to the question.
Let’s first ingest data into a MongoDB collection. We will use this collection as the vector store for our RAG chain.
from pymongo import MongoClient

# Initialize MongoDB python client
client = MongoClient(MONGODB_URI)

DB_NAME = "langchain_chatbot"
COLLECTION_NAME = "data"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
collection = client[DB_NAME][COLLECTION_NAME]
The code above creates a MongoDB client and defines the database langchain_chatbot and collection data where we will store our data. Remember, you will also need to create a vector search index to efficiently retrieve data from the MongoDB vector store, as documented in Step 2 of this tutorial. To do this, refer to our official vector search index creation guide.
While creating the vector search index for the data collection, ensure that it is named vector_index and that the index definition looks as follows:
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
NOTE: We set numDimensions to 1536 because we use OpenAI’s text-embedding-ada-002 model to create embeddings.
Next, we delete any existing documents from the `data` collection and ingest our data into it:
# Delete any existing records in the collection
collection.delete_many({})

# Data Ingestion
records = df.to_dict('records')
collection.insert_many(records)

print("Data ingestion into MongoDB completed")
Ingesting data into a MongoDB collection from a pandas DataFrame is a straightforward process. We first convert the DataFrame to a list of dictionaries and then utilize the insert_many method to bulk ingest documents into the collection.
With our data in MongoDB, let’s use it to construct a vector store for our RAG chain:
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Using the text-embedding-ada-002 since that's what was used to create embeddings in the movies dataset
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-ada-002")

# Vector Store Creation
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string=MONGODB_URI,
    namespace=DB_NAME + "." + COLLECTION_NAME,
    embedding=embeddings,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    text_key="fullplot",
)
We use the from_connection_string method of the MongoDBAtlasVectorSearch class from the langchain_mongodb integration to create a MongoDB vector store from a MongoDB connection URI. The from_connection_string method takes the following arguments:
  • connection_string: MongoDB connection URI
  • namespace: A valid MongoDB namespace (database and collection)
  • embedding: Embedding model to use to generate embeddings for a vector search
  • index_name: MongoDB Atlas vector search index name
  • text_key: Field in the ingested documents that contains the text
The next step is to use the MongoDB vector store as a retriever in our RAG chain. In LangChain, a retriever is an interface that returns documents given a query. You can use a vector store as a retriever by using the as_retriever method:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})
as_retriever can take arguments such as search_type — i.e., what metric to use to retrieve documents. Here, we choose similarity since we want to retrieve the most similar documents to a given query. We can also specify additional search arguments such as k — i.e., the number of documents to retrieve. In our example, we set it to 5, which means the 5 most similar documents will be retrieved for a given query.
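Assuming a recent LangChain version in which retrievers are Runnables, you can optionally try the retriever on its own before wiring it into a chain:

# Optional: inspect what the retriever returns for a sample query
docs = retriever.invoke("What is the best movie to watch when sad?")
for doc in docs:
    print(doc.page_content[:100], "\n---")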
The final step is to put all of these pieces together to create a RAG chain. 
NOTE: Chains in LangChain are a sequence of calls either to an LLM, a tool, or a data processing step. The recommended way to compose chains in LangChain is using the LangChain Expression Language (LCEL). Each component in a chain is referred to as a Runnable and can be invoked, streamed, etc., independently of other components in the chain.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Generate context using the retriever, and pass the user question through
retrieve = {"context": retriever | (lambda docs: "\n\n".join([d.page_content for d in docs])), "question": RunnablePassthrough()}
template = """Answer the question based only on the following context: \
{context}

Question: {question}
"""
# Defining the chat prompt
prompt = ChatPromptTemplate.from_template(template)
# Defining the model to be used for chat completion
model = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
# Parse output as a string
parse_output = StrOutputParser()

# Naive RAG chain
naive_rag_chain = (
    retrieve
    | prompt
    | model
    | parse_output
)
The code snippet above does the following:
  • Defines the retrieve component: It takes the user input (a question) and sends it to the retriever to obtain similar documents. It also formats the output to match the input format expected by the next Runnable, which in this case is a dictionary with context and question as keys. The RunnablePassthrough() call for the question key indicates that the user input is simply passed through to the next stage under the question key.
  • Defines the prompt component: It crafts a prompt by populating a prompt template with the context and question from the retrieve stage.
  • Defines the model component: This specifies the chat model to use. We use OpenAI — unless specified otherwise, the gpt-3.5-turbo model is used by default.
  • Defines the parse_output component: A simple output parser parses the result from the LLM into a string.
  • Defines a naive_rag_chain: It uses LCEL pipe ( | ) notation to chain together the above components.
Let’s test out our chain by asking a question. We do this using the `invoke()` method, which is used to call a chain on an input:
naive_rag_chain.invoke("What is the best movie to watch when sad?")

Output: Once a Thief
NOTE: With complex chains, it can be hard to tell whether or not information is flowing through them as expected. We highly recommend using LangSmith for debugging and monitoring in such cases. Simply grab an API key and add the following lines to your code to view traces in the LangSmith UI:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
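If you are working in a notebook, an equivalent setup in Python (rather than shell exports) might look like this:

import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API key:")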

Step 5: Create a RAG chain with chat history

Now that we have seen how to create a simple RAG chain, let’s see how to add chat message history to it and persist it in MongoDB. The workflow for this chain looks something like this:
RAG with chat history workflow
The user question and the chat message history (if available) are first used to craft a standalone question to retrieve documents from the vector store. The user question, along with the retrieved documents and the chat history, are then passed as inputs to an LLM to generate an answer.
Let’s first define a function to get the chat message history for a particular session from MongoDB:
from langchain_mongodb.chat_message_histories import MongoDBChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts import MessagesPlaceholder

def get_session_history(session_id: str) -> MongoDBChatMessageHistory:
    return MongoDBChatMessageHistory(MONGODB_URI, session_id, database_name=DB_NAME, collection_name="history")
The get_session_history function takes in a session_id and returns an instance of MongoDBChatMessageHistory, which contains the chat history for that session, retrieved from the history collection of the langchain_chatbot database defined previously.
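Optionally, you can exercise get_session_history directly to see how messages are persisted; the session ID "test" below is just a throwaway value:

# Optional: seed and inspect a throwaway session, then clean it up
history = get_session_history("test")
history.add_user_message("Hello!")
history.add_ai_message("Hi! What kind of movie are you in the mood for?")
print(history.messages)
history.clear()  # Remove the test messages from the history collection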
Next, let’s define a question_chain Runnable, which takes the chat message history and a follow-up question as input and creates a standalone question that can be passed as input to the retriever.
standalone_system_prompt = """
Given a chat history and a follow-up question, rephrase the follow-up question to be a standalone question. \
Do NOT answer the question, just reformulate it if needed, otherwise return it as is. \
Only return the final standalone question. \
"""
standalone_question_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", standalone_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)

question_chain = standalone_question_prompt | model | parse_output
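Optionally, you can invoke question_chain on its own with a hand-crafted history to see the kind of standalone question it produces; the messages below are made up for illustration, and the exact rephrasing will vary:

from langchain_core.messages import HumanMessage, AIMessage

question_chain.invoke(
    {
        "question": "How about something more light?",
        "history": [
            HumanMessage(content="What is the best movie to watch when sad?"),
            AIMessage(content="You might enjoy Last Action Hero."),
        ],
    }
)
# Example output (will vary): 'Can you recommend a lighter movie to watch when feeling sad?'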
The retriever chain here looks a bit different from the one in Step 4:
retriever_chain = RunnablePassthrough.assign(context=question_chain | retriever | (lambda docs: "\n\n".join([d.page_content for d in docs])))
Notice that the retriever chain here uses RunnablePassthrough.assign(...). This adds the additional arguments passed to the assign function to the user input before passing it to the next component. This example will add the context obtained from the retriever to the user input before passing it to the prompt.
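As a small, self-contained illustration of this behavior (separate from the RAG chain itself), assign passes the input dictionary through and merges in the assigned keys:

# Illustration only: the input dict is passed through with "context" added to it
demo = RunnablePassthrough.assign(context=lambda x: "retrieved context for: " + x["question"])
demo.invoke({"question": "What is the best movie to watch when sad?", "history": []})
# -> {'question': '...', 'history': [], 'context': 'retrieved context for: ...'}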
The prompt, in this case, consists of the user’s question, the retrieved context, and the session’s chat history:
rag_system_prompt = """Answer the question based only on the following context: \
{context}
"""
rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)
Finally, we create a RAG chain and wrap it under RunnableWithMessageHistory, which is a Runnable that manages chat message history for another Runnable, including updates.
# RAG chain
rag_chain = (
    retriever_chain
    | rag_prompt
    | model
    | parse_output
)

# RAG chain with history
with_message_history = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)
The parameters to the RunnableWithMessageHistory Runnable above are as follows:
  • rag_chain: Base runnable to wrap
  • get_session_history: The function that returns the chat message history for a given session ID
  • input_messages_key: Key for input messages if the base runnable accepts a dictionary
  • history_messages_key: Key for historical messages if the base runnable expects a message history
Let’s test out our RAG chain with history by asking a question:
with_message_history.invoke({"question": "What is the best movie to watch when sad?"}, {"configurable": {"session_id": "1"}})

Output: The best movie to watch when feeling down could be "Last Action Hero." It's a fun and action-packed film that blends reality and fantasy, offering an escape from the real world and providing an entertaining distraction.

with_message_history.invoke({"question": "How about something more light?"}, {"configurable": {"session_id": "1"}})

Output: For a lighter movie option, you might enjoy "Cousins." It's a comedy film set in Barcelona with action and humor, offering a fun and entertaining escape from reality. The storyline is engaging and filled with comedic moments that could help lift your spirits.
Note the arguments passed to the invoke calls above. The first argument is a dictionary with a question key containing the user input. This key should match the value of the input_messages_key argument passed to the RunnableWithMessageHistory Runnable. The second argument is used to create a new or look up an existing chat message history that matches the specified session_id.
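To verify that the conversation was persisted, you can peek at the history collection using the PyMongo client created earlier:

# Optional: inspect the documents written to the history collection
for doc in client[DB_NAME]["history"].find().limit(4):
    print(doc)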

Step 6: Get faster responses using semantic cache

RAG chain with semantic cache integration
Adding a semantic cache to the current RAG application is easy with the LangChain-MongoDB integration. This section covers the process of adding semantic caching to the RAG application, effectively reducing operational costs by storing the results of user questions and the questions themselves.
This allows the application to reference the results when the user asks a semantically similar question in the future, thereby reducing the number of API calls made to the LLM provider. The advantage of semantic caching is that it reduces both the latency and cost of the overall RAG application.
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
from langchain_core.globals import set_llm_cache

set_llm_cache(MongoDBAtlasSemanticCache(
    connection_string=MONGODB_URI,
    embedding=embeddings,
    collection_name="semantic_cache",
    database_name=DB_NAME,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    wait_until_ready=True  # Optional, waits until the cache is ready to be used
))
In the code above, the following operations are executed:
  • Import the MongoDBAtlasSemanticCache class, which provides the interface for connecting to MongoDB and creating the semantic cache store. MongoDBAtlasSemanticCache provides methods to initialize the cache, look up cached responses based on query prompts and LLM strings, update the cache with new responses, and clear cached entries either entirely or based on specific criteria. 
  • Import the set_llm_cache function, which sets the LLM cache for the current application environment. 
  • Configure the semantic cache by passing an instance of MongoDBAtlasSemanticCache with the appropriate arguments to the set_llm_cache function. 
With the semantic cache now configured for our RAG application, it is possible to monitor the query execution times. This observation will highlight the efficiency gains and benefits created by implementing the semantic cache.
%%time
naive_rag_chain.invoke("What is the best movie to watch when sad?")
The code above invokes the RAG chain with the query, “What is the best movie to watch when sad?” The %%time command enables the measurement of the duration of a specific operation to be computed within a Jupyter Notebook or IPython environment.
As shown above, the first invocation of the chain with the query executes in 1.24 seconds. The response block below shows this.
CPU times: user 87.8 ms, sys: 670 µs, total: 88.5 ms
Wall time: 1.24 s
'Once a Thief'
To confirm the semantic cache is working as expected, invoke the RAG chain with a semantically similar query and check for a reduced response time.
%%time
naive_rag_chain.invoke("Which movie do I watch when sad?")
The output below shows that the semantically similar query has a significantly reduced response time.
CPU times: user 43.5 ms, sys: 4.16 ms, total: 47.7 ms
Wall time: 255 ms
'Once a Thief'
NOTE: Semantic cache only caches the input to the LLM. When using it in retrieval chains, remember that documents retrieved can change between runs, resulting in cache misses for semantically similar queries.
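As noted earlier, the cache exposes methods to clear cached entries. If you change the LLM or prompt template, a sketch like the following clears the cache so stale responses are not served:

from langchain_core.globals import get_llm_cache

# Clear all entries from the semantic cache
get_llm_cache().clear()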

Conclusion

The RAG architectural design pattern has been shown to help generative AI applications deliver more relevant and up-to-date answers to user queries.
However, a robust modern application requires more than just a naive RAG implementation. To enhance how applications interact with users, AI engineers and builders should address factors such as personalization, conversation history, and response time as early as the POC or demo phase of development.
This tutorial has demonstrated how straightforward it is to integrate semantic caching and memory into RAG applications when facilitated by MongoDB and LangChain. Step-by-step instructions have been provided to guide the implementation of a RAG application; the creation of databases, collections, and indexes; and the utilization of LangChain to develop a RAG chain and application.
Following the steps outlined in this tutorial, developers can leverage the MongoDB Atlas Vector Search capabilities and the seamless LangChain-MongoDB integration to create more responsive, efficient, and intelligent RAG applications. The combination of semantic caching and conversation history reduces operational costs and significantly enhances the accuracy and personalization of responses generated by LLMs.
All the implementation code presented in the tutorial can be accessed in the repository.

FAQs

1. What is retrieval-augmented generation (RAG)? RAG is a design pattern in AI applications that enhances the capabilities of large language models (LLMs) by grounding their responses with relevant, factual, and up-to-date information. This is achieved by supplementing LLMs' parametric knowledge with non-parametric knowledge, enabling the generation of more accurate and contextually relevant responses.
2. How does integrating memory and chat history enhance RAG applications? Integrating memory and chat history into RAG applications allows for the retention and retrieval of past interactions between the large language model (LLM) and users. This functionality enriches the model's context awareness, enabling it to generate responses that are relevant to the immediate query and reflect the continuity and nuances of ongoing conversations. By maintaining a coherent and contextually relevant interaction history, RAG applications can offer more personalized and accurate responses, significantly enhancing the user experience and the application's overall effectiveness.
3. Why is semantic caching important in RAG applications? Semantic caching stores the results of user queries and their associated responses based on the query's semantics. This approach allows for efficient information retrieval when semantically similar queries are made in the future, reducing API calls to LLM providers and lowering both latency and operational costs.
4. How does MongoDB Atlas support RAG applications? MongoDB Atlas offers vector search capabilities, making it easier to implement semantic caches and conversation stores within RAG applications. This integration facilitates the efficient retrieval of semantically similar queries and the storage of interaction histories, enhancing the application's overall performance and user experience.
5. How can semantic caching reduce query execution times in RAG applications? RAG applications can quickly retrieve cached answers for semantically similar queries without recomputing them by caching responses to queries based on their semantic content. This significantly reduces the time to generate responses, as demonstrated by the decreased query execution times upon subsequent similar queries.
6. What benefits does the LangChain-MongoDB integration offer? This integration simplifies the process of adding semantic caching and memory capabilities to RAG applications. It enables the efficient management of conversation histories and the implementation of semantic caches using MongoDB's powerful vector search features, leading to improved application performance and user experience.
7. How does one measure the impact of semantic caching on a RAG application? By monitoring query execution times before and after implementing semantic caching, developers can observe the efficiency gains the cache provides. A noticeable reduction in execution times for semantically similar queries indicates the cache's effectiveness in improving response speeds and reducing operational costs.