Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain
Richmond Alake, Apoorva Joshi • 15 min read • Published Aug 13, 2024 • Updated Aug 13, 2024
Retrieval-augmented generation (RAG) is an architectural design pattern prevalent in modern AI applications that provide generative AI functionality. RAG has gained adoption because it grounds the responses and outputs of large language models (LLMs) in relevant, factual, and up-to-date information. Its key contribution is supplementing the parametric knowledge of the LLM with non-parametric knowledge so that the model can generate adequate responses to user queries.
Modern AI applications that leverage LLMs and generative AI require more than effective response capabilities. AI engineers and developers should consider two other functionalities before moving RAG applications to production: semantic caching and memory. These two capabilities extend the usefulness of modern AI applications by reducing infrastructure costs and response latency and by enabling the storage of conversations.
Semantic caching is a process that utilizes a data store to keep a record of the queries and their results based on the semantics or context within the queries themselves.
This means that, as opposed to a traditional cache that caches data based on exact matches of data requests or specific identifiers, a semantic cache understands and leverages the meaning and relationships inherent in the data. Within an LLM or RAG application, this means that user queries that are either exact matches of, or contextually similar to, previously cached queries will benefit from an efficient information retrieval process.
Take, for example, an e-commerce platform's customer support chatbot; integrating semantic caching enables the system to respond to inquiries by understanding the context behind user queries. So, whether a customer asks about the "best smartphone for night photography" or "a phone for night photos," the chatbot can leverage its semantic cache to pull relevant, previously stored responses, improving both the efficiency and relevance of its answers.
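To make the contrast between an exact-match cache and a semantic cache concrete, here is a minimal sketch in plain Python. It is not how MongoDB or LangChain implement caching: the `embed` function is a hypothetical stand-in that counts words instead of calling an embedding model, `ToySemanticCache` is an illustrative name, and the `0.5` threshold is an arbitrary assumption for this toy example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (assumption, not a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToySemanticCache:
    """Returns a stored response when a new query is similar enough to a cached one."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, response)

    def update(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def lookup(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda entry: cosine(q, entry[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

cached_query = "best smartphone for night photography"
cached_answer = "previously generated answer about night photography"

exact_cache = {cached_query: cached_answer}
semantic_cache = ToySemanticCache()
semantic_cache.update(cached_query, cached_answer)

paraphrase = "best phone for night photography"
exact_hit = exact_cache.get(paraphrase)          # exact-match lookup misses
semantic_hit = semantic_cache.lookup(paraphrase)  # similarity lookup hits
```

Because the stand-in only matches shared words, it needs a paraphrase with heavy word overlap; a real embedding model would be expected to relate looser rewordings as well.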
LLM-powered chatbot interfaces are now prevalent in generative AI applications. Still, the conversations held between LLM and application users must be stored and retrieved to create a coherent and contextually relevant interaction history. The benefits of having a reference of interaction history lie in providing additional context to LLMs, understanding previously held conversations, improving the personalization of GenAI applications, and enabling the chatbot to provide more accurate responses to queries.
MongoDB Atlas vector search capabilities enable the creation of a semantic cache, and the new LangChain-MongoDB integration makes integrating this cache in RAG applications easier. The LangChain-MongoDB integration also makes implementing a conversation store for interactions with RAG applications easier.
Here's what’s covered in this tutorial:
- How to implement memory and storage of conversation history using LangChain and MongoDB
- How to implement semantic cache using LangChain and MongoDB
- Overview of semantic cache and memory utilization within RAG applications
The following GitHub repository contains all implementations presented in this tutorial, along with other use cases and examples of RAG implementations.
This section guides you through the installation process of the essential libraries needed to implement the RAG application, complete with memory and history capabilities, within your current development environment. Here is the list of required libraries:
- datasets: Python library to get access to datasets available on Hugging Face Hub
- langchain: Python toolkit for LangChain
- langchain-mongodb: Python package to use MongoDB as a vector store, semantic cache, chat history store, etc., in LangChain
- langchain-openai: Python package to use OpenAI models with LangChain
- pymongo: Python toolkit for MongoDB
- pandas: Python library for data analysis, exploration, and manipulation
```
! pip install -qU datasets langchain langchain-mongodb langchain-openai pymongo pandas
```
Do note that this tutorial utilizes OpenAI embedding and base models. To access the models, ensure you have an OpenAI API key.
In your development environment, create a reference to the OpenAI API key.
```python
import getpass

OPENAI_API_KEY = getpass.getpass("Enter your OpenAI API key:")
```
To handle the requirements for equipping the RAG application with the capabilities of storing interaction or conversation history and a semantic cache, two new collections must be created alongside the collection that will hold the main application data.
Creating a database and collection within MongoDB is made simple with MongoDB Atlas.
- Register a free Atlas account or sign in to your existing Atlas account.
- Follow the instructions (select Atlas UI as the procedure) to deploy your first cluster.
- Create the database: `langchain_chatbot`.
- Within the database `langchain_chatbot`, create the following collections:
  - `data`: Holds all data that acts as a knowledge source for the chatbot.
  - `history`: Holds all conversations held between the chatbot and the application user.
  - `semantic_cache`: Holds all queries made to the chatbot along with their LLM responses.
- Create a vector search index named `vector_index` for the `data` collection. This index enables the RAG application to retrieve records as additional context to supplement user queries via vector search. Below is the JSON definition of the `data` collection vector search index.
```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
- Create a vector search index with a text filter named `vector_index` for the `semantic_cache` collection. This index enables the RAG application to retrieve responses to queries semantically similar to a current query asked by the application user. Below is the JSON definition of the `semantic_cache` collection vector search index.
```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "llm_string",
      "type": "filter"
    }
  ]
}
```
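The two index definitions differ only in the extra filter field. As a sketch, they can be generated from one small helper and pasted into the Atlas UI as JSON. The helper name `build_vector_index_definition` is hypothetical and is not part of any MongoDB or LangChain API; it only builds a dictionary and does not create an index.

```python
import json

def build_vector_index_definition(path="embedding", num_dimensions=1536,
                                  similarity="cosine", filter_paths=()):
    """Build a vector search index definition as a plain dict (illustrative helper)."""
    fields = [{
        "numDimensions": num_dimensions,
        "path": path,
        "similarity": similarity,
        "type": "vector",
    }]
    # Each filter path becomes an additional field of type "filter"
    fields.extend({"path": p, "type": "filter"} for p in filter_paths)
    return {"fields": fields}

data_index = build_vector_index_definition()
cache_index = build_vector_index_definition(filter_paths=["llm_string"])

print(json.dumps(cache_index, indent=2))
```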
By the end of this step, you should have a database with three collections and two defined vector search indexes. The final step of this section is to obtain the connection URI string to the created Atlas cluster to establish a connection between the databases and the current development environment. Follow the steps to get the connection string from the Atlas UI.
In your development environment, create a reference to the MongoDB URI string.
```python
MONGODB_URI = getpass.getpass("Enter your MongoDB connection string:")
```
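When running outside a notebook, interactive prompts can be inconvenient. One possible alternative, sketched below, reads a secret from an environment variable and only falls back to `getpass` when the variable is not set. The helper `get_secret` and the variable name `DEMO_SECRET` are assumptions made for illustration.

```python
import getpass
import os

def get_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if it is set, otherwise prompt for it."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass.getpass(prompt)

# Placeholder value so that this sketch runs without prompting
os.environ["DEMO_SECRET"] = "placeholder-value"
demo_secret = get_secret("DEMO_SECRET", "Enter the demo secret:")
```

The same helper could be called with the names of the variables that hold your OpenAI API key and MongoDB connection string.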
This tutorial uses MongoDB’s embedded_movies dataset. Each data point in the dataset contains information about a particular movie: plot, genre, cast, runtime, and more. After loading the dataset into the development environment, it is converted into a pandas DataFrame, which makes data manipulation and analysis relatively easy.
```python
from datasets import load_dataset
import pandas as pd

data = load_dataset("MongoDB/embedded_movies")
df = pd.DataFrame(data["train"])

# Only keep records where the fullplot field is not null
df = df[df["fullplot"].notna()]

# Renaming the embedding field to "embedding" -- required by LangChain
df.rename(columns={"plot_embedding": "embedding"}, inplace=True)
```
The code above executes the following operations:
- Import the `load_dataset` function from the `datasets` library, which loads the dataset for this tutorial by specifying its path. The full dataset is loaded into the environment and referenced by the variable `data`.
- Only the dataset's train partition is required; the variable `df` holds a reference to the train partition as a pandas DataFrame.
- The DataFrame is filtered to only keep records where the `fullplot` field is not null. This step ensures that any subsequent operations or analyses that rely on the `fullplot` field, such as the embedding process, will not be hindered by missing data. The filtering uses the pandas `notna()` method to check for non-null entries in the `fullplot` column.
- The column `plot_embedding` in the DataFrame is renamed to `embedding`. This step is necessary for compatibility with LangChain, which requires an input field named `embedding`.
By the end of the operations in this section, we have a full dataset that acts as a knowledge source for the chatbot and is ready to be ingested into the `data` collection in the `langchain_chatbot` database.
Before adding chat history and caching, let’s first see how to create a simple RAG chain using LangChain, with MongoDB as the vector store. Here’s what the workflow looks like:
The user question is embedded, and relevant documents are retrieved from the MongoDB vector store. The retrieved documents, along with the user query, are passed as a prompt to the LLM, which generates an answer to the question.
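The workflow above can be sketched end to end without any external services. In the sketch below, everything is an assumption made for illustration: `embed` is a toy word-count function rather than an embedding model, the "vector store" is an in-memory list of made-up plot snippets, and the prompt is returned instead of being sent to an LLM.

```python
import math
from collections import Counter

# Made-up plot snippets standing in for documents in the vector store
DOCUMENTS = [
    "A retired thief plans one last heist in Paris.",
    "Two friends open their bakery in town.",
    "A detective hunts a thief across Hong Kong.",
]

def embed(text: str) -> Counter:
    """Toy embedding: word counts with punctuation stripped (not a real model)."""
    return Counter(word.strip(".,?!").lower() for word in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list:
    """Rank documents by similarity to the question and keep the top k."""
    q = embed(question)
    ranked = sorted(DOCUMENTS, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, docs: list) -> str:
    """Combine retrieved documents and the question into a single prompt."""
    context = "\n\n".join(docs)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\nQuestion: {question}\n"
    )

question = "Which movie is about a thief?"
top_docs = retrieve(question)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In the real chain, the final step would pass `prompt` to the LLM, which generates the answer.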
Let’s first ingest data into a MongoDB collection. We will use this collection as the vector store for our RAG chain.
```python
from pymongo import MongoClient

# Initialize MongoDB python client
client = MongoClient(MONGODB_URI)

DB_NAME = "langchain_chatbot"
COLLECTION_NAME = "data"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
collection = client[DB_NAME][COLLECTION_NAME]
```
The code above creates a MongoDB client and defines the database `langchain_chatbot` and collection `data` where we will store our data. Remember, you will also need to create a vector search index to efficiently retrieve data from the MongoDB vector store, as documented in Step 2 of this tutorial. To do this, refer to our official vector search index creation guide.
While creating the vector search index for the `data` collection, ensure that it is named `vector_index` and that the index definition looks as follows:
```json
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
NOTE: We set `numDimensions` to `1536` because we use OpenAI’s `text-embedding-ada-002` model to create embeddings.
Next, we delete any existing documents from the `data` collection and ingest our data into it:
```python
# Delete any existing records in the collection
collection.delete_many({})

# Data Ingestion
records = df.to_dict('records')
collection.insert_many(records)

print("Data ingestion into MongoDB completed")
```
Ingesting data into a MongoDB collection from a pandas DataFrame is a straightforward process. We first convert the DataFrame to a list of dictionaries and then utilize the `insert_many` method to bulk ingest documents into the collection.
With our data in MongoDB, let’s use it to construct a vector store for our RAG chain:
```python
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Using the text-embedding-ada-002 since that's what was used to create embeddings in the movies dataset
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model="text-embedding-ada-002")

# Vector Store Creation
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string=MONGODB_URI,
    namespace=DB_NAME + "." + COLLECTION_NAME,
    embedding=embeddings,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    text_key="fullplot"
)
```
We use the `from_connection_string` method of the `MongoDBAtlasVectorSearch` class from the `langchain_mongodb` integration to create a MongoDB vector store from a MongoDB connection URI. The `from_connection_string` method takes the following arguments:
- connection_string: MongoDB connection URI
- namespace: A valid MongoDB namespace (database and collection)
- embedding: Embedding model to use to generate embeddings for a vector search
- index_name: MongoDB Atlas vector search index name
- text_key: Field in the ingested documents that contains the text
The next step is to use the MongoDB vector store as a retriever in our RAG chain. In LangChain, a retriever is an interface that returns documents given a query. You can use a vector store as a retriever by using the `as_retriever` method:
```python
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})
```
`as_retriever` can take arguments such as `search_type`, i.e., what metric to use to retrieve documents. Here, we choose `similarity` since we want to retrieve the most similar documents to a given query. We can also specify additional search arguments such as `k`, i.e., the number of documents to retrieve. In our example, we set it to 5, which means the 5 most similar documents will be retrieved for a given query.
The final step is to put all of these pieces together to create a RAG chain.
NOTE: Chains in LangChain are a sequence of calls either to an LLM, a tool, or a data processing step. The recommended way to compose chains in LangChain is using the LangChain Expression Language (LCEL). Each component in a chain is referred to as a Runnable and can be invoked, streamed, etc., independently of other components in the chain.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Generate context using the retriever, and pass the user question through
retrieve = {"context": retriever | (lambda docs: "\n\n".join([d.page_content for d in docs])), "question": RunnablePassthrough()}
template = """Answer the question based only on the following context: \
{context}

Question: {question}
"""
# Defining the chat prompt
prompt = ChatPromptTemplate.from_template(template)
# Defining the model to be used for chat completion
model = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
# Parse output as a string
parse_output = StrOutputParser()

# Naive RAG chain
naive_rag_chain = (
    retrieve
    | prompt
    | model
    | parse_output
)
```
The code snippet above does the following:
- Defines the `retrieve` component: It takes the user input (a question) and sends it to the `retriever` to obtain similar documents. It also formats the output to match the input format expected by the next Runnable, which in this case is a dictionary with `context` and `question` as keys. The `RunnablePassthrough()` call for the `question` key indicates that the user input is simply passed through to the next stage under the `question` key.
- Defines the `prompt` component: It crafts a prompt by populating a prompt template with the `context` and `question` from the `retrieve` stage.
- Defines the `model` component: This specifies the chat model to use. We use OpenAI; unless specified otherwise, the `gpt-3.5-turbo` model is used by default.
- Defines the `parse_output` component: A simple output parser parses the result from the LLM into a string.
- Defines a `naive_rag_chain`: It uses LCEL pipe ( | ) notation to chain together the above components.
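To build intuition for how pipe notation can chain components, the sketch below overloads Python's `|` operator on a toy class. This is not LangChain's implementation: the `Step` class, the `fake_model` placeholder, and the lambda bodies are all assumptions made for illustration.

```python
class Step:
    """Toy stand-in for a Runnable: wraps a function and supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `self | other` builds a new Step that runs self first, then other
        return Step(lambda value: other.invoke(self.invoke(value)))

# Each stage produces the input expected by the next stage
retrieve = Step(lambda q: {"context": "doc about " + q, "question": q})
prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_model = Step(lambda p: p.upper())   # placeholder for an LLM call
parse_output = Step(lambda s: s.strip())

chain = retrieve | prompt | fake_model | parse_output
result = chain.invoke("thieves")

# Components remain usable on their own, outside the chain
prompt_only = prompt.invoke({"context": "some context", "question": "a question"})
```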
Let’s test out our chain by asking a question. We do this using the `invoke()` method, which is used to call a chain on an input:
```python
naive_rag_chain.invoke("What is the best movie to watch when sad?")
```
```
Output: Once a Thief
```
NOTE: With complex chains, it can be hard to tell whether or not information is flowing through them as expected. We highly recommend using LangSmith for debugging and monitoring in such cases. Simply grab an API key and set the following environment variables to view traces in the LangSmith UI:
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
```
Now that we have seen how to create a simple RAG chain, let’s see how to add chat message history to it and persist it in MongoDB. The workflow for this chain looks something like this:
The user question and the chat message history (if available) are first used to craft a standalone question to retrieve documents from the vector store. The user question, along with the retrieved documents and the chat history, are then passed as inputs to an LLM to generate an answer.
Let’s first define a function to get the chat message history for a particular session from MongoDB:
```python
from langchain_mongodb.chat_message_histories import MongoDBChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.prompts import MessagesPlaceholder

def get_session_history(session_id: str) -> MongoDBChatMessageHistory:
    return MongoDBChatMessageHistory(MONGODB_URI, session_id, database_name=DB_NAME, collection_name="history")
```
The `get_session_history` function takes in a `session_id` and returns an instance of `MongoDBChatMessageHistory`, which contains the chat history for that session, retrieved from the `history` collection of the `langchain_chatbot` database defined previously.
Next, let’s define a `question_chain` Runnable, which takes the chat message history and a follow-up question as input and creates a standalone question that can be passed as input to the retriever.
```python
standalone_system_prompt = """
Given a chat history and a follow-up question, rephrase the follow-up question to be a standalone question. \
Do NOT answer the question, just reformulate it if needed, otherwise return it as is. \
Only return the final standalone question. \
"""
standalone_question_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", standalone_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)

question_chain = standalone_question_prompt | model | parse_output
```
The retriever chain here looks a bit different from the one in Step 5:
```python
retriever_chain = RunnablePassthrough.assign(context=question_chain | retriever | (lambda docs: "\n\n".join([d.page_content for d in docs])))
```
Notice that the retriever chain here uses `RunnablePassthrough.assign(...)`. This adds the additional arguments passed to the assign function to the user input before passing it to the next component. This example will add the `context` obtained from the retriever to the user input before passing it to the prompt.
The prompt, in this case, consists of the user’s question, the retrieved context, and the session’s chat history:
```python
rag_system_prompt = """Answer the question based only on the following context: \
{context}
"""
rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rag_system_prompt),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)
```
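The behavior of `RunnablePassthrough.assign(...)` described above, keeping the original input and adding computed keys to it, can be approximated in plain Python. The `assign` function below is a hypothetical toy written for illustration, not LangChain code, and the lambda stands in for the `question_chain | retriever` pipeline.

```python
def assign(**computed):
    """Return a function that copies the input dict and adds computed keys (toy version)."""
    def run(inputs: dict) -> dict:
        extra = {key: func(inputs) for key, func in computed.items()}
        return {**inputs, **extra}  # original keys are preserved, new keys are added
    return run

# Stand-in for the retrieval pipeline that produces the context
add_context = assign(context=lambda inputs: "docs for: " + inputs["question"])

original = {"question": "How about something lighter?", "history": []}
output = add_context(original)
```

The result still contains `question` and `history`, which is why the prompt in this step can reference all three variables.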
Finally, we create a RAG chain and wrap it under `RunnableWithMessageHistory`, which is a Runnable that manages chat message history for another Runnable, including updates.
```python
# RAG chain
rag_chain = (
    retriever_chain
    | rag_prompt
    | model
    | parse_output
)

# RAG chain with history
with_message_history = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history",
)
```
The parameters to the `RunnableWithMessageHistory` Runnable above are as follows:
- rag_chain: Base runnable to wrap
- get_session_history: The function to get chat message history from
- input_messages_key: Key for input messages if the base runnable accepts a dictionary
- history_messages_key: Key for historical messages if the base runnable expects a message history
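To illustrate the general idea behind a history-managing wrapper with the parameters listed above, here is a toy version in plain Python. It is a sketch under stated assumptions, not how `RunnableWithMessageHistory` is implemented: an in-memory dictionary stands in for the `history` collection, and `with_history` and `echo_chain` are hypothetical names.

```python
from collections import defaultdict

# session_id -> list of (role, text); stands in for the `history` collection
_STORE = defaultdict(list)

def get_session_history(session_id: str) -> list:
    return _STORE[session_id]

def with_history(chain, input_key="question", history_key="history"):
    """Wrap `chain` so that each call loads, passes in, and updates session history."""
    def invoke(inputs: dict, config: dict):
        session_id = config["configurable"]["session_id"]
        history = get_session_history(session_id)
        # Inject a copy of the history under the key the chain expects
        answer = chain({**inputs, history_key: list(history)})
        # Persist the new exchange for future calls in the same session
        history.append(("human", inputs[input_key]))
        history.append(("ai", answer))
        return answer
    return invoke

def echo_chain(inputs: dict) -> str:
    """Placeholder for the RAG chain: reports how much history it received."""
    return f"seen {len(inputs['history'])} earlier messages; you asked: {inputs['question']}"

chat = with_history(echo_chain)
first = chat({"question": "What should I watch?"}, {"configurable": {"session_id": "1"}})
second = chat({"question": "Something lighter?"}, {"configurable": {"session_id": "1"}})
other = chat({"question": "Hello"}, {"configurable": {"session_id": "2"}})
```

The second call in session `"1"` sees the two messages stored by the first call, while session `"2"` starts with an empty history.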
Let’s test out our RAG chain with history by asking a question:
```python
with_message_history.invoke({"question": "What is the best movie to watch when sad?"}, {"configurable": {"session_id": "1"}})
```
```
Output: The best movie to watch when feeling down could be "Last Action Hero." It's a fun and action-packed film that blends reality and fantasy, offering an escape from the real world and providing an entertaining distraction.
```
```python
with_message_history.invoke({"question": "How about something more light?"}, {"configurable": {"session_id": "1"}})
```
```
Output: For a lighter movie option, you might enjoy "Cousins." It's a comedy film set in Barcelona with action and humor, offering a fun and entertaining escape from reality. The storyline is engaging and filled with comedic moments that could help lift your spirits.
```
Note the arguments passed to the `invoke` calls above. The first argument is a dictionary with a `question` key containing the user input. This key should match the value of the `input_messages_key` argument passed to the `RunnableWithMessageHistory` Runnable. The second argument is used to create a new chat message history, or look up an existing one, that matches the specified `session_id`.
Adding a semantic cache to the current RAG application is easy with the LangChain-MongoDB integration. This section covers the process of adding semantic caching to the RAG application, effectively reducing operational costs by storing the results of user questions and the questions themselves.
This allows the application to reference the results when the user asks a semantically similar question in the future, thereby reducing the number of API calls made to the LLM provider. The advantage of semantic caching is that it reduces both the latency and cost of the overall RAG application.
```python
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
from langchain_core.globals import set_llm_cache

set_llm_cache(MongoDBAtlasSemanticCache(
    connection_string=MONGODB_URI,
    embedding=embeddings,
    collection_name="semantic_cache",
    database_name=DB_NAME,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    wait_until_ready=True  # Optional, waits until the cache is ready to be used
))
```
In the code above, the following operations are executed:
- Import the `MongoDBAtlasSemanticCache` class, which provides the interface for connecting to MongoDB and creating the semantic cache store. `MongoDBAtlasSemanticCache` provides methods to initialize the cache, look up cached responses based on query prompts and LLM strings, update the cache with new responses, and clear cached entries either entirely or based on specific criteria.
- Import the `set_llm_cache` function, which sets the LLM cache for the current application environment.
- Configure the semantic cache by passing an instance of `MongoDBAtlasSemanticCache`, created with the appropriate arguments, to the `set_llm_cache` function.
With the semantic cache now configured for our RAG application, it is possible to monitor the query execution times. This observation will highlight the efficiency gains and benefits created by implementing the semantic cache.
```python
%%time
naive_rag_chain.invoke("What is the best movie to watch when sad?")
```
The code above invokes the RAG chain with the query, “What is the best movie to watch when sad?” The `%%time` command measures the duration of a specific operation within a Jupyter Notebook or IPython environment.
The first invocation of the chain with the query executes in 1.24 seconds, as the response block below shows.
```
CPU times: user 87.8 ms, sys: 670 µs, total: 88.5 ms
Wall time: 1.24 s
'Once a Thief'
```
To confirm that the semantic cache is working as expected, invoke the RAG chain again with a semantically similar query; it should show a reduced response time.
```python
%%time
naive_rag_chain.invoke("Which movie do I watch when sad?")
```
The output below shows that the semantically similar query has a significantly reduced response time: 255 ms, compared to 1.24 seconds for the first query.
```
CPU times: user 43.5 ms, sys: 4.16 ms, total: 47.7 ms
Wall time: 255 ms
'Once a Thief'
```
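The `%%time` magic is only available in Jupyter Notebook or IPython. In a regular Python script, one option is to measure wall time with `time.perf_counter`, as sketched below. The `timed` helper and the `slow_answer` function are assumptions made for illustration: `slow_answer` simulates a chain call with an artificial delay and a simple exact-match dictionary cache, so the timings say nothing about real LLM or MongoDB latency.

```python
import time

def timed(func, *args):
    """Call func(*args) and return (result, elapsed wall time in seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

_cache = {}

def slow_answer(question: str) -> str:
    """Placeholder for a chain call: slow on the first call, fast once cached."""
    if question not in _cache:
        time.sleep(0.05)  # simulated model latency
        _cache[question] = "cached answer"
    return _cache[question]

query = "What is the best movie to watch when sad?"
first_result, first_seconds = timed(slow_answer, query)
second_result, second_seconds = timed(slow_answer, query)
print(f"first call: {first_seconds:.3f}s, second call: {second_seconds:.3f}s")
```

In the tutorial's setting, `slow_answer` would be replaced by a call such as `naive_rag_chain.invoke`.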
NOTE: Semantic cache only caches the input to the LLM. When using it in retrieval chains, remember that documents retrieved can change between runs, resulting in cache misses for semantically similar queries.
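The note above can be illustrated with a small sketch. Since the cached input is the full prompt sent to the LLM, the retrieved context is part of what gets compared. The example below uses word-overlap (Jaccard) similarity as a rough stand-in for embedding similarity, and the context strings are invented placeholders; both are assumptions made for illustration.

```python
def build_prompt(context: str, question: str) -> str:
    """Mimic the shape of the RAG prompt: retrieved context plus the question."""
    return (
        "Answer the question based only on the following context: "
        f"{context}\n\nQuestion: {question}\n"
    )

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings (stand-in for embedding similarity)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

question = "What is the best movie to watch when sad?"
context_run_1 = "A lonely clown travels with a circus and learns to smile again."
context_run_2 = "Rival gangs fight over stolen paintings in a rainy harbor city."

prompt_run_1 = build_prompt(context_run_1, question)
prompt_same_docs = build_prompt(context_run_1, question)
prompt_new_docs = build_prompt(context_run_2, question)

same_context_score = jaccard(prompt_run_1, prompt_same_docs)
changed_context_score = jaccard(prompt_run_1, prompt_new_docs)
print(same_context_score, changed_context_score)
```

Even though the question is identical, the prompt built from different retrieved documents is less similar to the cached one, which is how a cache miss can occur.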
The RAG architectural design pattern has been shown to help generative AI applications deliver more relevant and up-to-date answers to user queries.
However, a robust modern application requires more than just a naive RAG implementation. To enhance applications' interaction with users, AI engineers and builders should consider factors such as personalization, history, and efficient response times as early as the POC or demo phase of development.
This tutorial has demonstrated how straightforward it is to integrate semantic caching and memory into RAG applications when facilitated by MongoDB and LangChain. Step-by-step instructions have been provided to guide the implementation of a RAG application; the creation of databases, collections, and indexes; and the utilization of LangChain to develop a RAG chain and application.
Following the steps outlined in this tutorial, developers can leverage the MongoDB Atlas Vector Search capabilities and the seamless LangChain-MongoDB integration to create more responsive, efficient, and intelligent RAG applications. The combination of semantic caching and conversation history reduces operational costs and significantly enhances the accuracy and personalization of responses generated by LLMs.
1. What is retrieval-augmented generation (RAG)?
RAG is a design pattern in AI applications that enhances the capabilities of large language models (LLMs) by grounding their responses with relevant, factual, and up-to-date information. This is achieved by supplementing LLMs' parametric knowledge with non-parametric knowledge, enabling the generation of more accurate and contextually relevant responses.
2. How does integrating memory and chat history enhance RAG applications?
Integrating memory and chat history into RAG applications allows for the retention and retrieval of past interactions between the large language model (LLM) and users. This functionality enriches the model's context awareness, enabling it to generate responses that are relevant to the immediate query and reflect the continuity and nuances of ongoing conversations. By maintaining a coherent and contextually relevant interaction history, RAG applications can offer more personalized and accurate responses, significantly enhancing the user experience and the application's overall effectiveness.
3. Why is semantic caching important in RAG applications?
Semantic caching stores the results of user queries and their associated responses based on the query's semantics. This approach allows for efficient information retrieval when semantically similar queries are made in the future, reducing API calls to LLM providers and lowering both latency and operational costs.
4. How does MongoDB Atlas support RAG applications?
MongoDB Atlas offers vector search capabilities, making it easier to implement semantic caches and conversation stores within RAG applications. This integration facilitates the efficient retrieval of semantically similar queries and the storage of interaction histories, enhancing the application's overall performance and user experience.
5. How can semantic caching reduce query execution times in RAG applications?
RAG applications can quickly retrieve cached answers for semantically similar queries without recomputing them by caching responses to queries based on their semantic content. This significantly reduces the time to generate responses, as demonstrated by the decreased query execution times upon subsequent similar queries.
6. What benefits does the LangChain-MongoDB integration offer?
This integration simplifies the process of adding semantic caching and memory capabilities to RAG applications. It enables the efficient management of conversation histories and the implementation of semantic caches using MongoDB's powerful vector search features, leading to improved application performance and user experience.
7. How does one measure the impact of semantic caching on a RAG application?
By monitoring query execution times before and after implementing semantic caching, developers can observe the efficiency gains the cache provides. A noticeable reduction in execution times for semantically similar queries indicates the cache's effectiveness in improving response speeds and reducing operational costs.