How to Build a RAG System With LlamaIndex, OpenAI, and MongoDB Vector Database
Large language models (LLMs) substantially benefit business applications, especially in use cases surrounding productivity. Although LLMs and their applications are undoubtedly advantageous, relying solely on the parametric knowledge of LLMs to respond to user inputs and prompts proves insufficient for private data or queries dependent on real-time data. This is why a non-parametric secure knowledge source that holds sensitive data and can be updated periodically is required to augment user inputs to LLMs with current and relevant information.
Retrieval-augmented generation (RAG) is a system design pattern that combines information retrieval techniques with generative AI models: semantically relevant data is retrieved to supplement the user's query with additional context, and the combined input is passed to an LLM to produce accurate, relevant responses.
Key takeaways:
- Overview of RAG
- Understanding the AI Stack
- How to build an end-to-end RAG system with MongoDB, LlamaIndex, and OpenAI
This tutorial will implement an end-to-end RAG system using the OLM (OpenAI, LlamaIndex, and MongoDB) or POLM (Python, OpenAI, LlamaIndex, MongoDB) AI Stack. The AI stack, or GenAI stack, refers to the composition of models, databases, libraries, and frameworks used to build and develop modern applications with generative AI capabilities.
A typical AI stack implements an infrastructure that leverages parametric knowledge from the LLM and non-parametric knowledge from data to augment user queries. Components of the AI stack include the following: models, orchestrators or integrators, and operational and vector databases. In this tutorial, MongoDB will act as both the operational and vector database.
| AI stack component | OLM/POLM stack |
| --- | --- |
| Language | Python |
| Model provider | OpenAI (GPT-3.5, GPT-4) |
| LLM orchestrator | LlamaIndex |
| Operational and vector database | MongoDB Atlas |
All implementation steps outlined in this article can be found in the repository. The following steps explain in some detail the library classes, methods, and processes used to implement the RAG system.
The code snippet below installs various libraries that will provide functionalities to access LLMs, reranking models, databases, and collection methods, abstracting complexities associated with extensive coding into a few lines and method calls.
- LlamaIndex: a data framework that provides functionality to connect data sources (files, PDFs, websites, or other data sources) to both closed (OpenAI, Cohere) and open-source (Llama) large language models; the LlamaIndex framework abstracts complexities associated with data ingestion, RAG pipeline implementation, and development of LLM applications (chatbots, agents, etc.).
- LlamaIndex (MongoDB): a LlamaIndex extension library that imports all the necessary methods to connect to and operate with the MongoDB Atlas database.
- LlamaIndex (OpenAI): a LlamaIndex extension library that imports all the necessary methods to access the OpenAI embedding models.
- PyMongo: a Python library for interacting with MongoDB that enables functionality to connect to a cluster and query data stored in collections and documents.
- Hugging Face datasets: a Hugging Face library that holds audio, vision, and text datasets.
- Pandas: provides data structures for efficient data processing and analysis using Python.
```
!pip install llama-index
!pip install llama-index-vector-stores-mongodb
!pip install llama-index-embeddings-openai
!pip install pymongo
!pip install datasets
!pip install pandas
```
The command below assigns an OpenAI API key to the environment variable OPENAI_API_KEY. This ensures LlamaIndex creates an OpenAI client with the provided OpenAI API key to access features such as LLM models (GPT-3, GPT-3.5-turbo, and GPT-4) and embedding models (text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large).
```
%env OPENAI_API_KEY=openai_key_here
```
The data utilised in this tutorial is sourced from Hugging Face datasets, specifically the AIatMongoDB/embedded_movies dataset. A datapoint within the movie dataset contains information corresponding to a particular movie; plot, genre, cast, runtime, and more are captured for each data point. After loading the dataset into the development environment, it is converted into a Pandas data frame object, which enables data structure manipulation and analysis with relative ease.
```python
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset['train'])

dataset_df.head(5)
```
The operations within this step focus on enforcing data integrity and quality. The first process ensures that each data point's `plot` attribute is not empty, as this is the primary data we utilise in the embedding process. This step also removes the `plot_embedding` attribute from all data points, as it will be replaced by new embeddings created with a different model, `text-embedding-3-small`.

```python
# Remove data points where the plot column is missing
dataset_df = dataset_df.dropna(subset=['plot'])
print("\nNumber of missing values in each column after removal:")
print(dataset_df.isnull().sum())

# Remove plot_embedding from each data point, as new embeddings will be created
# with the new OpenAI embedding model "text-embedding-3-small"
dataset_df = dataset_df.drop(columns=['plot_embedding'])
dataset_df.head(5)
```
An embedding object is initialised from the `OpenAIEmbedding` model, part of the `llama_index.embeddings` module. Specifically, the `OpenAIEmbedding` model takes two parameters: the embedding model name, `text-embedding-3-small` for this tutorial, and the dimensions of the vector embedding.

The code snippet below configures the embedding model and LLM utilised throughout the development environment. The LLM used to respond to user queries is the default OpenAI model enabled via LlamaIndex and is initialised with the `OpenAI()` class. To ensure consistency across all consumers of LLMs and their configuration, LlamaIndex provides the `Settings` module, which enables a global configuration of the LLMs and embedding models utilised in the environment.

```python
from llama_index.core.settings import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=256)
llm = OpenAI()

Settings.llm = llm
Settings.embed_model = embed_model
```
Next, it's crucial to appropriately format the dataset and its contents for MongoDB ingestion. In the upcoming steps, we'll transform the current structure of the dataset, `dataset_df`, presently a DataFrame object, into a JSON string. This conversion is done in the line `documents_json = dataset_df.to_json(orient='records')`. By specifying `orient='records'`, each row of the DataFrame is converted into a separate JSON object.

The following step creates a list of Python dictionaries, `documents_list`, each representing an individual record from the original DataFrame. The final step in this process is to convert each dictionary into a manually constructed Document, a first-class citizen that holds information extracted from a data source. Documents within LlamaIndex hold information, such as metadata, that is utilised in downstream processing and ingestion stages in a RAG pipeline.

One important point to note is that when creating a LlamaIndex document manually, it's possible to configure which of the document's attributes are utilised when passed as input to embedding models and LLMs. The `excluded_llm_metadata_keys` and `excluded_embed_metadata_keys` arguments on the Document class constructor take a list of attributes to ignore when generating inputs for downstream processes within a RAG pipeline. A reason for doing this is to limit the context utilised within embedding models for more relevant retrievals; in the case of LLMs, it controls the metadata information combined with user queries. Without configuration of either of the two arguments, a document by default utilises all content in its metadata as embedding and LLM input.

At the end of this step, a Python list contains documents corresponding to each data point in the preprocessed dataset.
```python
import json
from llama_index.core import Document
from llama_index.core.schema import MetadataMode

# Convert the DataFrame to a JSON string representation
documents_json = dataset_df.to_json(orient='records')
# Load the JSON string into a Python list of dictionaries
documents_list = json.loads(documents_json)

llama_documents = []

for document in documents_list:

    # Value for metadata must be one of (str, int, float, None)
    document["writers"] = json.dumps(document["writers"])
    document["languages"] = json.dumps(document["languages"])
    document["genres"] = json.dumps(document["genres"])
    document["cast"] = json.dumps(document["cast"])
    document["directors"] = json.dumps(document["directors"])
    document["countries"] = json.dumps(document["countries"])
    document["imdb"] = json.dumps(document["imdb"])
    document["awards"] = json.dumps(document["awards"])

    # Create a Document object with the text and excluded metadata for llm and embedding models
    llama_document = Document(
        text=document["fullplot"],
        metadata=document,
        excluded_llm_metadata_keys=["fullplot", "metacritic"],
        excluded_embed_metadata_keys=["fullplot", "metacritic", "poster", "num_mflix_comments", "runtime", "rated"],
        metadata_template="{key}=>{value}",
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )

    llama_documents.append(llama_document)

# Observing an example of what the LLM and Embedding model receive as input
print(
    "\nThe LLM sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.LLM),
)
print(
    "\nThe Embedding model sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.EMBED),
)
```
The final step of processing before ingesting the data to the MongoDB vector store is to convert the list of LlamaIndex documents into another first-class citizen data structure known as nodes. Once we have the nodes generated from the documents, the next step is to generate embedding data for each node using the content in the text and metadata attributes.
```python
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(llama_documents)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
```
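As an optional sanity check (not part of the original steps), you can confirm how many nodes were produced and that each embedding has the 256 dimensions configured earlier:

```python
# Optional sanity check on the generated nodes and their embeddings
print(f"Number of nodes: {len(nodes)}")
print(f"Embedding dimensions: {len(nodes[0].embedding)}")  # expected: 256
```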
Before moving forward, ensure the following prerequisites are met:
- Database cluster setup on MongoDB Atlas
- Obtained the URI to your cluster
For assistance with database cluster setup and obtaining the URI, refer to our guide for setting up a MongoDB cluster, and our guide to get your connection string. Alternatively, follow Step 5 of this article on using embeddings in a RAG system, which offers detailed instructions on configuring and setting up the database cluster.
Once you have successfully created a cluster, create the database and collection within the MongoDB Atlas cluster by clicking + Create Database. The database will be named `movies`, and the collection will be named `movies_records`.

After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.

This guide uses Google Colab, which offers a feature for secure storage of environment secrets. These secrets can then be accessed within the development environment. Specifically, the line `mongo_uri = userdata.get('MONGO_URI')` retrieves the URI from the secure storage. The code snippet below also utilises PyMongo to create a MongoDB client object, representing the connection to the cluster and enabling access to its databases and collections.
```python
import pymongo
from google.colab import userdata

def get_mongo_client(mongo_uri):
    """Establish connection to the MongoDB."""
    try:
        client = pymongo.MongoClient(mongo_uri)
        print("Connection to MongoDB successful")
        return client
    except pymongo.errors.ConnectionFailure as e:
        print(f"Connection failed: {e}")
        return None

mongo_uri = userdata.get('MONGO_URI')
if not mongo_uri:
    print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

DB_NAME = "movies"
COLLECTION_NAME = "movies_records"

db = mongo_client[DB_NAME]
collection = db[COLLECTION_NAME]
```
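Note that `MongoClient` connects lazily, so the `try/except` above will not necessarily catch a bad URI at construction time. As an optional extra check (an addition to the original steps), you can force a round trip to the cluster with a ping command:

```python
# Optional: force a round trip to the cluster to verify the connection
mongo_client.admin.command("ping")
print("Successfully connected to MongoDB")
```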
The following code guarantees that the current database collection is empty by executing the `delete_many()` operation on the collection.

```python
# To ensure we are working with a fresh collection,
# delete any existing records in the collection
collection.delete_many({})
```
Creating a vector search index within the `movies_records` collection is essential for efficient document retrieval from MongoDB into our development environment. To achieve this, refer to the official guide on vector search index creation.

When creating the vector search index using the JSON editor on MongoDB Atlas, ensure your vector search index is named `vector_index` and that the vector search index definition is as follows:

```json
{
  "fields": [
    {
      "numDimensions": 256,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
After setting up the vector search index, data can be ingested and retrieved efficiently. Data ingestion is a trivial process achieved with fewer than three lines of code when leveraging LlamaIndex.
Up to this point, we have successfully done the following:
- Loaded data sourced from Hugging Face
- Generated an embedding for each data point using the OpenAI embedding model
- Set up a MongoDB database designed to store vector embeddings
- Established a connection to this database from our development environment
- Defined a vector search index for efficient querying of vector embeddings
The code snippet below initialises a MongoDB Atlas vector store object via the LlamaIndex constructor `MongoDBAtlasVectorSearch`. It's important to note that in this step, we reference the name of the vector search index previously created via the MongoDB Atlas interface. For this specific use case, the index name is `vector_index`.

The crucial method that executes the ingestion of nodes into a specified vector store is the `.add()` method of the LlamaIndex MongoDB vector store instance.
```python
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

vector_store = MongoDBAtlasVectorSearch(
    mongo_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    index_name="vector_index",
)
vector_store.add(nodes)
```
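As an optional verification step (an addition to the original tutorial), you can use PyMongo to confirm that the nodes and their embeddings were written to the collection:

```python
# Optional sanity check: confirm the nodes were written to MongoDB
doc_count = collection.count_documents({})
print(f"Documents in {DB_NAME}.{COLLECTION_NAME}: {doc_count}")

# Inspect one ingested document; the 'embedding' field is truncated to its first 5 values
sample = collection.find_one({}, {"embedding": {"$slice": 5}})
print(sample)
```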
The next step is to create a LlamaIndex index over the ingested data. Within LlamaIndex, when documents are loaded into any of the index abstraction classes (`SummaryIndex`, `TreeIndex`, `KnowledgeGraphIndex`, and especially `VectorStoreIndex`), an index that stores a representation of the original documents is built in an in-memory vector store that also stores embeddings.

But since the MongoDB Atlas vector database is utilised in this RAG system to store the embeddings and the index for our documents, LlamaIndex enables the retrieval of the index from Atlas via the `from_vector_store` method of the `VectorStoreIndex` class.

```python
from llama_index.core import VectorStoreIndex, StorageContext

index = VectorStoreIndex.from_vector_store(vector_store)
```
The next step involves creating a LlamaIndex query engine. The query engine enables the use of natural language to retrieve relevant, contextually appropriate information from a data index. The `as_query_engine` method provided by LlamaIndex abstracts away the implementation code that AI engineers and developers would otherwise write to process queries and extract information from a data source.

For our use case, the query engine satisfies the requirement of building a question-and-answer application. However, LlamaIndex also provides the ability to construct a chat-like application with its Chat Engine functionality, a sketch of which follows the query engine code below.
```python
import pprint
from llama_index.core.response.notebook_utils import display_response

query_engine = index.as_query_engine(similarity_top_k=3)

query = "Recommend a romantic movie suitable for the Christmas season and justify your selection"

response = query_engine.query(query)
display_response(response)
pprint.pprint(response.source_nodes)
```
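The Chat Engine mentioned above maintains conversation history across turns, unlike the stateless query engine. Below is a minimal, hedged sketch of what that could look like against the same index; the `chat_mode` value shown is one of several options and is an illustrative choice, not part of the original tutorial:

```python
# A minimal sketch of a chat-style interface over the same index.
# Unlike the stateless query engine, the chat engine keeps conversation history.
chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)

first_reply = chat_engine.chat("Recommend a romantic movie suitable for the Christmas season")
print(first_reply)

# Follow-up questions can rely on the conversation history kept by the chat engine
follow_up = chat_engine.chat("Is it suitable for a family with young children?")
print(follow_up)
```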
Incorporating RAG architectural design patterns improves LLM performance within modern generative AI applications and introduces a cost-conscious approach to building robust AI infrastructure. As this article demonstrates, building a robust RAG system with minimal code is a straightforward process when using components such as MongoDB as the vector database and LlamaIndex as the LLM orchestrator.
In particular, this tutorial covered the implementation of a RAG system that leverages the combined capabilities of Python, OpenAI, LlamaIndex, and the MongoDB vector database, also known as the POLM AI stack.
It should be mentioned that fine-tuning is still a viable strategy for improving the capabilities of LLMs and updating their parametric knowledge. However, for AI engineers who consider the economics of building and maintaining GenAI applications, exploring cost-effective methods that improve LLM capabilities is worth considering, even if it is experimental.
The associated cost of data sourcing, the acquisition of hardware accelerators, and the domain expertise needed for fine-tuning LLMs and foundation models often entail significant investment, making exploring more cost-effective methods, such as RAG systems, an attractive alternative.
Notably, the cost implications of fine-tuning and model training underscore the need for AI engineers and developers to adopt a cost-saving mindset from the early stages of an AI project. Most applications today already have, or soon will have, some form of generative AI capability supported by an AI infrastructure. To this point, it becomes a key aspect of an AI engineer's role to communicate and express the value of exploring cost-effective solutions to stakeholders and key decision-makers when developing AI infrastructure.
Q: What is a retrieval-augmented generation (RAG) system?
Retrieval-augmented generation (RAG) is a design pattern that improves the capabilities of LLMs by using retrieval models to fetch semantically relevant information from a database. This additional context is combined with the user's query to generate more accurate and relevant responses from LLMs.
Q: What are the key components of an AI stack in a RAG system?
The essential components include models (like GPT-3.5, GPT-4, or Llama), orchestrators or integrators for managing interactions between LLMs and data sources, and operational and vector databases for storing and retrieving data efficiently.