Enhancing LLM Accuracy Using MongoDB Vector Search and Unstructured.io Metadata
Despite the remarkable strides in artificial intelligence, particularly in generative AI (gen AI), precision remains an elusive goal for large language model (LLM) outputs. According to the latest annual McKinsey Global Survey, “The state of AI in 2023,” gen AI has had a breakout year. Nearly one-quarter of C-suite executives personally use gen AI tools for work, and over 25% of companies with AI implementations have gen AI on their boards' agendas. Additionally, 40% of respondents plan to increase their organization's investment in AI because of advances in gen AI. The survey reflects the immense potential and rapid adoption of AI technologies. However, the survey also points to a significant concern: inaccuracy.
Inaccuracy in LLMs often results in "hallucinations" or incorrect information due to limitations like shallow semantic understanding and varying data quality. Incorporating semantic vector search using MongoDB can help by enabling real-time querying of training data, ensuring that generated responses align closely with what the model has learned. Furthermore, adding metadata filtering extracted by Unstructured tools can refine accuracy by allowing the model to weigh the reliability of its data sources. Together, these methods can significantly minimize the risk of hallucinations and make LLMs more reliable.
This article addresses this challenge by providing a comprehensive guide to enhancing the precision of your LLM outputs using MongoDB's Vector Search and Unstructured metadata extraction techniques. The main purpose of this tutorial is to equip you with the knowledge and tools needed to incorporate external source documents into your LLM workflow, thereby enriching the model's responses with well-sourced and contextually accurate information. By the end of this tutorial, you will be able to generate precise output from the OpenAI GPT-4 model that cites the source document, including the filename and page number. The entire notebook for this tutorial is available on Google Colab, but we will be going over sections of the tutorial together.
MongoDB is a NoSQL database, which stands for "Not Only SQL," highlighting its flexibility in handling data that doesn't fit well in tabular structures like those in SQL databases. NoSQL databases are particularly well-suited for storing unstructured and semi-structured data, offering a more flexible schema, easier horizontal scaling, and the ability to handle large volumes of data. This makes them ideal for applications requiring quick development and the capacity to manage vast metadata arrays.
MongoDB's robust vector search capabilities and ability to seamlessly handle vector data and metadata make it an ideal platform for improving the precision of LLM outputs. It allows for multifaceted searches based on semantic similarity and various metadata attributes. This unique feature set distinguishes MongoDB from traditional developer data platforms and significantly enhances the accuracy and reliability of the results in language modeling tasks.
The Unstructured open-source library provides components for ingesting and preprocessing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of Unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. Unstructured's modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficiently transforming unstructured data into structured outputs.
Metadata is often referred to as "data about data." It provides contextual or descriptive information about the primary data, such as its source, format, and relevant characteristics. The metadata from the Unstructured tools tracks various details about elements extracted from documents, enabling users to filter and analyze these elements based on particular metadata of interest. The metadata fields include information about the source document and data connectors.
The concept of metadata is familiar, but its application in the context of unstructured data brings many opportunities. The Unstructured package tracks a variety of metadata at the element level. This metadata can be accessed with element.metadata and converted to a Python dictionary representation using element.metadata.to_dict().

In this article, we particularly focus on the filename and page_number metadata to enhance the traceability and reliability of the LLM outputs. By doing so, we can cite the exact location in the PDF file that provides the answer to a user query. This becomes especially crucial when the LLM answers queries related to sensitive topics such as financial, legal, or medical questions.

In this tutorial, we will:
- Extract the texts and metadata from source documents using Unstructured's partition_pdf.
- Prepare the data for storage and retrieval in MongoDB.
- Vectorize the texts using the SentenceTransformer library.
- Connect and upload records into MongoDB Atlas.
- Query the index based on embedding similarity.
- Generate the LLM output using the OpenAI Model.
Please make sure you have installed the required libraries to run the necessary code.
# Install Unstructured partition for PDF and dependencies
!pip install "unstructured[pdf]"
!apt-get -qq install poppler-utils tesseract-ocr
!pip install -q --user --upgrade pillow

!pip install pymongo
!pip install sentence-transformers
Step 1: Extract the texts and metadata from source documents using Unstructured's partition_pdf
We'll delve into extracting data from a PDF document, specifically the seminal "Attention Is All You Need" paper, using the partition_pdf function from the Unstructured library in Python. First, you'll need to import the function with from unstructured.partition.pdf import partition_pdf. Then, you can call partition_pdf and pass in the necessary parameters:
- filename specifies the PDF file to process, which is "example-docs/Attention is All You Need.pdf."
- strategy sets the extraction type; for a more comprehensive scan, we use "hi_res."
- Finally, infer_table_structure=True tells the function to also extract table metadata.
Properly set up, as you can see in our Colab file, the code looks like this:
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/Attention is All You Need.pdf",
                         strategy="hi_res",
                         infer_table_structure=True)
By running this code, you'll populate the elements variable with all the extracted information from the PDF, ready for further analysis or manipulation. In the Colab's code snippets, you can inspect the extracted texts and element metadata. To observe sample outputs (i.e., the element type and text), run the line below and confirm that your output matches the one that follows.

display(*[(type(element), element.text) for element in elements[14:18]])
Output:
(unstructured.documents.elements.NarrativeText,
 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.')
(unstructured.documents.elements.NarrativeText,
 '∗Equal contribution. Listing order is random....
You can also use Counter from the Python collections module to count the number of element types identified in the document.

from collections import Counter
display(Counter(type(element) for element in elements))

# outputs
Counter({unstructured.documents.elements.NarrativeText: 86,
         unstructured.documents.elements.Title: 56,
         unstructured.documents.elements.Text: 45,
         unstructured.documents.elements.Header: 3,
         unstructured.documents.elements.Footer: 9,
         unstructured.documents.elements.Image: 5,
         unstructured.documents.elements.FigureCaption: 5,
         unstructured.documents.elements.Formula: 5,
         unstructured.documents.elements.ListItem: 43,
         unstructured.documents.elements.Table: 4})
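The element-level metadata also lets you slice the document before it ever reaches a database. As a quick illustration that is not part of the original notebook, the snippet below collects the narrative text found on a single page; element.metadata.page_number is the same field we will rely on later for citations.

from unstructured.documents.elements import NarrativeText

# illustrative helper: collect narrative text from one page of the parsed PDF
def narrative_text_on_page(elements, page):
    return [
        el.text
        for el in elements
        if isinstance(el, NarrativeText) and el.metadata.page_number == page
    ]

# e.g., inspect what the parser extracted from page 3
for text in narrative_text_on_page(elements, page=3):
    print(text[:120], "...")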
Finally, you can convert the element objects into Python dictionaries using the convert_to_dict built-in function to selectively extract and modify the element metadata.

from unstructured.staging.base import convert_to_dict

# built-in function to convert elements into Python dictionaries
records = convert_to_dict(elements)

# display the first record
records[0]

# output
{'type': 'NarrativeText',
 'element_id': '6b82d499d67190c0ceffe3a99958e296',
 'metadata': {'coordinates': {'points': ((327.6542053222656, 199.8135528564453),
                                         (327.6542053222656, 315.7165832519531),
                                         (1376.0062255859375, 315.7165832519531),
                                         (1376.0062255859375, 199.8135528564453)),
                              'system': 'PixelSpace',
                              'layout_width': 1700,
                              'layout_height': 2200},
              'filename': 'Attention is All You Need.pdf',
              'last_modified': '2023-10-09T20:15:36',
              'filetype': 'application/pdf',
              'page_number': 1,
              'detection_class_prob': 0.5751863718032837},
 'text': 'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.'}
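Since filename and page_number drive the citations later in this tutorial, it can be worth confirming up front that every record carries them. This quick check is illustrative only and not part of the original notebook:

# sanity check: each record should carry the metadata we cite later
print(len(records), "records extracted")

missing = [
    r for r in records
    if "filename" not in r["metadata"] or "page_number" not in r["metadata"]
]
print(len(missing), "records missing filename/page_number metadata")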
Step 2a: Vectorize the texts using the SentenceTransformer library.
To enable data retrieval with both metadata and vector search, we must include the extracted element metadata when storing and retrieving the texts from MongoDB Atlas.
First, we vectorize the texts to perform a similarity-based vector search. In this example, we use microsoft/mpnet-base from the Sentence Transformers library. This model has an embedding size of 768.

from sentence_transformers import SentenceTransformer
from pprint import pprint

model = SentenceTransformer('microsoft/mpnet-base')

# Let's test and check the embedding size of this model
emb = model.encode("this is a test").tolist()
print(len(emb))
print(emb[:10])
print("\n")

# output
768
[-0.15820945799350739, 0.008249259553849697, -0.033347081393003464, …]
It is important to use a model whose embedding size matches the one defined in the MongoDB Atlas index. You can define the index using the JSON syntax below, with dimensions set to 768 to match the mpnet-base model:
1 { 2 "type": "vectorSearch, 3 "fields": [{ 4 "path": "embedding", 5 "dimensions": 768, # the dimension of `mpnet-base` model 6 "similarity": "euclidean", 7 "type": "vector" 8 }] 9 }
Copy and paste the JSON index into your MongoDB Atlas collection (or create it programmatically, as sketched above) so it can index the embedding field in the records. Please view this documentation on how to index vector embeddings for Vector Search.

Next, create the text embedding for each record before uploading it to MongoDB Atlas:
for record in records:
    txt = record['text']

    # use the embedding model to vectorize the text into the record
    record['embedding'] = model.encode(txt).tolist()

# print the first record with embedding
records[0]

# output
{'type': 'NarrativeText',
 'element_id': '6b82d499d67190c0ceffe3a99958e296',
 'metadata': {'coordinates': {'points': ((327.6542053222656, 199.8135528564453),
                                         (327.6542053222656, 315.7165832519531),
                                         (1376.0062255859375, 315.7165832519531),
                                         (1376.0062255859375, 199.8135528564453)),
                              'system': 'PixelSpace',
                              'layout_width': 1700,
                              'layout_height': 2200},
              'filename': 'Attention is All You Need.pdf',
              'last_modified': '2023-10-09T20:15:36',
              'filetype': 'application/pdf',
              'page_number': 1,
              'detection_class_prob': 0.5751863718032837},
 'text': 'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.',
 'embedding': [-0.018366225063800812,
               -0.10861606895923615,
               0.00344603369012475,
               0.04939081519842148,
               -0.012352174147963524,
               -0.04383034259080887,...],
 '_id': ObjectId('6524626a6d1d8783bb807943')}
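Encoding one text at a time works, but for larger document sets it is usually faster to pass the whole list to the model and let Sentence Transformers batch the work internally. This is an optional alternative sketch that produces the same embedding field:

# encode all texts in one call; returns one vector per input text
texts = [record['text'] for record in records]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)

for record, emb in zip(records, embeddings):
    record['embedding'] = emb.tolist()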
Step 2b: Connect and upload records into MongoDB Atlas
Before we can store our records on MongoDB, we will use the PyMongo library to establish a connection to the target MongoDB database and collection. Use this code snippet to connect and test the connection (see the MongoDB documentation on connecting to your cluster).
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

uri = "<<REPLACE THE URI TO CONNECT TO CLUSTER>>"

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)
Once run, you will see the output: “Pinged your deployment. You successfully connected to MongoDB!”
Next, we can upload the records using PyMongo's insert_many function. To do this, specify the target database and collection names, and please make sure they match the ones in MongoDB Atlas.
1 db_name = "unstructured_db" 2 collection_name = "unstructured_col" 3 4 # delete all first 5 client[db_name][collection_name].delete_many({}) 6 7 # insert 8 client[db_name][collection_name].insert_many(records)
Let’s preview the records in MongoDB Atlas:
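If you prefer to check from the notebook rather than the Atlas UI, a quick find_one (excluding the long embedding array for readability) shows what was stored. This is an illustrative check, not part of the original notebook:

from pprint import pprint

# fetch a single stored document, hiding the embedding field for readability
pprint(client[db_name][collection_name].find_one({}, {"embedding": 0}))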
Step 2c: Query the index based on embedding similarity
Now that the records are stored in MongoDB Atlas, we can retrieve the relevant texts by computing the similarity score defined in the vector search index. When a user sends a query, we vectorize it using the same embedding model we used to store the data and then call the aggregate function with a pipeline that performs the vector search.
In the $vectorSearch stage of the pipeline, we will specify the following:
- index: The name of the vector search index in the collection
- queryVector: The vectorized query from the user
- path: The field that holds the embeddings
- limit: The number of the most similar records we want to return
- numCandidates: The number of candidate nearest neighbors considered during the search
In the $project stage, we exclude the embedding and _id fields and surface the similarity score generated by MongoDB Atlas in a score field.
1 query = "Does the encoder contain self-attention layers?" 2 vector_query = model.encode(query).tolist() 3 4 pipeline = [ 5 { 6 "$vectorSearch": { 7 "index":"default", 8 "queryVector": vector_query, 9 "path": "embedding", 10 "limit": 5, 11 "numCandidates": 50 12 } 13 }, 14 { 15 "$project": { 16 "embedding": 0, 17 "_id": 0, 18 "score": { 19 "$meta": "searchScore" 20 }, 21 } 22 } 23 ] 24 25 results = list(client[db_name][collection_name].aggregate(pipeline))
The above pipeline will return the top five records closest to the user’s query embedding. We can adjust limit to retrieve the top-k records from MongoDB Atlas. Please note that the results contain the metadata, text, and score. We can use this information to generate the LLM output in the following step.

Here’s one example of the top five nearest neighbors from the query above:
{'element_id': '7128012294b85295c89efee3bc5e72d2',
 'metadata': {'coordinates': {'layout_height': 2200,
                              'layout_width': 1700,
                              'points': [[290.50477600097656, 1642.1170677777777],
                                         [290.50477600097656, 1854.9523748867755],
                                         [1403.820083618164, 1854.9523748867755],
                                         [1403.820083618164, 1642.1170677777777]],
                              'system': 'PixelSpace'},
              'detection_class_prob': 0.9979791045188904,
              'file_directory': 'example-docs',
              'filename': 'Attention is All You Need.pdf',
              'filetype': 'application/pdf',
              'last_modified': '2023-09-20T17:08:35',
              'page_number': 3,
              'parent_id': 'd1375b5e585821dff2d1907168985bfe'},
 'score': 0.2526094913482666,
 'text': 'Decoder: The decoder is also composed of a stack of N = 6 identical '
         'layers. In addition to the two sub-layers in each encoder layer, '
         'the decoder inserts a third sub-layer, which performs multi-head '
         'attention over the output of the encoder stack. Similar to the '
         'encoder, we employ residual connections around each of the '
         'sub-layers, followed by layer normalization. We also modify the '
         'self-attention sub-layer in the decoder stack to prevent positions '
         'from attending to subsequent positions. This masking, combined with '
         'fact that the output embeddings are offset by one position, ensures '
         'that the predictions for position i can depend only on the known '
         'outputs at positions less than i.',
 'type': 'NarrativeText'}
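Because every stored document carries the Unstructured metadata, you can also combine the semantic search with a metadata filter. The sketch below is an assumption, not part of the original notebook: it presumes you extend the vector index definition with a filter field for metadata.page_number (a field of type "filter"), and it then restricts the search to the first three pages of the paper.

# assumes the index also declares {"type": "filter", "path": "metadata.page_number"}
filtered_pipeline = [
  {
    "$vectorSearch": {
      "index": "default",
      "queryVector": vector_query,
      "path": "embedding",
      "filter": {"metadata.page_number": {"$lte": 3}},
      "limit": 5,
      "numCandidates": 50
    }
  },
  {
    "$project": {
      "embedding": 0,
      "_id": 0,
      "score": {"$meta": "vectorSearchScore"}
    }
  }
]

filtered_results = list(client[db_name][collection_name].aggregate(filtered_pipeline))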
Step 3: Generate the LLM output with source document citation
We can generate the output using the OpenAI GPT-4 model. We will use the ChatCompletion function from the OpenAI API for this final step. The ChatCompletion API processes a list of messages to generate a model-driven response. Designed for multi-turn conversations, it is equally adept at single-turn tasks. The primary input is the messages parameter, comprising an array of message objects with designated roles ("system", "user", or "assistant") and content. Conversations are usually initiated with a system message to guide the assistant's behavior and can vary in length with alternating user and assistant messages. While the system message is optional, its absence may default the model to a generic helpful-assistant behavior.

You’ll need an OpenAI API key to run the inferences. Before attempting this step, please ensure you have an OpenAI account. Assuming you store your OpenAI API key in your environment variables, you can import it using the os.getenv function:

import os
import openai

# Get the API key from the env
openai.api_key = os.getenv("OPENAI_API_KEY")
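The prompt below refers to a context variable that holds the retrieval results from Step 2c. The notebook assembles this context from those results; one minimal, illustrative way to build it, keeping the texts plus the filename and page_number metadata we want cited, is:

import json

# build the assistant context from the vector search results,
# keeping the text plus the filename/page_number metadata we want cited
context = json.dumps([
    {
        "texts": r["text"],
        "filename": r["metadata"]["filename"],
        "page_number": r["metadata"]["page_number"],
    }
    for r in results
])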
Next, having a compelling prompt is crucial for generating a satisfactory result. Here’s the prompt used to generate output that cites exactly where the information comes from, i.e., the filename and page number.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a useful assistant. Use the assistant's content to answer the user's query \
        Summarize your answer using the 'texts' and cite the 'page_number' and 'filename' metadata in your reply."},
        {"role": "assistant", "content": context},
        {"role": "user", "content": query},
    ],
    temperature=0.2
)
In this Python script, a request is made to the OpenAI GPT-4 model through the ChatCompletion.create method to process a conversation. The conversation is structured with predefined roles and messages. The model is instructed to generate a response based on the provided context and user query, summarizing the answer while citing the page number and filename. The temperature parameter, set to 0.2, influences the randomness of the output, favoring more deterministic responses.

One of the key features of leveraging Unstructured metadata in conjunction with MongoDB's Vector Search is the ability to provide highly accurate and traceable outputs.
User query: "Does the encoder contain self-attention layers?"
You can insert this query into the ChatCompletion API as the “user” role and the context from the MongoDB retrieval results as the “assistant” role. To ensure the model responds with the filename and page number, you can provide the instruction in the “system” role.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a useful assistant. Use the assistant's content to answer the user's query \
        Summarize your answer using the 'texts' and cite the 'page_number' and 'filename' metadata in your reply."},
        {"role": "assistant", "content": context},
        {"role": "user", "content": query},
    ],
    temperature=0.2
)

print(response)

# output
{
  "id": "chatcmpl-87rNcLaEYREimtuWa0bpymWiQbZze",
  "object": "chat.completion",
  "created": 1696884180,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Yes, the encoder does contain self-attention layers. This is evident from the text on page 5 of the document \"Attention is All You Need.pdf\"."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1628,
    "completion_tokens": 32,
    "total_tokens": 1660
  }
}
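To display just the answer rather than the full response object, read the message content from the first choice (this follows the openai 0.x response shape shown above):

# extract the assistant's reply text from the response object
answer = response["choices"][0]["message"]["content"]
print(answer)
# prints the content shown above, e.g., "Yes, the encoder does contain self-attention layers. ..."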
Source document:
LLM Output:
The highly specific output cites information from the source document, "Attention is All You Need.pdf," stored in the 'example-docs' directory. The answers are referenced with exact page numbers, making it easy for anyone to verify the information. This level of detail is crucial when answering queries related to research, legal, or medical questions, and it significantly enhances the trustworthiness and reliability of the LLM outputs.
This article presents a method to enhance LLM precision using MongoDB's Vector Search and Unstructured Metadata extraction techniques. These approaches, facilitating real-time querying and metadata filtering, substantially mitigate the risk of incorrect information generation. MongoDB's capabilities, especially in handling vector data and facilitating multifaceted searches, alongside the Unstructured library's data processing efficiency, emerge as robust solutions. These techniques not only improve accuracy but also enhance the traceability and reliability of LLM outputs, especially when dealing with sensitive topics, equipping users with the necessary tools to generate more precise and contextually accurate outputs from LLMs.
Ready to get started? Request your Unstructured API key today and unlock the power of Unstructured API and Connectors. Join the Unstructured community group to connect with other users, ask questions, share your experiences, and get the latest updates. We can’t wait to see what you’ll build.