RAG Evaluation: Detecting Hallucinations With Patronus AI and MongoDB
Han Heloir, Prakul Agarwal • 11 min read • Published Aug 15, 2024 • Updated Aug 15, 2024
One of the biggest pieces of feedback we hear from customers building generative AI applications is about reliability. In the past, we have partnered with Patronus AI to showcase how to build reliable AI systems with MongoDB Atlas Vector Search.
Today, we are excited to announce a partnership with Patronus AI on the launch of their state-of-the-art hallucination detection models: Lynx (8B and 70B).
This open-weights hallucination detection model outperforms all existing LLM-as-a-judge models, including GPT-4o. This release also introduces HaluBench, a comprehensive new benchmark designed to evaluate the faithfulness of retrieval-augmented generation (RAG) systems using real-world data. This tutorial provides details on how to use open-source models with MongoDB Atlas to evaluate the reliability of RAG systems.
You can find all the code presented in this tutorial on GitHub.
Patronus AI is a leading automated AI evaluation and security company. Its latest development, Lynx, is an open-source hallucination detection model designed to address the critical issue of hallucinations in large language models (LLMs), where models produce outputs that are coherent but not factually accurate or aligned with the input context.
Here are the key points about Lynx:
- Detects hallucinations, outperforming both open-source and closed-source models
- Operates in real-time, providing immediate feedback on AI response faithfulness
- Eliminates the need for manual annotations
- Enables developers to deploy reliable AI systems at scale
In conjunction with Lynx, Patronus AI has developed HaluBench, a new benchmark designed to evaluate the faithfulness of RAG systems. HaluBench sources data from various real-world domains, including finance, medical, and general knowledge, to comprehensively assess how well RAG architectures adhere to the information retrieved. This benchmark includes examples from several datasets, such as FinanceBench, DROP, COVID-QA, and PubMedQA, ensuring a wide range of scenarios for robust evaluation.
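If you want to explore HaluBench directly, it can be loaded with the Hugging Face datasets library. A minimal sketch, assuming the benchmark is published under the PatronusAI organization on the Hugging Face Hub (verify the dataset id and field names against the Patronus AI release):

```python
from datasets import load_dataset

# Load HaluBench (the dataset id is an assumption; check the release notes)
halubench = load_dataset("PatronusAI/HaluBench", split="test")

# Each example pairs a question and source passage with an answer and a
# faithfulness label, ready to feed into an LLM-as-a-judge evaluation
print(halubench[0])
```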
The following graph shows the accuracy of different LLMs on HaluBench from the Patronus AI paper.
Figure 1: Benchmark result from Patronus AI
The table compares the accuracy of various LLMs on HaluBench, including GPT-4, Claude, Mistral Instruct, Llama-3-Chat, and Lynx, across datasets such as HaluEval, DROP, CovidQA, and PubMedQA. Larger models, especially Lynx 70B, achieve higher accuracy, underscoring the benefits of scale and specialized training for detecting hallucinations and building reliable RAG systems.
The introduction of Lynx and HaluBench provides essential tools for effectively detecting and mitigating hallucinations. When integrated with MongoDB, these tools enhance the development of robust and scalable generative AI applications. MongoDB's real-time data processing, flexible schema, advanced querying capabilities, and robust security measures ensure the reliability and accuracy of AI-generated content.
MongoDB Atlas is a unique, fully managed platform that integrates enterprise data, vector search, and analytics, enabling the creation of tailored AI applications. Building robust, scalable technology foundations becomes crucial as enterprises rush to harness the transformative power of generative AI. MongoDB addresses several key challenges in this fast-evolving field, making it an ideal choice for RAG applications.
Responsive in real time: MongoDB's architecture is built for real-time data interaction, which is crucial for generative AI systems. Its responsiveness ensures that answers are delivered to customers in seconds or even milliseconds, maintaining user engagement and satisfaction.
Agile data handling: MongoDB's agile nature and flexible data model support rapid application iteration, allowing organizations to add new data types and features swiftly. This flexibility facilitates continuous improvement and adaptation of applications, ensuring they remain competitive and relevant.
Intelligent data integration and queries: MongoDB's key vector search capabilities enhance the functionality of generative AI applications by supporting semantic search. In addition, MongoDB provides advanced query capabilities, such as full-text search, geospatial queries, and vectorized queries, which are essential for efficiently retrieving relevant data from large, diverse datasets.
Securing sensitive data: MongoDB strongly emphasizes security, especially when managing sensitive data across customer engagement platforms. It employs robust security controls to protect data against unauthorized access and to ensure the integrity and confidentiality of client data. MongoDB's comprehensive security features, including encryption at rest and in transit, role-based access control, and auditing, help fortify data against breaches and unauthorized access.
Everywhere: MongoDB is designed to integrate effortlessly with any cloud provider, large language model (LLM), and framework. This flexibility is crucial for supporting the dynamic needs of generative AI applications across various industries and use cases.
Retrieval-augmented generation (RAG) systems combine retrieval-based and generation-based models to produce more accurate and contextually relevant responses. However, hallucinations — where AI generates coherent but inaccurate information — pose a significant challenge, undermining reliability and trust. Robust evaluation ensures accuracy, improves reliability, and enhances user trust.
HaluBench, developed by Patronus AI, provides a comprehensive benchmark for evaluating the faithfulness of RAG systems using real-world data. Lynx enhances RAG systems by:
- Real-time hallucination detection: Provides immediate feedback on AI responses, allowing rapid correction of inaccuracies
- Automated evaluation: Speeds up the evaluation process by eliminating the need for manual annotations, ensuring consistent assessments
- Seamless integration: Continuously monitors and evaluates content within existing RAG systems, maintaining accuracy and reliability
- Benchmarking with HaluBench: When used together, Lynx and HaluBench offer a robust framework for testing RAG systems across diverse real-world scenarios, enhancing their robustness and applicability
Figure 2: Typical RAG evaluation process
The diagram above illustrates a typical RAG evaluation process involving real-time generative AI applications, an LLM for RAG generation, MongoDB Atlas for vector search, and Patronus AI's Lynx as the judge.
- Real-time generative AI apps: User questions are fed into the real-time generative AI RAG application.
- MongoDB Atlas: MongoDB Atlas manages vector search and processes the user question to find relevant documents.
- LLM used in RAG generation: The application retrieves relevant documents and uses them to formulate a prompt. The LLM then processes the user question together with the retrieved documents to generate a response.
- LLM as judge (Lynx): Inputs include the retrieved documents, user question, and the LLM response. Lynx evaluates the faithfulness of the response and outputs a binary score indicating whether the response is faithful or not.
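Before walking through setup, here is a minimal sketch of how these four steps compose in code. It is a sketch under stated assumptions: the connection string, index, and field names are placeholders, and both the RAG LLM and the Lynx judge are assumed to sit behind OpenAI-compatible endpoints (as they will once served with vLLM later in this tutorial).

```python
# A minimal sketch of the Figure 2 flow; names and endpoints are placeholders.
import pymongo
import requests

client = pymongo.MongoClient("<connection-string>")

def retrieve(query_vector, limit=3):
    # Steps 1-2: MongoDB Atlas vector search returns relevant documents.
    pipeline = [{
        "$vectorSearch": {
            "index": "<index-name>",
            "path": "<embedding-field>",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": limit,
        }
    }]
    return list(client["<db-name>"]["<collection-name>"].aggregate(pipeline))

def chat(base_url, model, prompt):
    # Helper for any OpenAI-compatible chat endpoint (RAG LLM or Lynx judge).
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    resp = requests.post(f"{base_url}/v1/chat/completions", json=body)
    return resp.json()["choices"][0]["message"]["content"]

# Step 3: generate an answer grounded in the retrieved documents, then
# Step 4: send question, documents, and answer to Lynx for a PASS/FAIL verdict.
```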
To query and retrieve results from the MongoDB Atlas vector store, follow these three steps:
- Set up the database on Atlas. First, you need to create an account on MongoDB Atlas. This involves signing in to your MongoDB Atlas account, creating a new cluster, and adding a database and collection. You can skip this step if you already have your collection for vector search.
- Create an Atlas index. You can create an index either via code or using the Atlas UI. Here’s an example of how to create an index using the Atlas UI:
- Navigate to your collection.
- Click “Atlas Search” and then “Create Index.”
- Define the index fields and type.
Alternatively, you can create an index programmatically. The following index definition indexes the vector embeddings field (fieldToIndex) for performing vector search.
```python
import pymongo
from pymongo.mongo_client import MongoClient
from pymongo.operations import SearchIndexModel

# Connect to your Atlas deployment
uri = "<connectionString>"
client = MongoClient(uri)

# Access your database and collection
database = client["<databaseName>"]
collection = database["<collectionName>"]

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "numDimensions": <numberOfDimensions>,
                "path": "<fieldToIndex>",
                "similarity": "euclidean | cosine | dotProduct"
            },
            {
                "type": "filter",
                "path": "<fieldToIndex>"
            },
            ...
        ]
    },
    name="<index name>",
    type="vectorSearch",
)

result = collection.create_search_index(model=search_index_model)
print(result)
```
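Newly created indexes take a short time to build before they are queryable. A minimal polling sketch, assuming PyMongo 4.5+ (where list_search_indexes is available and the returned documents include a queryable flag), reusing the collection object from the snippet above:

```python
import time

# Poll until the search index reports that it is queryable
index_name = "<index name>"
while True:
    indexes = list(collection.list_search_indexes(index_name))
    if indexes and indexes[0].get("queryable"):
        break
    time.sleep(5)
print(f"Index {index_name} is ready for querying.")
```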
Note: create_search_index only works on Atlas clusters M10+. Use the Atlas UI for M0-M5 clusters.
- Perform a vector query with code. After setting up your database and creating the necessary indexes, you can perform a vector query. Below is sample Python code using the PyMongo library:
```python
import pymongo

# Connect to your Atlas cluster
client = pymongo.MongoClient("<connection-string>")

# Define the vector search pipeline
pipeline = [
    {
        '$vectorSearch': {
            'index': "<index-name>",
            'path': "<field-to-search>",
            'queryVector': [<array-of-numbers>],
            'numCandidates': <number-of-candidates>,
            'limit': <number-of-results>
        }
    }
]

# Run the pipeline
result = client["<db-name>"]["<collection-name>"].aggregate(pipeline)

# Print the results
for doc in result:
    print(doc)
```
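Note that the queryVector must be produced by the same embedding model used to embed the indexed documents, and its length must match the numDimensions of the index. A hedged sketch using sentence-transformers (the model choice, all-MiniLM-L6-v2 with 384-dimensional output, is an illustrative assumption, not something this tutorial prescribes):

```python
from sentence_transformers import SentenceTransformer

# Embed the user question with the same model used at indexing time
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors
query_vector = model.encode("What is a hallucination in LLMs?").tolist()

# Plug the embedding into the $vectorSearch stage from the pipeline above
pipeline[0]['$vectorSearch']['queryVector'] = query_vector
```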
Patronus provides two versions of the Lynx model:
- Using the large model: The Lynx-70B model requires significant computational resources. Ensure you have enough memory and computing power to handle it; you will need an instance with an A100 or H100 GPU. We will use vLLM to run the model on the GPU and expose an endpoint for our evaluation process.
- Using the smaller model: If the 70B model is too large for your use case, consider using a smaller variant of the Lynx model (Lynx-8B-Instruct). This can be executed on a local system using Ollama. You can find different sizes on the Hugging Face model hub under the PatronusAI repository.
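A hedged sketch of the Ollama route, assuming you have downloaded a GGUF build of Lynx-8B-Instruct (the file name below is hypothetical; use whichever quantized build you obtain from the PatronusAI repositories):

```bash
# Create a Modelfile pointing at the downloaded GGUF file (name is hypothetical)
echo "FROM ./patronus-lynx-8b-instruct.Q4_K_M.gguf" > Modelfile

# Register the model with Ollama, then run it locally
ollama create lynx-8b -f Modelfile
ollama run lynx-8b
```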
Follow these steps to download the Lynx model from Hugging Face. This involves setting up the environment and the Hugging Face CLI, logging into Hugging Face, and then downloading the model.
Step 1: Install the Hugging Face Hub CLI
First, you need to install the Hugging Face Hub CLI. This lets you interact directly with Hugging Face’s model hub from the command line.
```bash
pip3 install -U huggingface_hub[cli]
```
Step 2: Log in to Hugging Face
Next, log in to your Hugging Face account. If you don’t have an account, you’ll need to create one at Hugging Face.
```bash
huggingface-cli login
```
You will be prompted to enter your Hugging Face token. You can find your token in your Hugging Face account settings under “Access Tokens.”
Step 3: Download the Lynx Model
After logging in, you can download the Lynx model. Here is an example command to download the 8B variant of the Lynx model:
```bash
huggingface-cli download PatronusAI/Patronus-Lynx-8B-Instruct --local-dir Patronus_8B
```
This command will download the model to a local directory named Patronus_8B.
Next, serve the model with vLLM. Once the inference server is running, it exposes an OpenAI-compatible endpoint (for instance, http://localhost:5123/) that you can use to send evaluation requests.
Step 1: Create a new conda environment and install vllm
Creating a dedicated conda environment helps manage dependencies and avoid conflicts. Here’s how to set up a new environment with Python 3.10:
```bash
# (Recommended) Create a new conda environment.
conda create -n myenv python=3.10 -y
# Activate the newly created environment.
conda activate myenv
```
Install vLLM, a library designed to efficiently serve large language models. If you have CUDA 12.1 installed, you can install vLLM with CUDA support for better performance.
```bash
# Install vLLM with CUDA 12.1.
pip install vllm
```
Step 2: Run the Lynx model on a server
Once vLLM is installed, you can start the server to host the Lynx model. This involves specifying the port, model, and tokenizer. The following command runs the model on port 5123:
```bash
python -m vllm.entrypoints.openai.api_server --port 5123 --model PatronusAI/Patronus-Lynx-8B-Instruct --tokenizer meta-llama/Meta-Llama-3-8B
```
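If you prefer to serve the copy you downloaded earlier instead of pulling from the Hub again, vLLM's --model flag also accepts a local path; a hedged example using the Patronus_8B directory from the download step:

```bash
# Serve the locally downloaded weights (directory from the earlier download step)
python -m vllm.entrypoints.openai.api_server --port 5123 --model ./Patronus_8B --tokenizer meta-llama/Meta-Llama-3-8B
```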
With the vLLM inference server running on http://localhost:5123/, you can use the OpenAI API specification to send requests and evaluate the faithfulness of AI-generated responses. This section covers sending a cURL request to test the server and implementing a structured prompt template for hallucination detection.
Step 1: Test the Server with a cURL Request
Verify that the server is working by sending a cURL request. This request queries the model to define what a hallucination is:
```bash
curl http://localhost:5123/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "PatronusAI/Patronus-Lynx-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is a hallucination?"}
    ]
  }'
```
Step 2: Define the prompt template
Use a structured prompt template to evaluate the faithfulness of AI-generated responses. The template helps ensure that the answer is faithful to the document provided and does not contain hallucinations.
```
# Prompt template
Given the following QUESTION, DOCUMENT and ANSWER you must analyze the provided answer and determine whether it is faithful to the contents of the DOCUMENT.

The ANSWER must not offer new information beyond the context provided in the DOCUMENT.

The ANSWER also must not contradict information provided in the DOCUMENT.

Output your final score by strictly following this format: "PASS" if the answer is faithful to the DOCUMENT and "FAIL" if the answer is not faithful to the DOCUMENT.

Show your reasoning.

--
QUESTION (THIS DOES NOT COUNT AS BACKGROUND INFORMATION):
{{ user_input }}

--
DOCUMENT:
{{ provided_context }}

--
ANSWER:
{{ bot_response }}

--

Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE".

Ensure that the JSON is valid and properly formatted.

{"REASONING": ["<your reasoning as bullet points>"], "SCORE": "<final score>"}
```
Step 3: Implement the evaluation function
Use Python to send a structured request to the local Lynx API, including the question, document, and answer. The following code demonstrates how to format the request and handle the response:
```python
import requests
import json

# Define the endpoint URL
url = "http://localhost:5123/v1/chat/completions"

# Define the prompt template. The literal braces in the JSON example at
# the end are doubled ({{ }}) so that str.format() leaves them intact.
prompt_template = """
Given the following QUESTION, DOCUMENT and ANSWER you must analyze the provided answer and determine whether it is faithful to the contents of the DOCUMENT.

The ANSWER must not offer new information beyond the context provided in the DOCUMENT.

The ANSWER also must not contradict information provided in the DOCUMENT.

Output your final score by strictly following this format: "PASS" if the answer is faithful to the DOCUMENT and "FAIL" if the answer is not faithful to the DOCUMENT.

Show your reasoning.

--
QUESTION (THIS DOES NOT COUNT AS BACKGROUND INFORMATION):
{user_input}

--
DOCUMENT:
{provided_context}

--
ANSWER:
{bot_response}

--

Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE".

Ensure that the JSON is valid and properly formatted.

{{"REASONING": ["<your reasoning as bullet points>"], "SCORE": "<final score>"}}
"""

# Define the input data
data = {
    "model": "PatronusAI/Patronus-Lynx-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are an AI that evaluates the faithfulness of responses."},
        {"role": "user", "content": prompt_template.format(
            user_input="What is the capital of France?",
            provided_context="France is a country in Europe. The capital city of France is Paris.",
            bot_response="The capital of France is Lyon."
        )}
    ]
}

# Send the request
response = requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(data))

# The model's reply is a JSON string; parse it to extract reasoning and score
response_json = response.json()
content = response_json.get("choices", [{}])[0].get("message", {}).get("content", "")
parsed = json.loads(content)  # assumes the model returned valid JSON, as instructed

reasoning = parsed.get("REASONING", [])
score = parsed.get("SCORE", "")

# Print the results
print("Reasoning:", reasoning)
print("Score:", score)
```
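Since vLLM exposes an OpenAI-compatible API, you can also make the same request with the official openai Python client. A minimal sketch reusing prompt_template from the snippet above (the api_key value is a dummy, as vLLM does not check it by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder
client = OpenAI(base_url="http://localhost:5123/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="PatronusAI/Patronus-Lynx-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are an AI that evaluates the faithfulness of responses."},
        {"role": "user", "content": prompt_template.format(
            user_input="What is the capital of France?",
            provided_context="France is a country in Europe. The capital city of France is Paris.",
            bot_response="The capital of France is Lyon."
        )},
    ],
)
print(completion.choices[0].message.content)
```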
Patronus AI's Lynx and HaluBench significantly advance hallucination detection and RAG system evaluation. Lynx provides real-time, automated feedback, while HaluBench offers a robust benchmark using real-world data. MongoDB Atlas enhances these tools with real-time data processing, agile handling, advanced querying, and robust security. Its seamless integration with any cloud provider, LLM, and framework makes it ideal for generative AI RAG applications.
Lynx, HaluBench, and MongoDB enable the development of reliable, accurate, and scalable RAG AI applications, driving continuous improvement and innovation in generative AI.
Frequently asked questions
1. What are Lynx and HaluBench, and why are they important for RAG systems?
Lynx is an open-source hallucination detection model developed by Patronus AI. It detects hallucinations in real time, outperforming both open-source and closed-source models. HaluBench is a comprehensive benchmark for evaluating the faithfulness of retrieval-augmented generation (RAG) systems using real-world data. Together, they provide a robust framework for testing and improving RAG systems, enabling developers to deploy more reliable AI applications at scale.
2. Why use MongoDB Atlas for RAG applications?
MongoDB Atlas is ideal for RAG applications because it:
- Provides real-time data processing and responsiveness.
- Offers flexible data handling for rapid application iteration.
- Supports vector search and advanced query capabilities.
- Ensures robust security for sensitive data.
- Integrates easily with various cloud providers, LLMs, and frameworks.
3. How can I set up and deploy the Lynx model for evaluating AI-generated responses?
To set up and deploy the Lynx model:
- Download the model from Hugging Face.
- Create a new conda environment and install vLLM.
- Run the Lynx model on a server using vLLM.
- Use the OpenAI API specification to send requests and evaluate AI-generated responses.
- Note that the Lynx-70B model requires significant computational resources (typically an A100 or H100 GPU), while the smaller Lynx-8B-Instruct model can be run on a local system using Ollama.
4. How does this system improve the reliability of generative AI applications?
This system improves reliability by:
- Detecting and mitigating hallucinations in real time.
- Providing immediate feedback on AI response faithfulness.
- Offering comprehensive benchmarking with real-world data.
- Enabling continuous monitoring and evaluation of content in RAG systems.
- Eliminating the need for manual annotations in the evaluation process.