Implementing Robust RAG Pipelines: Integrating Google's Gemma 2 (2B) Open Model, MongoDB, and LLM Evaluation Techniques
During the early days of the generative AI era, a period defined by the emergence of large language models (LLMs) capable of advanced response synthesis, reasoning, and planning capabilities, the observable trend was an increase in the number of parameters within newly released models. GPT-3 was released with 175 billion parameters in 2020, marking a significant leap from its predecessor, GPT-2, which had 1.5 billion parameters. This trend continued with even larger models like GPT-4, estimated to have over a trillion parameters, although OpenAI has yet to disclose the exact number publicly.
However, a counteracting effort was also under way to reduce the size of these language models while preserving the capabilities and output quality of their larger counterparts. Google is a player on both sides of the model-sizing effort. Gemini is Google’s multimodal LLM, released in December 2023; several versions have since been released, offering the emergent abilities of large language models that span billions of parameters, along with extensive context windows for model inputs.
However, this tutorial's key focus is Gemma 2, especially the variant with just two billion parameters.
As Google described, Gemma “is a family of lightweight, state-of-the-art open models” built with the same building blocks used to create the Gemini models. Small language and open models have their place in various use cases that call for efficiency, cost-effectiveness, privacy, and latency.
Below are some examples of use cases where Gemma models will be viable for developers.
| Industry | Use case | Description | Key benefit |
| --- | --- | --- | --- |
| Finance | Personalized budgeting assistant | A mobile app feature that analyzes spending patterns and offers tailored financial advice to users | A smaller model like Gemma 2B is perfect here: it's lightweight enough to run on a smartphone, respects user privacy by processing data locally, and can provide quick, contextual responses without sending sensitive financial data to a cloud server. |
| Healthcare | Medical symptom checker | A web application for initial patient triage, understanding user-described symptoms and suggesting potential causes or urgency levels | Gemma 2B can be fine-tuned on medical datasets to provide quick, accurate initial assessments without expensive infrastructure or compromising patient data privacy. |
| Customer service | Intelligent chatbot | An advanced customer service chatbot for a SaaS company that understands context and responds more naturally to customer queries | The model's small variants (2B and 9B) make it easy to deploy and update, allowing quick iterations based on user feedback and new product features. |
This tutorial covers building an asset management analyst assistant using Gemma 2. The assistant can analyze market reports stored in a MongoDB database based on user-provided queries, effectively augmenting some of an analyst's responsibilities.
The tutorial also introduces the topic of LLM evaluation, specifically focusing on assessing key components in a typical retrieval-augmented generation (RAG) pipeline.
What’s covered in this tutorial
- Building an asset management analyst assistant with Gemma 2
- Implementing a RAG pipeline with MongoDB
- Vector search and semantic retrieval techniques
- Generating responses using the Gemma 2 (2B) model
- LLM evaluation for RAG components
All code and implementation steps in this tutorial can be found on GitHub.
The first step is to install all the libraries required to provide the functionalities to build the RAG pipeline components and conduct LLM evaluation for the retrieval and generation components of the RAG pipeline.
- PyMongo: Facilitates interaction with MongoDB databases
- Pandas: Provides powerful data manipulation and analysis capabilities
- Hugging Face (datasets): Offers access to a vast collection of pre-processed datasets
- Hugging Face (Sentence Transformers): Provides access to sentence, text, and image embeddings
- DeepEval: Provides metrics for assessing LLM performance across various dimensions
Execute the following command to install and upgrade the required libraries:
```shell
pip install --upgrade --quiet datasets pandas pymongo sentence_transformers deepeval
```
Setting up environment variables is crucial to ensuring secure access to external services. This practice enhances security by keeping sensitive information out of your codebase. Ensure you have a Hugging Face token (HF_TOKEN) and an OpenAI API key in your development environment before running the code below.
```python
import os
import getpass

# Make sure you have a Hugging Face token (HF_TOKEN) in your development environment before running the code below
# How to get a token: https://huggingface.co/docs/hub/en/security-tokens
os.environ["HF_TOKEN"] = getpass.getpass("Enter your Hugging Face token: ")
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")  # Needed for the evaluation step
```
This step loads a financial dataset from Hugging Face into the current development environment. The data preparation process then creates a new attribute for each data point by combining several existing attributes. This combined attribute is prepared for the subsequent embedding generation step.
Data loading and preparation are standard steps in most data-related tasks and processing pipelines. Data loading refers to obtaining data from an external source and importing it into the current development environment for further downstream processing. Data preparation refers to organizing a dataset's content in the format required by downstream processes. In this tutorial, data is loaded from the MongoDB Hugging Face organization.
The dataset is designed for financial analysis and market research in the tech sector. It provides a comprehensive view of each company, including news, financial metrics, and analyst reports. It could be used for tasks such as sentiment analysis, market trend analysis, or as part of a financial recommendation system.
```python
import pandas as pd
from datasets import load_dataset

# Make sure you have a Hugging Face token (HF_TOKEN) in your development environment before running the code below
# How to get a token: https://huggingface.co/docs/hub/en/security-tokens
# https://huggingface.co/datasets/MongoDB/fake_tech_companies_market_reports
dataset = load_dataset("MongoDB/fake_tech_companies_market_reports", split="train", streaming=True)
dataset_df = dataset.take(100)

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset_df)
dataset_df.head(5)
```
Here are the steps taken in the code snippet above:
- The `load_dataset` function is used to fetch the "fake_tech_companies_market_reports" dataset from the MongoDB organization on Hugging Face.
- The "train" split is specified and is typically used for the main dataset in machine learning tasks.
- The `streaming=True` parameter is crucial here. It enables lazy dataset loading, allowing us to work with potentially very large datasets without loading everything into memory at once.
- `dataset_df = dataset.take(100)` uses the `take` method to extract the first 100 samples from the dataset.
- After extracting a manageable subset of the data, we use Pandas to create a DataFrame object for efficient data manipulation and visualization.
- We use Pandas' `head` method to understand the dataset's structure and content.
The operations above are steps taken to load data into the development environment and provide domain-specific data for the problem we are trying to solve. Specifically, we are creating a RAG-enabled chatbot that provides market information on tech-based companies for the asset management use case. The second half of this section focuses on data preparation.
A crucial part of the RAG pipeline process is data embedding, which creates a semantic representation of each data point that can be retrieved using information retrieval techniques such as vector search. Understanding which data attributes best capture the semantics, or in simpler terms, the meaning of a single data point, is crucial for building accurate and reliable RAG applications. Therefore, practitioners must carefully select and preprocess the most relevant features for embedding.
This involves:
- Conducting thorough exploratory data analysis to understand the dataset's structure and content.
- Identifying key attributes that best represent the semantic essence of each data point.
- Considering domain-specific knowledge to guide feature selection and engineering.
- Evaluating and fine-tuning the embedding process regularly to ensure it accurately captures the nuanced meanings and relationships within the data.
For our use case and this tutorial, the combination of the attributes report, company, and sector suffices to capture the semantic detail of a data point. The code snippet below selects these attributes and defines the function `combine_attributes` to return their concatenated string.

```python
# Data Preparation
def combine_attributes(row):
    combined = f"{row['company']} {row['sector']} "

    # Add report information
    for report in row['reports']:
        combined += f"{report['year']} {report['title']} {report['author']} {report['content']} "

    # Add recent news information
    for news in row['recent_news']:
        combined += f"{news['headline']} {news['summary']} "

    return combined.strip()
```
The next step applies `combine_attributes` to each data point in the dataset and stores the result in a new attribute, `combined_attributes`. This is the attribute that will be passed into the embedding model.

```python
# Add the new column 'combined_attributes'
dataset_df['combined_attributes'] = dataset_df.apply(combine_attributes, axis=1)
```
Be aware of the context window size of the embedding model you use. These models typically have a finite input capacity, often measured in tokens or characters. When input exceeds this threshold, the model may truncate the excess, potentially resulting in significant information loss. This truncation can severely impact the quality and relevance of the generated embeddings, compromising the overall performance of your RAG system.
If you have extensive data to embed, then it’s worth exploring varying chunking strategies. The standard advice for practitioners here is to implement chunking algorithms for longer documents that divide the text into semantically coherent segments, each fitting within the model's context window and considering overlapping chunks to maintain contextual continuity.
For this use case and tutorial, the chunking strategy is to split any data that extends beyond the embedding model's maximum input token length and attach the same metadata to each resulting chunk.
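The chunk-and-overlap idea can be illustrated with a plain list of tokens. The sketch below is a simplified stand-in for the tokenizer-aware implementation used later in this tutorial; `chunk_with_overlap` is a hypothetical helper, not part of any library:

```python
def chunk_with_overlap(tokens, max_length, overlap):
    """Split a token sequence into overlapping chunks.

    Each chunk holds at most `max_length` tokens, and consecutive
    chunks share `overlap` tokens to preserve context at boundaries.
    """
    step = max_length - overlap
    return [tokens[i:i + max_length] for i in range(0, len(tokens), step)]

tokens = [f"tok{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, max_length=4, overlap=1)

# Consecutive chunks share their boundary token
print(chunks[0][-1] == chunks[1][0])  # True
```

Because consecutive chunks share their boundary tokens, text cut at a chunk boundary still appears with some surrounding context in the next chunk.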
The GTE (General Text Embeddings) model series, developed by the Institute for Intelligent Computing, Alibaba Group, provides state-of-the-art performance in text embedding technology. For this tutorial, we focus on the GTE-Large English V1.5 model, which offers an optimal balance between performance and computational efficiency. This embedding model achieves a score of 65.39 on the Massive Text Embedding Benchmark (MTEB) for English, indicating strong overall embedding performance across a diverse set of NLP tasks and datasets. For this tutorial's use case, which involves information retrieval (finding relevant documents for an input query), the GTE-Large English V1.5 model performs well, with an average retrieval score of 57.91 across an evaluation of 15 datasets.
```python
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np

# https://huggingface.co/thenlper/gte-large
# embedding_model = SentenceTransformer('thenlper/gte-large')
embedding_model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

# Determine the maximum sequence length for the model
max_seq_length = embedding_model.max_seq_length


def chunk_text(text, tokenizer, max_length=8192, overlap=50):
    """
    Split the text into overlapping chunks based on token length.
    """
    tokens = tokenizer.tokenize(text)
    chunks = []
    for i in range(0, len(tokens), max_length - overlap):
        chunk_tokens = tokens[i:i + max_length]
        chunk = tokenizer.convert_tokens_to_string(chunk_tokens)
        chunks.append(chunk)
    return chunks


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Get the tokenizer from the model
    tokenizer = embedding_model.tokenizer

    # Split text into chunks if it's too long
    chunks = chunk_text(text, tokenizer, max_length=max_seq_length)

    if len(chunks) == 1:
        # If text fits in one chunk, embed as usual
        embedding = embedding_model.encode(text)
    else:
        # If text was split, embed each chunk and average the results
        chunk_embeddings = embedding_model.encode(chunks)
        embedding = np.mean(chunk_embeddings, axis=0)

    return embedding.tolist()


# Apply the embedding function with a progress bar
tqdm.pandas(desc="Generating embeddings")
dataset_df["embedding"] = dataset_df['combined_attributes'].progress_apply(get_embedding)
```
The code snippet above performs the following operations:
Chunk the input data intended to be passed into the embedding model if it exceeds a particular length:
- Utilize the `chunk_text` function to tokenize and split long texts.
- Check that each chunk is within the model's `max_seq_length` (8192 tokens for GTE-Large English V1.5).
- Implement an overlap of 50 tokens between chunks to maintain context continuity.
Pass the chunk(s) into the embedding model:
- For single chunks, directly encode using `embedding_model.encode(text)`.
- For multiple chunks, encode each chunk separately and average the resulting embeddings.
- Handle potential empty inputs, returning an empty list if the input text is empty.
Store the embedding data as a new attribute `embedding` for each data point:
- Use tqdm to display a progress bar for all data points during the embedding process.
- Add the resulting embeddings as a new `embedding` column in the `dataset_df` DataFrame.
In this tutorial and many RAG applications, MongoDB acts as an operational and a vector database. MongoDB Atlas specifically provides a database solution that efficiently stores, queries, and retrieves vector embeddings.
Creating a database and collection within MongoDB is made simple with MongoDB Atlas.
- Follow the instructions to deploy your first cluster. Select Atlas UI as the deployment procedure.
- Create the database: `asset_management_use_case`.
- Within the database `asset_management_use_case`, create the collection `market_reports`.
- Create a vector search index named `vector_index` for the `market_reports` collection. This index enables the RAG application to retrieve records as additional context to supplement user queries via vector search. Below is the JSON definition of the collection's vector search index.
Your vector search index created on MongoDB Atlas should look like the following:
```json
{
  "fields": [
    {
      "numDimensions": 1024,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
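The `"similarity": "cosine"` setting tells Atlas to rank candidate documents by cosine similarity, i.e., the cosine of the angle between the query embedding and each stored embedding. As a quick refresher, it can be computed as follows (a minimal sketch for intuition; Atlas performs this computation internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because cosine similarity depends only on direction and not magnitude, it is a common choice for comparing text embeddings.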
Follow MongoDB’s steps to get the connection string from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.
```python
os.environ["MONGO_URI"] = getpass.getpass("Enter your MongoDB URI: ")
```
```python
import pymongo


def get_mongo_client(mongo_uri):
    """Establish and validate connection to the MongoDB."""

    client = pymongo.MongoClient(mongo_uri, appname="devrel.showcase.rag.gemma_mongodb.python")

    # Validate the connection
    ping_result = client.admin.command('ping')
    if ping_result.get('ok') == 1.0:
        # Connection successful
        print("Connection to MongoDB successful")
        return client
    else:
        print("Connection to MongoDB failed")
        return None


MONGO_URI = os.environ["MONGO_URI"]

if not MONGO_URI:
    print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(MONGO_URI)

DB_NAME = "asset_management_use_case"
COLLECTION_NAME = "market_reports"

db = mongo_client.get_database(DB_NAME)
collection = db.get_collection(COLLECTION_NAME)
```
For this tutorial, to ensure we are working with a clean collection, you can run the code below to clear the collection of any existing data.
```python
# Delete any existing records in the collection
collection.delete_many({})
```
MongoDB's document-oriented structure offers several advantages for data storage and manipulation in Python applications. The core of this structure is the use of BSON (Binary JSON) for data storage, which aligns naturally with Python's dictionary data type. This alignment facilitates data representation using key-value pairs, making it intuitive for Python developers to work with MongoDB.
One of MongoDB's key features is its schema flexibility. Unlike traditional relational databases, MongoDB is schema-less, allowing each document in a collection to have a different structure. This flexibility is particularly advantageous in Python environments, as it complements Python's dynamic nature. Developers can ingest varied data structures without predefined schemas, offering greater data handling and storage adaptability.
Another significant benefit of working with Python is MongoDB's data ingestion efficiency. The close similarity between Python dictionaries and MongoDB documents enables direct data ingestion without complex transformations. This streamlined process results in faster data insertion and reduced processing overhead, making MongoDB an excellent choice for applications that require rapid data storage and retrieval in Python-based systems.
And that’s why the ingestion process within this tutorial is completed in one or two lines:
```python
documents = dataset_df.to_dict('records')
collection.insert_many(documents)
print("Data ingestion into MongoDB completed")
```
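The reason ingestion fits in two lines is pandas' `to_dict('records')` orientation, which turns each DataFrame row into a plain Python dictionary that PyMongo can insert directly as a BSON document. A small illustration with made-up rows:

```python
import pandas as pd

# Made-up rows standing in for the tutorial's dataset
df = pd.DataFrame({
    "company": ["TechCorp", "DataWorks"],
    "sector": ["Hardware", "Analytics"],
})

# 'records' orientation: one dictionary per row, ready for insert_many
documents = df.to_dict("records")
print(documents)
# [{'company': 'TechCorp', 'sector': 'Hardware'},
#  {'company': 'DataWorks', 'sector': 'Analytics'}]
```

No schema mapping or serialization layer is needed between the DataFrame and the collection.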
MongoDB's query language is designed to work well with document structures, making it easy to query and manipulate ingested data using familiar Python-like syntax. The query language is executed using MongoDB's aggregation pipeline, a powerful feature that allows for complex data processing and analysis within the database.
An aggregation pipeline can be thought of similarly to pipelines in data engineering or machine learning, where processes operate sequentially. Each stage takes an input, performs operations, and provides an output for the next stage.
Stages are the building blocks of an aggregation pipeline. Each stage represents a specific data transformation or analysis operation. Common stages include:
- `$match`: Filters documents (similar to WHERE in SQL)
- `$group`: Groups documents by specified fields
- `$sort`: Sorts the documents
- `$project`: Reshapes documents (select, rename, compute fields)
- `$limit`: Limits the number of documents
- `$unwind`: Deconstructs array fields
- `$lookup`: Performs left outer joins with other collections
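In PyMongo, a pipeline is simply an ordered list of stage documents, executed with `collection.aggregate(pipeline)`. The sketch below chains some of the stages above; the field values are illustrative, not taken from the tutorial's dataset:

```python
# An aggregation pipeline is an ordered list of stage documents.
# Field values below are illustrative.
pipeline = [
    # Keep only documents in the "Technology" sector (like WHERE in SQL)
    {"$match": {"sector": "Technology"}},
    # Count documents per company
    {"$group": {"_id": "$company", "doc_count": {"$sum": 1}}},
    # Sort by the computed count, descending
    {"$sort": {"doc_count": -1}},
    # Return at most five results
    {"$limit": 5},
]

# Each stage feeds its output to the next, in list order:
print([next(iter(stage)) for stage in pipeline])
# ['$match', '$group', '$sort', '$limit']
```

This list-of-dictionaries shape is exactly what the vector search function in the next section builds before calling `collection.aggregate()`.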
The code snippet below defines a `vector_search` function demonstrating semantic search implementation using MongoDB's vector search capabilities. At its core, the function leverages dense vector embeddings to find documents semantically similar to a user's query. It begins by converting the input query into a vector embedding, then utilizes MongoDB's `$vectorSearch` operator to search through a pre-indexed collection of document embeddings efficiently.

```python
def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
        user_query (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.

    Returns:
        list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search pipeline
    vector_search_stage = {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "numCandidates": 150,  # Number of candidate matches to consider
            "limit": 2  # Return top 2 matches
        }
    }

    unset_stage = {
        "$unset": "embedding"  # Exclude the 'embedding' field from the results
    }

    project_stage = {
        "$project": {
            "_id": 0,  # Exclude the _id field
            "company": 1,  # Include the company field
            "reports": 1,  # Include the reports field
            "combined_attributes": 1,  # Include the combined_attributes field
            "score": {
                "$meta": "vectorSearchScore"  # Include the search score
            }
        }
    }

    pipeline = [vector_search_stage, unset_stage, project_stage]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
```
The code snippet above does the following operations:
- Converts the user's text query into a vector embedding using the `get_embedding` function defined earlier
- Utilizes MongoDB's `$vectorSearch` operator to find semantically similar documents
- Searches the `vector_index` using the query embedding; ensure that the vector search index has been created on MongoDB Atlas and is referenced in the query by the name specified at creation
- Sets `numCandidates` to 150 for broad initial matching
- Limits the final results to the top two matches
- Uses an `$unset` stage to remove the `embedding` field from results, reducing data transfer
- Uses a `$project` stage to selectively include relevant fields: `company`, `reports`, and `combined_attributes`
- Adds a similarity score using `$meta: "vectorSearchScore"`
- Combines the stages into a single aggregation pipeline
- Executes the search using `collection.aggregate()`
- Returns the results as a list of documents
The next step after creating the information retrieval component of the RAG pipeline is to handle the result: formatting the retrieved documents so they are consumable by the LLM in the generation step of the RAG pipeline.
In the code snippet below, the `get_search_result` function serves as a wrapper for the vector search operation, transforming raw search results into a more user-friendly format. The function returns a formatted string containing summarized information from the top search results.

```python
def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ''
    for result in get_knowledge:
        search_result += f"Company: {result.get('company', 'N/A')}, Combined Attributes: {result.get('combined_attributes', 'N/A')}\n"

    return search_result
```
The next step is to define an input query for the RAG pipeline. This input query mirrors what chatbot interfaces expect: a prompt from the user that receives a response. For the use case in this tutorial, the code snippet below demonstrates the practical application of the semantic search functionality in a question-answering context.
```python
# Conduct query with the retrieval of sources
query = "What companies have negative market reports or negative sentiment that might deter from investment in the long term"
source_information = get_search_result(query, collection)
combined_information = f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."

print(combined_information)
```
This section of the tutorial demonstrates the initialization of Gemma 2 (2B), a two-billion-parameter open language model, and its associated tokenizer. This particular variant of the model is instruction-tuned, which means it has been specifically fine-tuned on a dataset of instructions and responses, enhancing its ability to understand and follow user prompts accurately.
The code snippet below uses the Hugging Face transformers library to load the tokenizer and model for Gemma 2 (2B), using the `AutoTokenizer` and `AutoModelForCausalLM` classes. The tokenizer converts text into tokens that can be passed as inputs to the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", torch_dtype=torch.bfloat16)
```
The code snippet below extracts the response from the Gemma 2 (2B) model.
```python
def extract_model_response(response):
    # Split the response at the start of the model's turn
    parts = response.split("<start_of_turn>model")
    # If there's a model response, it will be in the last part
    if len(parts) > 1:
        model_response = parts[-1].strip()

        # Remove any potential end-of-turn markers
        model_response = model_response.split("<end_of_turn>")[0].strip()

        return model_response
    else:
        return "No model response found."
```
After initializing the Gemma 2 (2B) model and its associated tokenizer, the next step is to generate a response using a chat-style input format. This tutorial leverages the model's chat template functionality to create a more conversational experience. The tokenizer's `apply_chat_template` method is utilized to properly format the input for the instruction-tuned model.

```python
chat = [
    {"role": "user", "content": combined_information},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150, do_sample=True, temperature=0.7)

response = tokenizer.decode(outputs[0])
```
The code snippet above performs the following operations:
- Creation of chat-style input: A single user message containing the `combined_information` is formatted as a chat-style input.
- Application of chat template: The tokenizer's `apply_chat_template` method formats this input according to Gemma's specific chat template. This process includes adding a generation prompt to guide the model's response.
- Tokenization and encoding: The formatted prompt is tokenized and encoded, preparing it for input into the model.
- Response generation: The model generates a response using specified parameters, such as `max_new_tokens` and `temperature`. These parameters control the length and randomness of the output.
- Decoding and extraction: Finally, the generated output is decoded and processed to extract the model's response, making it ready for downstream processes.
```python
model_output = extract_model_response(response)
print(model_output)
```
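The `temperature=0.7` passed to `model.generate` controls output randomness by rescaling the logits before sampling: values below 1 sharpen the token distribution toward the most likely token, while values above 1 flatten it. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, rescaled by temperature."""
    scaled = [x / temperature for x in logits]
    total = sum(math.exp(x) for x in scaled)
    return [math.exp(x) / total for x in scaled]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)

# Lower temperature concentrates probability on the top token
print(max(sharp) > max(flat))  # True
```

A temperature of 0.7, as used above, keeps the output mostly focused while still allowing some variation between generations.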
LLM evaluation, also referred to as LLM evals, is the systematic process of profiling foundation models or their fine-tuned variants to understand and capture their performance on specialized or general-purpose tasks, their reliability under certain conditions, their effectiveness in particular use cases, and other evaluative criteria that together give an overview of a model's overall ability.
The particular category of LLM evaluation performed in this tutorial is LLM system evaluation, defined as the end-to-end performance overview of a system or infrastructure that incorporates an LLM; examples of such systems are RAG pipelines and agentic systems.
Although the RAG pipeline has multiple components, this tutorial will only evaluate the generation component. The generation component involves using the retrieved relevant information from the database to produce a coherent and accurate response to the user's query. Specifically, it includes:
- Integrating the retrieved context (`source_information`) with the original query (`query`).
- Prompting the language model with this combined information (`combined_information`).
- Generating a response (`model_output`) that addresses the user's question or request.
- Ensuring the generated content is relevant, coherent, and faithful to the retrieved information. This is the focus of this section.
This component is crucial as it determines the final output quality of the RAG system, directly impacting user experience and the system's overall effectiveness. The evaluation metrics and factors for the generation component of a RAG pipeline are relevancy, faithfulness, coherence, and accuracy.
The LLM evaluation framework leveraged in this tutorial is DeepEval. DeepEval provides the evaluation metrics: answer relevance, faithfulness, hallucination, and others, as well as the ability to define custom metrics. This tutorial will utilize just two metrics: answer relevance and faithfulness.
Faithfulness: Faithfulness quantifies the extent to which the factual information in the generated text aligns with the retrieved documents or provided context. In the DeepEval library, faithfulness is measured by using an LLM to extract the claims made in the model's response to a user query and provided context, then using the same LLM to determine whether each extracted claim is supported by the retrieved context. The faithfulness score is therefore the number of supported claims divided by the total number of claims in the model's response.
faithfulness = (Number of truthful claims) / (Total number of claims in response)
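The arithmetic itself is straightforward once the judge LLM has extracted and verified the claims. The sketch below uses hypothetical claims; the extraction and verification steps are performed by DeepEval's evaluation model and are not shown here:

```python
def faithfulness_score(claims, backed_claims):
    """Fraction of extracted claims supported by the retrieved context."""
    if not claims:
        return 0.0
    return len(backed_claims) / len(claims)

# Hypothetical claims extracted from a model response
claims = [
    "CDDY made strategic acquisitions",
    "The acquisitions occurred in 2023",  # suppose the context says 2024
    "CDDY operates in the tech sector",
]
# Only two of the three claims are supported by the retrieved context
backed = [claims[0], claims[2]]

print(round(faithfulness_score(claims, backed), 2))  # 0.67
```

The real metric reported later in this section follows the same ratio, just with the claim checking delegated to GPT-4.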
To begin the evaluation step, ensure that the DeepEval Python library is installed in your development environment. Next, import the `LLMTestCase` and `FaithfulnessMetric` classes from the DeepEval library.
```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric
```
```python
actual_output = model_output

retrieval_context = [source_information]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
```
The code snippet above sets up the variables `actual_output` and `retrieval_context` to store the model's output and the source information used as context. A `FaithfulnessMetric` object is instantiated with a threshold of 0.7, the faithfulness boundary below which the evaluation output flags the test case. The evaluation uses GPT-4 as the judge model and is configured to include a reason for its assessment, following the LLM-as-a-judge approach. An `LLMTestCase` is constructed, encompassing the input query, the model's actual output, and the retrieval context. The metric's `measure` method is then invoked with the test case as an argument, and finally the calculated faithfulness score and the accompanying explanation are printed.

Below is an example of the output of the faithfulness evaluation process.
```
Event loop is already running. Applying nest_asyncio patch to allow async execution...
0.9333333333333333
The score is 0.93 because there is a minor discrepancy with the year of CDDY's strategic acquisitions. The actual output mistakenly indicates they occurred in 2023, while the context clearly states they happened in 2024.
```
The faithfulness score is 0.93, which can be interpreted as Gemma 2B's response being 93% faithful to the provided context. The accompanying explanation, generated by the evaluation LLM, gives the rationale for the score. This human-readable explanation feature of the DeepEval library enhances the interpretability of evaluation results by highlighting discrepancies or alignments between the model's output and the source context.
Answer relevance: This quantifies how well the generated response aligns with the initial input query. Answer relevance assesses the association between the response and the query without evaluating factual accuracy.
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

actual_output = model_output

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)

test_case = LLMTestCase(
    input=query,
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
```
The result of the answer relevance evaluation process is as follows:
```
Event loop is already running. Applying nest_asyncio patch to allow async execution...
1.0
The score is 1.00 because the response accurately selected a company from the provided information and justified why it's a safe long-term investment, addressing all aspects of the input.
```
In this section, you were introduced to LLM evaluation, specifically focusing on LLM system evaluation within the context of a RAG pipeline. You implemented two key evaluative methods for the generation component of a RAG pipeline: answer relevance and faithfulness, utilizing the DeepEval library.
By leveraging the DeepEval library, you could quantify these aspects of the model's performance and gain valuable insights into its strengths and areas for improvement. The library's ability to provide numerical scores and human-readable explanations enhances the interpretability of your evaluation results.
In this tutorial, you explored building an asset management analyst assistant using Google's Gemma 2 (2B) open model, a RAG pipeline with MongoDB, and LLM evaluation techniques. This showcase demonstrates the potential of open models in creating efficient, cost-effective AI solutions for specific use cases.
The implementation of Gemma 2 (2B), a lightweight two billion-parameter model, highlights the growing capabilities of open-source models in real-world applications and environments with limited compute availability. This allows developers to create AI-driven solutions that balance performance with resource efficiency.
MongoDB's operational and vector database role underscores its flexibility and scalability in modern AI applications and infrastructure. Its vector search capabilities and Python integration make it well-suited for RAG systems, enabling efficient information retrieval.
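As a concrete illustration of that retrieval role, the semantic search step can be expressed as an Atlas aggregation pipeline built around the `$vectorSearch` stage. The sketch below assumes an index named `vector_index` over an `embedding` field and a `text` field holding the document contents; these names are illustrative and should match your own collection:

```python
def build_vector_search_pipeline(query_vector, index_name="vector_index",
                                 path="embedding", limit=4):
    """Build an Atlas aggregation pipeline: $vectorSearch, then a projection."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,       # name of the Atlas vector search index
                "path": path,              # field holding the embeddings
                "queryVector": query_vector,
                "numCandidates": limit * 15,  # candidates considered before ranking
                "limit": limit,            # number of results returned
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,                 # return only the chunk text...
                "score": {"$meta": "vectorSearchScore"},  # ...and its similarity score
            }
        },
    ]

pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3])
# With a pymongo collection: results = collection.aggregate(pipeline)
```

Setting `numCandidates` to a multiple of `limit` is a common starting heuristic for balancing recall against latency; tune both to your data.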
The focus on LLM evaluation using the DeepEval library emphasizes the importance of assessing AI system performance. By implementing metrics like faithfulness and answer relevance, you gained insights into the model's strengths and areas for improvement.
Moving forward, you can explore more evaluation approaches or chunking strategies to optimize the performance and accuracy of RAG pipelines.
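As one example of a chunking strategy to experiment with, a fixed-size character splitter with overlap is a common baseline; the sizes below are illustrative defaults, not values prescribed by this tutorial:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlapping boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping `overlap` chars shared
    return chunks

# Small illustrative call: 4-character chunks sharing 1 character
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# -> ['abcd', 'defg', 'ghij', 'j']
```

The overlap preserves context that would otherwise be severed at chunk boundaries, at the cost of some duplicated content in the vector index.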
1. What is Gemma 2, and how does it differ from other language models?
Gemma 2 is a family of lightweight, state-of-the-art open models developed by Google. It's built using the same building blocks as the larger Gemini models but is designed to be more efficient and suitable for use cases that require lower computational resources. The Gemma 2 (2B) variant has two billion parameters and is particularly useful for applications that need efficiency, cost-effectiveness, privacy, and low latency.
2. How does MongoDB contribute to the RAG pipeline?
MongoDB serves as both an operational and vector database in this RAG application. It efficiently stores, queries, and retrieves vector embeddings. The tutorial demonstrates how to set up a MongoDB Atlas cluster, create a vector search index, and use MongoDB's aggregation pipeline for semantic search capabilities, which are crucial for the retrieval component of the RAG system.
3. What is LLM evaluation, and which metrics are used in this tutorial?
LLM evaluation, or LLM evals, is the systematic process of assessing language models' performance on various tasks and use cases. In this tutorial, the focus is on LLM system evaluation, specifically for a RAG pipeline. Two key metrics are used: faithfulness and answer relevance. Faithfulness measures how well the generated response aligns with the provided context, while answer relevance assesses how well the response addresses the initial query.
4. What is the purpose of the DeepEval library in this tutorial?
The DeepEval library is used to conduct LLM evaluations. It provides metrics such as FaithfulnessMetric and AnswerRelevancyMetric to assess the quality of the generated responses. The library allows for setting evaluation thresholds and using models like GPT-4. It includes features for providing human-readable explanations of the evaluation results, enhancing the interpretability of the assessment process.