Foundational Guide to LLM Fine-Tuning

Out of the box, pre-trained large language models (LLMs) are generic. If you’re going to use one in a production-grade application, you’ll often need to fine-tune it to capture your specific domain, tasks, or style. In this article, we’ll discuss fine-tuning techniques and how MongoDB Atlas and VoyageAI work together to build contextually aware fine-tuned systems.

Key takeaways

While many large language models (LLMs) have made strides in contextual, domain-specific knowledge, some out-of-the-box solutions may not provide enough value without fine-tuning.
With fine-tuning, an LLM can be trained to work with specific tasks, styles, or domains.
There are several techniques for fine-tuning, including supervised fine-tuning and reinforcement fine-tuning, each of which modifies parameters differently.
VoyageAI plays a critical role in creating context-rich vector embeddings for various unstructured inputs, which may make the fine-tuning process easier.
It’s important to follow the best practices to get the most out of a fine-tuned model.

Table of contents

What is LLM fine-tuning?
Why fine-tune a pre-trained model?
RAG vs. fine-tuning
Generic answer vs. fine-tuned answer
Fine-tuning techniques
How to fine-tune an LLM
Example application (task) using MongoDB Atlas and VoyageAI
Fine-tuning best practices
Conclusion
Glossary
FAQs
Related resources

What is LLM fine-tuning?

Large language models (LLMs) usually deliver responses based on the broad content they were trained on. If you need answers that cater to your specific task, domain, or brand, the performance of an out-of-the-box LLM might be suboptimal. Fine-tuning addresses this by continuing to train a pre-trained model on a smaller, task-specific dataset so that its parameters adapt to that task. To customize a model’s output, developers typically start with prompt engineering or retrieval-augmented generation (RAG), where the system pulls in external facts to provide context. If the system still struggles with reasoning or the required output format after these two steps, fine-tuning can improve it further.

Why fine-tune a pre-trained model?

A common reason to fine-tune a model is to adapt it to the topics or fields your business works in. A smaller fine-tuned model can outperform a larger base model on context-specific responses because it has been adapted to the specific patterns and requirements of the target domain.

Proprietary information, style, and bias

Another reason to fine-tune an LLM is to train it on confidential information. You can fine-tune an LLM to cater to specific workflows, brand voice, and basic proprietary information. Rather than building your own LLM from scratch, you build on the capabilities of an existing model, and if you run the fine-tuning in an environment you control, your data stays private. Fine-tuning on carefully curated data can also help remediate bias.

Some important use cases for fine-tuning an LLM are:

Sentiment analysis.
Domain-specific question answering.
Named entity recognition.
Social media post generation.

RAG vs. fine-tuning

RAG extends the knowledge available to an LLM by connecting it to external information sources. RAG can complement a fine-tuned model by retrieving the latest information dynamically, while fine-tuning adapts the model’s parameters to a domain-specific or task-specific dataset. You can also use RAG to supply information that is missing from the fine-tuning dataset, which helps correct responses when the fine-tuned model’s knowledge alone is insufficient.

The diagram below shows an example of how RAG and fine-tuning can be used together.

How RAG and fine-tuning work together to generate the most relevant result

Generic answer vs. fine-tuned answer

The diagram below illustrates the difference between a generic answer and a fine-tuned answer to this query: “Explain MongoDB sharding for a beginner.”

A generic LLM with broad language understanding answers in a general way: “Sharding in MongoDB is a way of distributing data across multiple servers to handle large datasets.”

A fine-tuned LLM adapts to a unique style with analogy that a beginner will understand: “Imagine you have a library (dataset) so big that one building (server) can’t hold all the books. MongoDB solves this by creating several smaller libraries (shards), each storing part of the collection. When you search for a book, MongoDB knows exactly which library to check, so you don’t waste time looking everywhere.”

Example illustration of response of a fine-tuned LLM

Fine-tuning techniques

Fine-tuning techniques adapt LLMs through supervised learning, parameter-efficient fine-tuning (PEFT), and reinforcement learning. Supervised fine-tuning and PEFT learn from labeled data, while reinforcement learning optimizes the model against a reward signal, which helps align outputs with human preferences and desired behaviors.

Some standard fine-tuning methods are listed below.

Supervised fine-tuning

Supervised fine-tuning trains a pre-trained model on custom, labeled data so that it can perform specialized tasks. For example, a model for a medical system, built on top of a base model such as GPT or Llama, is trained on a medical knowledge base to better understand specific user questions. To accomplish this, it’s trained on datasets relevant to the questions medical system users ask, such as “What are the symptoms of glaucoma?” or “How is chronic myelogenous leukemia diagnosed?”, each paired with a vetted answer.

Different fine-tuning methods update the model parameters (weights) in different ways. The following sections describe each of them.
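To make "learning from labeled data" concrete, here is a minimal sketch of the loss that supervised fine-tuning typically minimizes: the cross-entropy between the model's predicted token scores and the labeled target tokens. The tiny vocabulary, the scores, and the function name are illustrative assumptions, not part of any specific training library.

```python
import math

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the labeled tokens.

    logits: one list of raw vocabulary scores per target position
    labels: the index of the correct token at each position
    """
    total = 0.0
    for scores, label in zip(logits, labels):
        # Numerically stable log-softmax: subtract the max before exponentiating.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += -(scores[label] - log_norm)
    return total / len(labels)

# Toy setup (assumed): 3 target tokens and a vocabulary of 4 tokens.
labels = [2, 0, 3]
untrained = [[0.0] * 4 for _ in labels]   # uniform guess over the vocabulary
confident = [[5.0 if k == label else -5.0 for k in range(4)] for label in labels]

loss_untrained = cross_entropy(untrained, labels)
loss_confident = cross_entropy(confident, labels)
print(loss_untrained, loss_confident)
```

Training lowers this loss by nudging the weights so that the labeled tokens receive higher scores. A uniform guess over four tokens gives a loss of ln 4, and the loss shrinks as the model grows confident in the labeled answers.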

Full fine-tuning

Full fine-tuning starts from a pre-trained model and updates all of its parameters. Although it can yield the most accurate results, it is very costly: an LLM can have billions of parameters across its layers, and every one of them is updated.
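A rough back-of-the-envelope sketch shows where the cost comes from. It assumes 32-bit values and the Adam optimizer, which keeps two extra buffers per parameter, and it ignores activations and batch data, so treat the result as a lower bound. The model sizes are illustrative assumptions.

```python
def full_finetune_memory_gb(num_params: int, bytes_per_value: int = 4) -> float:
    """Rough lower bound on training memory for full fine-tuning.

    Counts four values per parameter: the weight, its gradient, and the two
    Adam moment buffers. Activations are deliberately left out.
    """
    copies = 4
    return num_params * bytes_per_value * copies / 1e9

# Illustrative model sizes (assumptions, not tied to any specific model).
for name, num_params in [("1B", 1_000_000_000), ("7B", 7_000_000_000)]:
    print(name, full_finetune_memory_gb(num_params), "GB")
```

Under these assumptions, every parameter costs 16 bytes during training, which is why parameter-efficient methods that train only a small fraction of the weights are attractive.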

Instruction fine-tuning

Instruction fine-tuning makes the model better at responding to queries. You build a training set of instructions, written as prompts, paired with the expected outputs. The model learns to follow instructions rather than simply continue the text it is given. For example, an instruction such as “Summarize this paragraph in one sentence” is paired with the desired summary.
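A minimal sketch of preparing instruction data might look like this. The prompt template, the field names `prompt` and `completion`, and the example pairs are assumptions for illustration; real training frameworks each define their own expected format.

```python
import json

# Hypothetical template; use whatever format your base model and trainer expect.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"

def to_training_example(instruction: str, response: str) -> dict:
    """Pair a rendered instruction prompt with its expected output."""
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction),
        "completion": response,
    }

pairs = [
    ("Summarize in one sentence: MongoDB sharding distributes data across "
     "multiple servers to handle large datasets.",
     "Sharding spreads a large dataset across several servers."),
    ("Classify the sentiment of this review: 'The setup was painless.'",
     "positive"),
]

examples = [to_training_example(instruction, response) for instruction, response in pairs]

# One JSON object per line (JSONL) is a common on-disk layout for such datasets.
jsonl = "\n".join(json.dumps(example) for example in examples)
print(jsonl)
```

Keeping the template identical across training and inference matters, because the model learns to respond to that exact framing.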

Parameter efficient fine-tuning (PEFT)

In this technique, only a small number of parameters are trained to fine-tune a model. These are either a subset of the existing weights or small task-specific modules added to the model, while the rest of the pre-trained weights stay frozen.

Two widely used PEFT techniques are:

LoRA (low-rank adaptation)

Rather than updating all the weights of the pre-trained model, LoRA freezes them and inserts small low-rank adapter matrices into specific layers. The product of these matrices is added to the output of the frozen weights, so training only the adapters steers the model toward the desired output. Because the original parameters remain unchanged and far fewer parameters are trainable, memory and computational overhead drop, while quality often stays close to that of full fine-tuning.
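The following sketch shows the LoRA idea on a single layer in plain Python. The dimensions, the scaling factor `alpha`, and the random values are assumptions chosen for illustration; libraries that implement LoRA handle this for you on real model layers.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where W stays frozen."""
    r = len(A)                         # rank = number of rows in A
    base = matvec(W, x)                # frozen pre-trained path
    update = matvec(B, matvec(A, x))   # trainable low-rank path
    return [b + (alpha / r) * u for b, u in zip(base, update)]

rng = random.Random(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]     # trainable
B = [[0.0] * r for _ in range(d_out)]                                # trainable, starts at zero
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing, so the output matches the base model.
y_start = lora_forward(W, A, B, x, alpha)

# Stand-in for training: move B away from zero.
B = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d_out)]
y_adapted = lora_forward(W, A, B, x, alpha)

# After training, the adapter can be folded into W for inference.
scale = alpha / r
merged = [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
           for j in range(d_in)] for i in range(d_out)]
y_merged = matvec(merged, x)

trainable, frozen = r * (d_in + d_out), d_out * d_in
print(trainable, "trainable vs", frozen, "frozen parameters")
```

At this toy size the savings look modest, but because trainable parameters grow with r × (d_in + d_out) rather than d_in × d_out, the gap widens quickly at realistic layer sizes.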

QLoRA (quantized low-rank adaptation)

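QLoRA is a modification of LoRA in which the frozen base model weights are also quantized to a lower precision to further reduce memory overhead. The original QLoRA method stores them in 4 bits instead of 16 or 32, while the LoRA adapters are trained in higher precision.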

Reinforcement fine-tuning (RFT)

This technique uses a grading system to evaluate multiple responses from the model, and the model improves iteratively. Instead of a labeled target output for every prompt, RFT relies on a reward signal: the grader scores each response, and higher-scoring outputs are reinforced, improving the output with each iteration.
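The reward loop can be illustrated with a deliberately tiny toy. Here the "model" is just a softmax over three canned responses, and the grader is a hypothetical scoring function; both are assumptions for illustration. A real RFT setup updates the weights of a language model, not three numbers, but the reinforcement principle is the same.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grade(response: str) -> float:
    """Hypothetical grader: rewards on-topic, concise answers."""
    score = 0.0
    if "shard" in response.lower():
        score += 0.7                      # on topic
    if len(response.split()) <= 12:
        score += 0.3                      # concise
    return score

candidates = [
    "MongoDB is a database.",
    "Sharding splits data across servers, and each shard holds a part.",
    "Sharding is a long story that begins many years ago with early "
    "distributed systems research and goes on.",
]

rewards = [grade(c) for c in candidates]
logits = [0.0] * len(candidates)          # the toy policy's only parameters
initial = softmax(logits)

learning_rate = 1.0
for _ in range(200):
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))   # expected reward
    # Policy-gradient step on the expected reward:
    # d E[reward] / d logit_k = p_k * (reward_k - expected reward)
    logits = [z + learning_rate * p * (r - baseline)
              for z, p, r in zip(logits, probs, rewards)]

final = softmax(logits)
best = max(range(len(candidates)), key=lambda k: final[k])
print(candidates[best], round(final[best], 3))
```

No example of a "correct" answer is ever shown to the policy. It only sees scores, and the probability of the best-graded response rises with each iteration.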

How to fine-tune an LLM

Fine-tuning follows a set of steps:

Choose a pre-trained language model (Hugging Face hosts both models and datasets).
Gather a high-quality dataset relevant to the task you are fine-tuning for, and prepare it by cleaning, tokenizing, and splitting it into training and validation sets.
Generate a fine-tuned model using any of the training techniques described above.
Evaluate the model’s performance on the validation set.
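The data preparation step above can be sketched in a few lines. The records, the 80/20 split, and the helper names are illustrative assumptions, and tokenization is left out because it depends on the base model's tokenizer.

```python
import random

def clean(records):
    """Strip whitespace, drop incomplete rows and exact duplicates (order preserved)."""
    seen, cleaned = set(), []
    for prompt, completion in records:
        pair = (prompt.strip(), completion.strip())
        if all(pair) and pair not in seen:
            seen.add(pair)
            cleaned.append(pair)
    return cleaned

def train_val_split(records, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out a fraction for validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical raw data with one duplicate and one incomplete row.
raw = [
    ("What is sharding?", "Splitting data across servers."),
    ("  What is sharding? ", "Splitting data across servers.  "),
    ("What is an index?", "A structure that speeds up queries."),
    ("What is a replica set?", ""),
    ("What is a document?", "A record stored as field-value pairs."),
    ("What is a collection?", "A group of documents."),
    ("What is an embedding?", "A vector that represents meaning."),
]

cleaned = clean(raw)
train, val = train_val_split(cleaned)
print(len(cleaned), len(train), len(val))
```

Fixing the shuffle seed keeps the validation set stable across runs, which makes evaluation results comparable between fine-tuning experiments.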

Example application using MongoDB Atlas and VoyageAI

TARGET TASK: To generate weekly posts on LinkedIn on the topic “Trending in AI.”

Developments in the field of AI are continuous and dynamic. Our base pre-trained model already covers general language ability, such as grammar, vocabulary, semantics, and answering basic questions, but it does not know our writing style or anything published after it was trained.

For the model to learn a particular writing style, you need to feed it previous articles or posts that exemplify that style, say a friendly, casual, and witty tone with occasional sarcasm. Based on these inputs, the model can be fine-tuned to produce outputs that match that style. The fine-tuned model can then be evaluated and used in various LLM applications.

When you prompt an LLM application to write a post on a particular topic, the application converts the prompt into an embedding (using an embedding model such as one from VoyageAI) and then matches it against the embeddings already stored in the database (in this case, MongoDB Atlas).

To further enhance the output with more context-rich data, you can integrate knowledge dynamically from external sources. This data can also be embedded using VoyageAI and stored in MongoDB Atlas. A major advantage of MongoDB Atlas is that it stores the generated vector embeddings alongside the data, making retrieval easy and efficient.

The database provides the retrieved context, and the fine-tuned model then uses this context to generate the final response in the user’s tone and style.
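As a sketch of what "embeddings alongside the data" can look like, the snippet below builds a document and an Atlas Vector Search aggregation pipeline as plain Python data. The `$vectorSearch` stage and its fields are part of Atlas Vector Search, but the index name, field names, and the tiny 4-dimensional vectors are assumptions for illustration.

```python
def build_vector_search_pipeline(query_vector, index_name="post_vector_index",
                                 path="embedding", limit=3, num_candidates=50):
    """Build an aggregation pipeline for Atlas Vector Search.

    index_name and path are hypothetical; they must match the vector search
    index and the embedding field defined on your own collection.
    """
    return [
        {"$vectorSearch": {
            "index": index_name,
            "path": path,
            "queryVector": query_vector,
            "numCandidates": num_candidates,   # candidates considered before ranking
            "limit": limit,                    # results returned
        }},
        # Return the text and the similarity score, not the raw vector.
        {"$project": {"_id": 0, "text": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]

# The embedding lives in the same document as the content it describes.
post_document = {
    "text": "Three agent frameworks everyone is talking about this week.",
    "topic": "Trending in AI",
    "embedding": [0.12, -0.03, 0.88, 0.41],   # toy vector; real ones are much longer
}

pipeline = build_vector_search_pipeline(query_vector=[0.10, 0.00, 0.90, 0.40])
print(pipeline[0]["$vectorSearch"]["index"])
```

With a driver such as PyMongo, a pipeline like this would be passed to `collection.aggregate(pipeline)` on a collection that has a matching vector search index.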

VoyageAI

VoyageAI provides embedding models, including multimodal ones, that create vector representations of data. Vector representations are key to comparing data based on semantic meaning. Compared to full-text search alone, vector embeddings can often provide higher-quality similarity information between query data and stored data.
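Comparing by semantic meaning usually comes down to measuring how closely two vectors point in the same direction. The sketch below ranks stored items by cosine similarity to a query. The three-dimensional vectors and labels are made up for illustration; real embeddings come from an embedding model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (assumed values, not produced by a real model).
stored = {
    "post about LLM agents": [0.9, 0.1, 0.0],
    "post about vector databases": [0.1, 0.9, 0.1],
    "post about office snacks": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]   # stands in for the embedded user prompt

ranked = sorted(stored, key=lambda name: cosine_similarity(query, stored[name]),
                reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(query, stored[name]), 3))
```

The query has no words in common with any label, yet the agents post ranks first because its vector is closest in direction. That is the property keyword matching alone does not provide.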

VoyageAI creates accurate embeddings for better retrieval, whether the input comes from a knowledge source (RAG) or the user prompt. A distinguishing feature of VoyageAI is that its multimodal models can generate embeddings for several kinds of data, such as text, images, combined text and images, and documents.

Fine-tuning best practices

To improve the accuracy of your fine-tuned model, follow a few important practices:

Choose the right base model. Ask questions like:

What is the purpose of creating the model?
What is the cost involved?
Do we have the right expertise to build the model?

Prioritize access to high-quality, clean, and bias-free data.

Gather reliable data from authentic sources.
Spend more time on data preparation, as data is the key to getting the most accurate responses.

Evaluate the model in different ways.

Do not rely only on metrics or only on human feedback. Use both to clearly understand where improvement is required.

Keep the model open to continuous improvement and feedback.

Conclusion

Fine-tuning uses curated training data to refine a model’s outputs for a specific task on which a generic LLM doesn’t produce accurate enough results. Fine-tuning techniques improve the model’s performance and accuracy and can help reduce bias. MongoDB Atlas and VoyageAI complement a fine-tuned model by storing the vector embeddings generated by VoyageAI alongside the data and performing vector search to return relevant, context-aware results. VoyageAI can also perform reranking to further improve the relevance of results.

Glossary

Data quantization

As the data size and number of parameters in a model increase, so does the complexity. While having more parameters often improves accuracy, it also raises training costs and demands more hardware resources.

To address this, a compression technique called quantization is applied. Quantization reduces the precision of parameters by lowering the number of bits used to represent them. For example, a weight originally stored as 0.00015678 in 32‑bit floating point (FP32) might be approximated as 0.000157 in 16‑bit representation, depending on the scale of quantization.

This reduction decreases memory requirements, improves performance and cache utilization, and speeds up inference.

However, it comes at the cost of accuracy, since lower precision can introduce rounding errors. Therefore, it’s important to strike a trade‑off between efficiency and accuracy depending on the application.
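The trade-off can be observed directly with a short sketch using only the Python standard library. It round-trips the example weight through 16-bit floating point and applies a simple symmetric 8-bit scheme to a small list of weights. The 8-bit scheme and the sample weights are illustrative assumptions, not the method used by any particular library.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision (16 bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]

# 16-bit example: the stored value is close to, but not exactly, the original.
w = 0.00015678
w16 = to_float16(w)
print(w, "->", w16)

# 8-bit example on hypothetical weights.
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each stored integer needs a quarter of the space of a 32-bit float, and the worst-case rounding error is bounded by half the scale, which is the efficiency versus accuracy trade-off described above.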

Low-rank matrix

A low-rank matrix is a matrix whose rank r is much smaller than its dimensions, so it can be written as the product of two much smaller matrices: an m×n matrix is decomposed into an m×r matrix and an r×n matrix. LoRA uses this in an adapter (a trainable module) to greatly reduce the memory footprint and the number of trainable parameters. For example, suppose an original weight matrix has 1024×512 = 524,288 parameters. Instead of modifying all 524,288 parameters, LoRA with rank r = 1 trains two matrices of size 1024×1 and 1×512, reducing the trainable parameters to 1024 + 512 = 1,536.
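The arithmetic generalizes to any rank r, as this small sketch shows. The function names are made up for illustration, and rank 8 is shown only as one other possible choice.

```python
def full_params(m: int, n: int) -> int:
    """Parameters in a dense m x n weight matrix."""
    return m * n

def low_rank_params(m: int, n: int, r: int) -> int:
    """Parameters in an (m x r) matrix plus an (r x n) matrix."""
    return r * (m + n)

def outer_product(column, row):
    """Multiply an (m x 1) column by a (1 x n) row to get an m x n matrix."""
    return [[c * v for v in row] for c in column]

m, n = 1024, 512
print("full:", full_params(m, n))
for r in (1, 8):
    count = low_rank_params(m, n, r)
    print("rank", r, "->", count, "trainable,", full_params(m, n) // count, "x fewer")

# A rank-1 product: every row is a multiple of the same row vector.
small = outer_product([1, 2, 3], [4, 5])
print(small)
```

The small example makes the limitation visible too: a rank-1 product can only express rows that are multiples of one another, so higher ranks trade more parameters for more expressive updates.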

Named entity recognition

Named entity recognition is a natural language processing technique that automatically identifies and classifies information such as names of people, organizations, dates, and locations in unstructured data. It categorizes entities based on the context of the surrounding words. For example, “orange” the fruit and “Orange” the company can be distinguished based on the sentences “I like to eat an orange every day” and “Orange has reached a revenue of 40 billion euros in 2024.”

Tokenization

Tokenization is the process of splitting text (or other inputs, such as images) into small units called tokens, each of which is mapped to a numerical ID that the model can process. The conversion must be precise so that the model correctly interprets the grammar and semantics of the data. The tokenizer used for fine-tuning should match that of the base model so that the vocabulary is interpreted consistently.

Frequently asked questions

Yes, an LLM can be fine-tuned on domain-specific or task-specific datasets to provide more accurate answers in specialized contexts and tasks.

Some of the best practices for LLM fine-tuning are using high quality, bias-free data, choosing the right base model, and iterating over the response after careful evaluation using various metrics as well as human feedback.

An LLM is first trained on huge amounts of data, and then fine-tuned using methods such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), or parameter-efficient techniques like LoRA to adapt it to specific tasks or domains.

Fine-tuning is used to customize LLMs for tasks like domain-specific Q&A, content generation, summarization, sentiment analysis, and adapting outputs to a specific tone or business context.

Related resources

RAG with MongoDB — Discover how to build RAG applications with MongoDB and earn a skill badge

Using MongoDB Atlas and LLM to build enterprise-grade applications — Learn to create generative AI-enriched apps with MongoDB Atlas

Building an agentic RAG using a Fireworks AI hosted LLM and MongoDB Atlas for retrieval — A practical, in-depth guide to build agentic AI using MongoDB and Fireworks AI
