Local-first and Reasoning-enhanced RAG With DeepSeek and MongoDB
DeepSeek’s latest release has taken the AI ecosystem by storm, and for good reason. It has introduced state-of-the-art models and open-sourced them, making frontier AI more accessible than ever. One of these models is a reasoning model comparable to those from the best AI labs—until now, such standalone and open-weight models were rare. As if that weren’t enough, DeepSeek claims to have trained these models at a fraction of the cost of the leading model counterparts, challenging the economics of the industry.
In this article, we will explore what makes DeepSeek’s models unique and how they can be leveraged locally for enhanced retrieval-augmented generation (RAG) with MongoDB.
DeepSeek has three models:
- DeepSeek-V3: General-purpose LLM on par with GPT-4o and Claude 3.5 Sonnet
- DeepSeek Coder: Specialized LLM focused on generating and completing code
- DeepSeek-R1: Reasoning model comparable to OpenAI-o1
The DeepSeek models are unique in several respects. Let’s start with DeepSeek-V3, which is praised for its scalability and cost-efficiency compared to alternatives thanks to its Mixture of Experts (MoE) architecture. This means that out of its 671 billion parameters, only 37 billion are activated at a time, depending on the task. For this reason, DeepSeek-V3 represents a truly cost-efficient alternative to frontier lab models for handling general tasks. This model is the result of multiple optimizations and innovative techniques, including the use of knowledge distillation from another model, DeepSeek-R1, during post-training.
In simple terms, the researchers developed a novel method to enhance a standard model like DeepSeek-V3 by transferring reasoning skills from another reasoning model, DeepSeek-R1. As a result, DeepSeek-V3 becomes significantly better at reasoning while maintaining its general-purpose efficiency and practical usability. This leads us to the DeepSeek-R1 family of models, marking the first generation of reasoning models developed by this AI lab. Like its standard V3 counterpart, R1 is built on an MoE architecture, activating 37 billion parameters out of a total of 671 billion. What sets reasoning models apart is their ability to use chain-of-thought (CoT) “thinking” before generating outputs, significantly improving performance and accuracy in solving logical problems. In addition to open-sourcing the models, the DeepSeek team transparently provided the different methods and techniques behind their technical breakthroughs.
Notably, they successfully trained a model to reason step by step without relying on massive supervised datasets (which are expensive to produce). Instead, they used reinforcement learning with an innovative reward model, enabling the model to autonomously refine its reasoning and allocate the needed computation time based on the query. But that’s not all. While DeepSeek-R1 is a large-scale reasoning model, the team also applied its distillation technique (as described above) to other smaller open-source models, namely Qwen and Llama. By leveraging samples generated by DeepSeek-R1, they fine-tuned smaller models to mimic its reasoning process, significantly enhancing performance. According to DeepSeek benchmarks, these distilled models show remarkable results, with DeepSeek-R1-Distill-Qwen-32B achieving performance comparable to OpenAI o1-mini. The distilled models, ranging from 1.5 billion to 70 billion parameters, have also been open-sourced by DeepSeek and unlock new use cases by leveraging state-of-the-art reasoning capabilities for a fraction of the cost locally or by self-hosting.
Retrieval-augmented generation is an AI approach that combines document retrieval with generative models to generate more accurate and contextually relevant responses. To learn more about RAG, see Retrieval-Augmented Generation (RAG) with Atlas Vector Search.
A local RAG solution, where the various components of the AI stack run on a local machine or, more broadly, on-premises, is cost-effective and keeps every step of the pipeline, from embedding to generation, under your control.
In many cases, such self-hosted solutions are essential, particularly in regulation-heavy industries like healthcare and defense. These solutions address critical security and privacy concerns and provide greater control over the overall infrastructure for organizations that still run their applications on-premises. By releasing more performant and efficient models, DeepSeek opens new avenues for applications, including powerful on-device applications that can run on consumer hardware such as a MacBook with an M1 chip or later. In the case of integrating models like the DeepSeek-R1 distilled models, using a reasoning model can enhance problem-solving capabilities over technical or research subjects. Moreover, unlike proprietary models such as the OpenAI o1 family, the DeepSeek-R1 family shows the chain of thought it used to arrive at its conclusions, providing greater explainability and transparency in its outputs.
Let’s now dive into an implementation example. This tutorial aims to build a mini-application that allows you to upload your own PDF files, including your private or sensitive documents, and interact with them through questions and answers. To meet local requirements, the entire stack will be deployed locally using:
- Ollama for running the models: nomic-embed-text as the embedding model and the DeepSeek-R1 8B distilled model (deepseek-r1:8b) as the LLM. Depending on your machine, you could go up to the 14B or 32B distilled variants.
- A local deployment of MongoDB Atlas for storing documents and vectors, as well as performing vector search.
- LangChain to orchestrate the back end and LLMs.
- Streamlit to create the user interface.
- Data: Any PDF document.
While this tutorial features a local-first deployment, you have the option to either create a local Atlas deployment by using the Atlas CLI or deploy a cluster on the cloud. The Atlas CLI is the command-line interface for MongoDB Atlas, and you can use it to interact with Atlas from the terminal for various tasks, including creating local Atlas deployments. To learn more, see Manage Local and Cloud Deployments from the Atlas CLI. Additionally, the different models used in this tutorial are downloaded and run using Ollama, an open-source ecosystem for local LLM development.
To complete this tutorial, you must have the following:
- MongoDB Command Line Database Tools installed
- Ollama installed and running
- Python v3.10 or later
For this tutorial, you’ll want to create a directory where you can save the different files—for example, local-rag-mongodb.
Then, download the following two models from Ollama:
```bash
ollama pull nomic-embed-text
ollama pull deepseek-r1:8b
```
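You can verify that both models are available locally by listing the models Ollama has downloaded:

```bash
ollama list
```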
Run the following command and follow the prompts to create a local deployment.
```bash
atlas deployments setup
```
For detailed instructions, see Create a Local Atlas Deployment.
Alternatively, you can also use Docker to run a local deployment of MongoDB Atlas:
```bash
docker run -d -e JAVA_TOOL_OPTIONS="-XX:UseSVE=0" -p 27017:27017 mongodb/mongodb-atlas-local
```
Once running, you can connect to your deployment using:
```bash
atlas deployments connect
```
Multiple options are available to connect and manage deployments, including MongoDB Compass for visually interacting with MongoDB Atlas deployments. For more commands to manage Atlas deployments, see Manage a Local Atlas Deployment.
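If you prefer connecting with an explicit connection string, for instance from MongoDB Compass or a driver, the Docker deployment above maps the default port 27017, which matches the connection string used later in this tutorial's config.yaml (a deployment created with the Atlas CLI may be assigned a different port):

```bash
mongosh "mongodb://localhost:27017/?directConnection=true"
```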
Then, using either MongoDB Shell or MongoDB Compass, let’s create a database, `knowledge_base`, and a collection, `documents`. Using MongoDB Shell, run the `mongosh` command, which should automatically find and connect to your local deployment.
```bash
mongosh
```
```
> use knowledge_base
switched to db knowledge_base
> db.createCollection("documents")
```
This step is optional since both the database and collection are created automatically when the first document is added.
Once the environment is set up, and with the models downloaded locally using Ollama and MongoDB Atlas running locally, we can proceed to run the application.
As we noted in the initial list of required components, the application is developed using LangChain to orchestrate LLM requests, MongoDB Atlas Vector Search for retrieval, and Streamlit for the user interface.
Clone the GenAI-Showcase repository and navigate to the application directory:

```bash
git clone https://github.com/mongodb-developer/GenAI-Showcase.git
cd GenAI-Showcase/apps/local-rag-pdf
```
First, we recommend running the application within a Python virtual environment to isolate the project from the global Python environment.
You can use the `venv` package, which comes with your Python installation:

```bash
python -m venv local-rag-env
source local-rag-env/bin/activate
```

Depending on your operating system, the Python executable might be `python3` instead of `python`.

Then, install the required libraries:
```bash
pip install -r requirements.txt
```
Once installed, you can launch the application:
```bash
streamlit run app.py
```
Upload your own PDF or download the DeepSeek-R1 research paper as an example, and start asking questions directly about its content:
The main parameters are defined in the config.yaml file:
```yaml
llm_model: "deepseek-r1:8b"
embedding_model: "nomic-embed-text"
mongo_connection_str: "mongodb://localhost:27017/?directConnection=true"
database_name: "knowledge_base"
collection_name: "documents"
```
The application is then initialized by retrieving the configuration file variables, such as the different models used via Ollama—here, nomic-embed-text for the embedding model and deepseek-r1:8b for the large language model, as well as the MongoDB connection information, including the database name and collection name.
```python
config = load_config(config_file)

# Read values from config
llm_model = config["llm_model"]
embedding_model = config["embedding_model"]
mongo_connection_str = config["mongo_connection_str"]
database_name = config["database_name"]
collection_name = config["collection_name"]
```
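The `load_config` helper isn't shown here; a minimal sketch of how such a function could read config.yaml with PyYAML looks like the following (the repository's implementation may differ):

```python
import yaml

def load_config(config_file: str) -> dict:
    # Load application settings from a YAML configuration file
    with open(config_file, "r") as f:
        return yaml.safe_load(f)
```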
The splitting parameters for chunking documents and the prompt template used to query the LLM are as follows:
```python
self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
self.prompt = ChatPromptTemplate.from_template(
    """
    You are a helpful assistant answering questions based on the uploaded document and the conversation.
    Conversation History:
    {conversation_history}
    Context from Documents:
    {context}
    Question: {question}
    """
)
```
Then, the vector store connection and its options are set, followed by the creation of the vector search index.
This is needed to perform vector search queries against the vector embeddings based on semantic similarity.
```python
# Initialize the vector store with MongoDB Atlas
self.vector_store = MongoDBAtlasVectorSearch(
    collection=self.collection,
    embedding=self.embeddings,
    index_name="vector_index",
    relevance_score_fn="cosine",
)
self.vector_store.create_vector_search_index(dimensions=768)
```
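The snippet above assumes that `self.collection` and `self.embeddings` have already been initialized. A minimal sketch of that wiring, assuming the pymongo and langchain-ollama packages, could look like this (the actual application may structure it differently):

```python
from pymongo import MongoClient
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Values loaded from config.yaml earlier in the tutorial
client = MongoClient(mongo_connection_str)
collection = client[database_name][collection_name]

# Local models served by Ollama
embeddings = OllamaEmbeddings(model=embedding_model)  # nomic-embed-text produces 768-dimensional vectors
model = ChatOllama(model=llm_model)                   # deepseek-r1:8b
```

Next, the uploaded PDF is loaded and split into chunks: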
```python
docs = PyPDFLoader(file_path=pdf_file_path).load()
chunks = self.text_splitter.split_documents(docs)
```
PDF documents are read using LangChain PyPDFLoader and split according to the chunk size and overlap parameters defined during initialization. Finally, the documents, along with their embeddings, are stored in MongoDB Atlas.
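With the LangChain vector store in place, persisting the chunks and their embeddings comes down to a single call, roughly:

```python
# Embeds each chunk with nomic-embed-text and inserts it into the documents collection
self.vector_store.add_documents(chunks)
```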
The last step is to send the user prompt to the LLM alongside the documents retrieved based on their relevance to the query. First, an embedding of the query is generated, then a vector search is performed to retrieve the most relevant documents.
```python
# Generate and log query embeddings
query_embedding = self.embeddings.embed_query(query)

retrieved_docs = self.retriever.invoke(query)
```
The number of documents retrieved and the similarity score threshold are set to 5 and 0.2 by default; feel free to tweak them and assess the results.
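Those defaults correspond to how the retriever is built from the vector store. A sketch of that configuration using LangChain's `as_retriever` might look like this:

```python
# Return up to 5 chunks whose similarity score is at least 0.2
self.retriever = self.vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.2},
)
```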
The main prompt is structured using LangChain, including the conversation history, retrieved context documents, and the user prompt.
```python
# Format the input for the LLM, including conversation history
formatted_input = {
    "conversation_history": "\n".join(conversation_history) if conversation_history else "",
    "context": "\n\n".join(doc.page_content for doc in retrieved_docs),
    "question": query,
}

# Build the RAG chain
chain = (
    RunnablePassthrough()  # Passes the input as-is
    | self.prompt          # Formats the input for the LLM
    | self.model           # Queries the LLM
    | StrOutputParser()    # Parses the LLM's output
)
```
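The chain is then invoked with the formatted input. Because DeepSeek-R1 wraps its chain of thought in `<think>` tags, you may want to separate the reasoning from the final answer before displaying it; one possible approach (not necessarily how the app handles it) is:

```python
import re

# Run the RAG chain: prompt formatting -> DeepSeek-R1 via Ollama -> string output
response = chain.invoke(formatted_input)

# DeepSeek-R1 emits its reasoning between <think> tags; strip it to keep only the final answer
answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
```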
This detailed guide has explored the latest DeepSeek models, including the DeepSeek-R1 family of reasoning models and their smaller distilled versions, which have shown impressive performance while running on relatively modest consumer hardware.
These models not only deliver exceptional results given their minimal footprint, but they also democratize access to advanced reasoning with some of the first open-weight, high-performance reasoning models available. These advancements are a testament to the highly competitive nature of foundational models and set the stage for a rapid pace of innovation in the near future. Notably, this progress paves the way for the deployment of local-first RAG applications leveraging both MongoDB as the knowledge base and these advanced reasoning models to achieve performance comparable to frontier lab models. This tutorial demonstrated how accessible and practical it is to implement a local solution with DeepSeek and MongoDB, even on consumer-grade hardware.
Learn how MongoDB Atlas takes RAG applications to the next level by exploring advanced RAG and agentic systems in MongoDB's expert webinar: Expanding LLM Applications: From RAG to Agentic Systems