Building RAG Pipelines With Haystack and MongoDB Atlas
Pavel Duchovny • 4 min read • Published Sep 18, 2024 • Updated Sep 18, 2024
Integrating Haystack with MongoDB Atlas allows you to build powerful retrieval-augmented generation (RAG) pipelines. This introductory article guides you through setting up a Haystack-based RAG pipeline that uses MongoDB Atlas for vector search. The code uses a grocery product dataset: given a user's cooking request, the pipeline retrieves relevant products and passes them to the LLM, which generates a detailed guide.
First, install the necessary dependencies:
pip install haystack-ai mongodb-atlas-haystack tiktoken datasets
If you have not created an Atlas cluster, follow our guide.
Set the MongoDB connection string and your OpenAI API key. To get an API key, follow the guide on the OpenAI website.
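import os
import getpass
import re

# Prompt for the Atlas connection string without echoing it to the terminal
conn_str = getpass.getpass("Enter your MongoDB connection string:")

# Set (or replace) the appName parameter; the pattern stops at '&' so later parameters are kept
conn_str = (re.sub(r'appName=[^&\s]*', 'appName=devrel.content.python', conn_str)
            if 'appName=' in conn_str
            else conn_str + ('&' if '?' in conn_str else '?') + 'appName=devrel.content.python')
os.environ['MONGO_CONNECTION_STRING'] = conn_str
# Echo the final connection string (note that this reveals the credentials it contains)
print(os.environ['MONGO_CONNECTION_STRING'])

# The Haystack OpenAI components read the key from the OPENAI_API_KEY environment variable
os.environ['OPENAI_API_KEY'] = getpass.getpass("Enter your OpenAI API key:")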
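The appName handling above packs three cases into one expression. As a minimal sketch, the same logic can be written as a small function and tried on sample strings. The function name `with_app_name` and the host names below are hypothetical and used only for illustration. The pattern here stops at the next `&` so that any parameters after appName are preserved.

```python
import re

APP_NAME = "devrel.content.python"  # the value used in this article


def with_app_name(conn_str: str, app_name: str = APP_NAME) -> str:
    """Return conn_str with its appName parameter set to app_name."""
    if "appName=" in conn_str:
        # Replace only the existing appName value, up to the next '&' or whitespace
        return re.sub(r"appName=[^&\s]*", lambda _: f"appName={app_name}", conn_str)
    # No appName yet: append it with '&' if a query string exists, else start one with '?'
    separator = "&" if "?" in conn_str else "?"
    return f"{conn_str}{separator}appName={app_name}"


# Hypothetical connection strings covering the three cases
examples = [
    "mongodb+srv://user:pass@cluster0.example.mongodb.net/",
    "mongodb+srv://user:pass@cluster0.example.mongodb.net/?retryWrites=true",
    "mongodb+srv://user:pass@cluster0.example.mongodb.net/?appName=Cluster0&w=majority",
]
for example in examples:
    print(with_app_name(example))
```

Keeping the logic in a function makes it easy to check each case before a real connection string, with real credentials, is involved.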
Create a vector index on your database and collection in MongoDB Atlas. For more information and guidance, visit our Atlas Vector Search index docs. In this tutorial, the database is “ai_shop” and the collection is “test_collection”. Ensure that the index name is vector_index and use the following definition:
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
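The "similarity": "cosine" setting means vectors are compared by the angle between them rather than by their length. The sketch below illustrates that measure in plain Python on toy three-dimensional vectors. It is not how Atlas computes or scales its scores internally, and the vectors are made up; real embeddings in this tutorial have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical, for illustration only)
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # longer, but points the same way
orthogonal = [0.0, 1.0, 0.0]       # shares no direction with the query
opposite = [-1.0, 0.0, -1.0]       # points the opposite way

for name, vector in [("same_direction", same_direction),
                     ("orthogonal", orthogonal),
                     ("opposite", opposite)]:
    print(name, round(cosine_similarity(query, vector), 6))
```

A vector pointing the same way as the query scores close to 1 regardless of its length, an unrelated one scores 0, and an opposite one scores close to -1.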
Load documents into MongoDB Atlas using the Haystack framework:
from haystack import Pipeline, Document
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from bson import json_util

# Example dataset
dataset = {
    "train": [
        {"title": "Spinach Lasagna Sheets", "price": "$3.50", "description": "Infused with spinach, these sheets add a pop of color and extra nutrients.", "category": "Pasta", "emoji": "📗"},
        {"title": "Gluten-Free Lasagna Sheets", "price": "$4.00", "description": "Perfect for those with gluten intolerance, made with a blend of rice and corn flour.", "category": "Pasta", "emoji": "🍚🌽"},
        # Add more documents here...
    ]
}

# Wrap each product in a Haystack Document: the title is the text to embed,
# and the full product record is stored as metadata
insert_data = []
for product in dataset['train']:
    doc_product = json_util.loads(json_util.dumps(product))
    haystack_doc = Document(content=doc_product['title'], meta=doc_product)
    insert_data.append(haystack_doc)

# The store reads the connection string from the MONGO_CONNECTION_STRING environment variable
document_store = MongoDBAtlasDocumentStore(
    database_name="ai_shop",
    collection_name="test_collection",
    vector_search_index="vector_index",
)

# Indexing pipeline: embed the documents, then write them to Atlas,
# skipping any document that is already stored
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_embedder = OpenAIDocumentEmbedder()
indexing_pipe = Pipeline()
indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder")
indexing_pipe.add_component(instance=doc_writer, name="doc_writer")
indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents")
indexing_pipe.run({"doc_embedder": {"documents": insert_data}})
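The loader above uses only the product title as the document content, so the title is the text that gets embedded. If retrieval should also take the description and category into account, one option is to compose a longer text per product before creating the Document. The sketch below is an assumption about how that could look, not part of the original tutorial; `product_to_text` is a hypothetical helper, and the choice of fields is arbitrary.

```python
def product_to_text(product, fields=("title", "description", "category")):
    """Join the selected, non-empty product fields into one text, one field per line."""
    parts = [str(product[field]).strip() for field in fields if product.get(field)]
    return "\n".join(parts)


# One record from the example dataset in this article
product = {
    "title": "Spinach Lasagna Sheets",
    "price": "$3.50",
    "description": "Infused with spinach, these sheets add a pop of color and extra nutrients.",
    "category": "Pasta",
    "emoji": "📗",
}

text = product_to_text(product)
print(text)
```

The resulting string could then be passed as `content` in place of the bare title. Longer content gives the embedding more to work with, at the cost of more tokens per document.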
Create a pipeline that will retrieve, augment, and generate a response to user questions:
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever

# Prompt template
prompt_template = """
You are a chef assistant allowed to use the following context documents and only those.\nDocuments:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
\nQuery: {{query}}
\nAnswer:
"""

# Initialize the pipeline
rag_pipeline = Pipeline()

# Add the query embedder and the retriever, and connect them
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())
rag_pipeline.add_component(instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=50), name="retriever")
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

# Add the prompt builder and the LLM, then connect retriever -> prompt -> LLM
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(model="gpt-4o"), name="llm")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
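To get a feel for what the LLM receives, the prompt template can be approximated in plain Python. The sketch below does not use Jinja or PromptBuilder, so the exact whitespace will differ from the real rendering; `render_prompt` is a hypothetical helper written only for illustration. It shows that the template inserts each retrieved document's content, which in this tutorial is the product title.

```python
def render_prompt(doc_contents, query):
    """Approximate the article's prompt template with plain string handling."""
    lines = [
        "You are a chef assistant allowed to use the following context documents and only those.",
        "Documents:",
    ]
    # One indented line per retrieved document, mirroring the template's for-loop
    lines.extend(f"    {content}" for content in doc_contents)
    lines.append(f"Query: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


# Document contents as stored by the indexing step (product titles)
prompt = render_prompt(
    ["Spinach Lasagna Sheets", "Gluten-Free Lasagna Sheets"],
    "How can I cook a lasagne?",
)
print(prompt)
```

Because only the titles reach the prompt, the model has little product detail to draw on. Referencing fields from `doc.meta` in the template is one way to pass more context, if that suits your application.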
Test the pipeline with a sample query:
query = "How can I cook a lasagne?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": query},
        "prompt_builder": {"query": query},
    }
)
print(result['llm']['replies'][0])
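The run call returns a nested dictionary keyed by component name, which is why the reply is read from `result['llm']['replies'][0]`. As a small defensive sketch, the lookup can be wrapped so that a missing component or an empty reply list produces a clear error instead of a KeyError or IndexError. `first_reply` is a hypothetical helper, and the dictionary below is hand-written to mimic the shape used above, not real pipeline output.

```python
def first_reply(result, component="llm"):
    """Return the first reply of the given component from a pipeline result dict."""
    replies = result.get(component, {}).get("replies", [])
    if not replies:
        raise ValueError(f"no replies found for component {component!r}")
    return replies[0]


# Hand-written stand-in for a pipeline result (not real output)
fake_result = {"llm": {"replies": ["Layer the sheets with sauce and bake."]}}
print(first_reply(fake_result))
```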
Sample output (the generated text will vary between runs):
To cook a lasagne, you can follow this classic recipe:
### Ingredients:
#### For the meat sauce:
- 2 tablespoons olive oil
- 1 onion, finely chopped
- 2 cloves garlic, minced
- 500g ground beef
- 800g canned tomatoes, crushed
- 2 tablespoons tomato paste
- 1 teaspoon dried basil
- 1 teaspoon dried oregano
- Salt and pepper to taste
#### For the béchamel sauce:
- 4 tablespoons butter
- 4 tablespoons all-purpose flour
- 500ml milk
- A pinch of nutmeg
- Salt and pepper to taste
#### For assembly:
- 250g lasagne sheets
- 200g mozzarella cheese, shredded
- 1 cup grated Parmesan cheese
- Fresh basil leaves for garnish (optional)
### Instructions:
1. **Preheat the oven** to 375°F (190°C).
2. **Prepare the meat sauce:**
   - Heat the olive oil in a large skillet over medium heat.
   - Add the chopped onion and cook until soft and translucent, about 5 minutes.
   - Stir in the minced garlic and cook for another minute.
   - Add the ground beef and cook until browned, breaking it up with a spoon as it cooks.
   - Stir in the crushed tomatoes, tomato paste, dried basil, and dried oregano.
   - Season with salt and pepper, then reduce the heat to low.
   - Let the sauce simmer for 30 minutes, stirring occasionally.
3. **Prepare the béchamel sauce:**
   - In a medium saucepan, melt the butter over medium heat.
   - Add the flour and whisk continuously for about 2 minutes to create a roux.
   - Gradually add the milk while whisking to prevent lumps from forming.
   - Cook the mixture, whisking constantly, until it thickens, about 5-7 minutes.
   - Season with a pinch of nutmeg, salt, and pepper.
4. **Assemble the lasagne:**
   - Spread a thin layer of the meat sauce on the bottom of a 9x13 inch baking dish.
   - Place a layer of lasagne sheets over the sauce.
   - Spread another layer of meat sauce over the lasagne sheets, followed by a layer of béchamel sauce.
   - Sprinkle some shredded mozzarella cheese over the béchamel sauce.
   - Repeat the layers until all the ingredients are used, finishing with a layer of béchamel sauce and a generous topping of mozzarella and Parmesan cheese.
5. **Bake the lasagne:**
   - Cover the baking dish with aluminum foil.
   - Bake in the preheated oven for 30 minutes.
   - Remove the foil and bake for an additional 15 minutes, or until the top is golden brown and bubbling.
6. **Rest and serve:**
   - Remove the lasagne from the oven and let it rest for 10-15 minutes before slicing.
   - Garnish with fresh basil leaves if desired, and serve.
Enjoy your delicious homemade lasagne!
In this article, you learned how to integrate Haystack with MongoDB Atlas to build a RAG pipeline. This powerful combination allows you to leverage vector search and retrieval-augmented generation to create sophisticated and responsive applications.
To explore more topics on RAG, have a look at the following tutorials:
If you have questions or want to connect with other developers, join us in the MongoDB Developer Community. Thanks for reading.