Depth-first Hybrid Search for GraphRAG
Shounak Acharya, Sneha Das Gupta • 14 min read • Published Feb 01, 2025 • Updated Feb 02, 2025
In the field of retrieval-augmented generation (RAG), five key approaches stand out: pre-filtering, vector search, full-text search, hybrid search, and graph-based retrieval. This article demonstrates a practical use case by applying these methods to a single dataset focused on social support for aging populations. By analyzing their responses to the same user query on a unified platform, we evaluate and showcase the relative accuracy of each approach.
The RAG architecture is effective, enabling AI applications to harness the power of unstructured data while promoting knowledge accessibility and the relevancy of large language model (LLM) output. We place special emphasis on PDF documents, the prevalent file format across industries, which remain notoriously difficult to search effectively.
RAG systems are effective for a variety of knowledge-intensive tasks, including answering questions in a comprehensive and informative way and generating content such as poems, scripts, musical pieces, emails, and letters.
While building full-scale applications for a knowledge base, we hit several limitations, including:
- Difficulty finding close matches: RAG systems can have trouble distinguishing between closely related topics in large datasets, leading to less accurate matches.
- Matching based on wrong criteria: RAG systems may match queries based on broad similarities instead of specific details.
- Need for efficient throughput: As more users come on board, RAG will need to be able to efficiently handle the increased throughput.
RAG lacks a comprehensive understanding of data, relying primarily on similarity scores between query vector embeddings and retrieved semantically similar documents. In this tutorial, we are using GraphRAG with MongoDB, which offers a more robust overall data understanding by capturing associated information and related entities.
GraphRAG is a concept that can enhance the information retrieved by a simple semantic-search-based RAG by also capturing relationships between the named entities that form the ontology underlying the actual text. One way to build such ontologies is LangChain's LLMGraphTransformer, which takes input documents and constructs an ontology consisting of key terms (named entities) and their relationships.
For the purpose of this tutorial, we created a sample ontology by transforming several PDF documents on stress from medical journals. This can be done by placing the PDFs in a single directory, combining all of their pages, and passing them to the graph transformer.
A hybrid search is an aggregation and re-ranking of search results from different information retrieval methods, such as a full-text and a semantic search, for the same query criteria. While full-text search effectively finds exact matches for query terms, semantic search provides the added benefit of identifying semantically similar documents, even if they don't contain the exact query term. This ensures that synonymous and contextually similar matches are also included in the combined results of both search methods.
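As an illustration of the re-ranking idea, reciprocal rank fusion is one common way to merge the two ranked result lists. The sketch below is a generic example, not this article's pipeline:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Each document contributes 1 / (k + rank) per list it appears in;
    # summed scores determine the final, fused ranking.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = reciprocal_rank_fusion([full_text_ids, vector_search_ids])
```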

Depth-first graph traversal explores as far down a branch as possible before backtracking to explore other branches. It starts at a node, moves to an adjacent node, and continues deeper until it reaches a dead end, then backtracks to explore other unvisited nodes.
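In Python, the pattern can be sketched in a few lines (an illustrative helper over an adjacency-list graph, not part of the tutorial's code):

```python
def dfs(graph, node, visited=None):
    # Explore one branch fully before backtracking: visit the node,
    # then recurse into each unvisited neighbor in turn.
    if visited is None:
        visited = set()
    visited.add(node)
    for neighbor in graph.get(node, []):
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited
```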
In this tutorial, we first store the knowledge graph obtained from LangChain's LLMGraphTransformer in MongoDB. We then apply MongoDB's $graphLookup operator to get the hierarchical relationships between these entities. Since this output only provides the node connections and the distances between nodes, without the actual names of the connections, we transform it into a graph data structure plus a hashmap storing the relationship names between the entities. We then traverse the graph depth-first to build a sentence from the hierarchy of relationships, one that spells out the transitive relationship names between these entities. These sentences are combined with relevant text chunks, obtained via vector search on an input question vector against the collections storing the actual text chunks and their embeddings, and are finally given to the LLM as context in a prompt. This ensures that the LLM answers only from the knowledge context of our stored domain, giving more accurate answers with fewer hallucinations.
The following snippet shows how the graph transformer is set up and run:
```python
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.graph_transformers import LLMGraphTransformer

# Create a ChatPromptTemplate object to guide the extraction
template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a scientist analyzing scientific journal articles on the effects of various factors, including stress. Your task is to identify and document all entity relationships within these articles, focusing on relationships involving stress as well as those involving other entities. Capture both direct and indirect relationships, exploring connections up to at least five levels deep. Ensure that you only document relationships explicitly mentioned in the articles. Be aware that relationships can be bidirectional. Exclude any nodes related to the author, publisher, and references."""
        )
    ]
)

# Instantiate the graph transformer, specifying the LLM and the prompt to guide it
graph_transformer = LLMGraphTransformer(
    llm=llm,
    prompt=template
)

# Accumulate one Document per page
all_documents = []
for page in all_pages:
    document = Document(page_content=page)
    all_documents.append(document)

# Convert all accumulated documents to graph documents at once
graph_docs = graph_transformer.convert_to_graph_documents(all_documents)
```
This yields nodes and relationships in the following format:
```python
nodes = {
    "node1": Node(id='Stress', type='Concept'),
    "node2": Node(id='Heart Disease', type='Disease'),
    "node3": Node(id='Burnout', type='Condition'),
    "node4": Node(id='Mental Health', type='Aspect'),
    "node5": Node(id='Physical Activity', type='Behavior'),
    "node6": Node(id='Job Satisfaction', type='Emotional state'),
    "node7": Node(id='Obesity', type='Condition'),
    "node8": Node(id='Aging', type='Condition'),
    # ... Redacted for conciseness
}

links = {
    "rel1": Relationship(source=Node(id='Immune System', type='System'), target=Node(id='Inflammation', type='Condition'), type='REDUCES'),
    "rel2": Relationship(source=Node(id='Inflammation', type='Condition'), target=Node(id='Heart Disease', type='Disease'), type='CAUSES'),
    "rel3": Relationship(source=Node(id='Stress', type='Concept'), target=Node(id='Sleep Quality', type='Health aspect'), type='DECREASES'),
    "rel4": Relationship(source=Node(id='Stress', type='Condition'), target=Node(id='Diet', type='Lifestyle factor'), type='WORSENS'),
    # ... Redacted for conciseness
}
```
Once we get the nodes and relationships from the LLMGraphTransformer, it is time to transform our dataset into a format optimized for GraphRAG. This entails the following parts:
- Creating a data structure in the database to store the nodes and relationships obtained in the step above, so that we can run a graph traversal over the entities with MongoDB's $graphLookup operator.
- Chunking, vectorizing, and storing the original PDF documents with their embeddings to allow semantic search for general-purpose RAG.
- Finally, tying the most relevant nodes back to the most relevant PDF chunks, so that once we retrieve the most relevant chunks through vector search for a user query, we can also generate the relationships of those chunks' most relevant nodes through $graphLookup, and then feed the chunks, as well as these relationships, as context to the LLM. This yields not only a semantically correct answer but also the provenance of that information through the linked knowledge entities.
Let’s now look at how to implement these steps one by one.
While there are multiple ways to achieve this, the most generic way is to save all nodes in the same collection and take advantage of polymorphism. MongoDB's $graphLookup leverages this concept by walking through documents in a collection that are related by a common field, as shown in the diagram below:

To perform the graph lookup, we differentiate the document types within the same collection using a type field, while keeping a common field across all document types that can be traversed sequentially. For example, in the diagram above, the same collection holds two different document types, differentiated by their name fields.
Also, if both documents have a third field called belongs_to, then a person (Storz, in this case) can be related to a factory by having belongs_to = X347BC-Large, and the factory can be related to a department name, say logistics, by having belongs_to = logistics. This is what the $graphLookup operator utilizes to traverse the nodes in the collection: it walks the belongs_to field to reach the next node whose identifying value matches it, and from there reaches the third node whose identifying value equals the belongs_to of the current node. This is done recursively until no more documents are found or we reach a configurable maximum depth. So, in this example: Storz -belongs_to-> X347BC-Large -belongs_to-> Logistics.
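To make this concrete, here is a sketch of what those documents might look like; the field names other than belongs_to are illustrative:

```python
# Illustrative shapes only: polymorphic documents in one collection,
# chained by the common belongs_to field that $graphLookup walks.
person = {"_id": "Storz", "type": "person", "belongs_to": "X347BC-Large"}
factory = {"_id": "X347BC-Large", "type": "factory", "belongs_to": "Logistics"}
department = {"_id": "Logistics", "type": "department"}
```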
However, if we look closely, the exact relationship name is not captured by the $graphLookup operator. So, while we know that all the documents are related via the belongs_to field, we cannot say what the relationship actually is, for example, that Storz works for X347BC-Large, and that X347BC-Large is part of the Logistics department.

Coming back to our medical journals example, we not only need to store the nodes and capture the relationships between them using a common named field, but we also need to return the relationship names in the $graphLookup output. Hence, we transform the nodes and relationships to capture this information, so that the MongoDB documents look like the one below. We store them in a collection called nodes_relationships:
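An example document, reconstructed from the fields described next (the values are illustrative, taken from the stress ontology above):

```python
{
    "_id": "Stress:Concept",              # node id + type from the transformer output
    "type": "Concept",
    "relationships": [                    # _ids of directly connected nodes
        "Sleep Quality:Health aspect",
        "Diet:Lifestyle factor"
    ],
    "targets": {                          # relationship names per connected node
        "Sleep Quality:Health aspect": ["DECREASES"],
        "Diet:Lifestyle factor": ["WORSENS"]
    }
}
```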
What we see here is that every document has an _id, which is the concatenation of the node's id and type from the LLMGraphTransformer output. Each node also has a relationships field that captures the _id of every other node it is directly connected to; $graphLookup uses this as the connectFromField and _id as the connectToField when doing the recursive lookups. Additionally, there is a targets field that captures the relationship names for each of the connections listed in the relationships field. For example, "Heart Disease:Disease" is related to "Blood Pressure:Condition" in that the former increases the latter. Similarly, "Burnout:Condition" relates to "Mental Health:Condition" by deteriorating/affecting it. There can be multiple relationship names because that is what the transformer extracted, and it is something that commonly comes up in graph scenarios. Note that we must save these names so we can augment the $graphLookup output with them, which in turn gives the LLM more knowledge-graph context. If we look closely, the multiple relationship names between two given nodes are usually semantically equivalent, such as affects/deteriorates, so when providing context to the LLM, you can take any of the names from the list of relationship names between the nodes.

To vectorize and store the PDF documents as chunks in MongoDB, you need to follow these steps:
- Load and extract text from PDF
- Split text into chunks
- Generate embeddings
- Store chunks and embeddings in MongoDB collection called knowledge_graph

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_mongodb import MongoDBAtlasVectorSearch

pdf_files = ['Pdf1', 'pdf2', ..., 'pdfn']  # placeholder list of PDF paths
for pdf_file in pdf_files:
    # Load the PDF as one Document per page (loader choice is an assumption)
    pages = PyPDFLoader(pdf_file).load()
    # Split the pages into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
    docs = text_splitter.split_documents(pages)
    # Embed the chunks and store them; db_config holds connection/collection settings
    MongoDBAtlasVectorSearch.from_documents(docs, OpenAIEmbeddings(), **db_config)
```

For this step, we take each node obtained in Step 1 and convert its _id so that "Stress:Concept" becomes "Stress as a Concept." Essentially, we replace the ":" with the string "as a" so that we can run a full-text search of these transformed node ids against the text of the PDF chunks.
The following diagram depicts the algorithm:

So, for each chunk, we find the most relevant converted node_ids (node_ids with ":" replaced by "as a") and then add the original node_ids (with ":" intact), along with their search scores, into a tags array field on the knowledge_graph documents, as shown above and sketched below.
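A minimal sketch of this tagging step, assuming an Atlas Search index named "default" on the chunks field and a knowledge_graph collection handle (both names are assumptions):

```python
for node_id in node_ids:                            # e.g. "Stress:Concept"
    search_phrase = node_id.replace(":", " as a ")  # -> "Stress as a Concept"
    # Full-text search of the transformed node id against the chunk text
    matches = knowledge_graph.aggregate([
        {"$search": {"index": "default",
                     "text": {"query": search_phrase, "path": "chunks"}}},
        {"$limit": 5},
        {"$project": {"score": {"$meta": "searchScore"}}},
    ])
    for doc in matches:
        # Store the original node id (with ":") plus its search score as a tag
        knowledge_graph.update_one(
            {"_id": doc["_id"]},
            {"$push": {"tags": {"tagName": node_id, "score": doc["score"]}}},
        )
```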
Before we move into the specifics of this step, let's understand how $graphLookup operates. The general syntax, as we use it, is as follows:

```python
[
    {
        '$match': {
            '_id': node_name
        }
    },
    {
        '$graphLookup': {
            'from': 'nodes_relationships',
            'startWith': '$_id',
            'connectFromField': 'relationships',
            'connectToField': '_id',
            'as': 'relates_to',
            'maxDepth': max_depth,
            'depthField': 'distance',
            'restrictSearchWithMatch': {}
        }
    }
]
```
This pipeline first finds the document in the nodes_relationships collection whose _id equals node_name (which can be obtained in various ways). From that document, it takes each element of the relationships array and looks for the document whose _id matches that element's value. This continues recursively until either no documents remain or the max_depth value is reached. The following diagram explains the process:

Here, we start with node_name = "Stress" and recursively traverse the relationships array across documents, up to a maximum depth of 2. The reader is strongly encouraged to go through the MongoDB documentation on $graphLookup to understand this concept further. One thing to notice, however, is that even though $graphLookup can traverse documents based on a field value, it doesn't return the actual relationship names along the traversed path. We solve this by transforming the $graphLookup output into a graph data structure and doing a depth-first search on it, as we will see further on.
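For reference, a minimal way to run this pipeline from Python might look like the following; the helper name, connection string, and database name are ours:

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://...")  # connection string elided
nodes_relationships = client["langchain_db"]["nodes_relationships"]  # db name assumed

def graph_lookup(node_name, max_depth=2):
    # Run the $graphLookup pipeline shown above for a single starting node.
    pipeline = [
        {"$match": {"_id": node_name}},
        {"$graphLookup": {
            "from": "nodes_relationships",
            "startWith": "$_id",
            "connectFromField": "relationships",
            "connectToField": "_id",
            "as": "relates_to",
            "maxDepth": max_depth,
            "depthField": "distance",
            "restrictSearchWithMatch": {},
        }},
    ]
    return list(nodes_relationships.aggregate(pipeline))
```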
In this step, we combine vector search and $graphLookup to give the LLM a richer context, one that draws on the overarching knowledge across all the documents and the knowledge graph we created in the previous steps.
The first step is to take the question from the user, vectorize it, and do a vector search on the knowledge_graph collection with all the PDF chunks. The vector search operates upon the embedding field, which contains the vectors of the chunks obtained in Step 2.
```python
question = "How is Social support Related to Aging?"

k = 3
multiplier = 10

# embedding_create_backoff wraps the OpenAI embeddings call with retry/backoff
embeddings = embedding_create_backoff(
    input=question,
    model="text-embedding-ada-002"
)

query_vector = embeddings.data[0].embedding

agg_pipeline = [
    {
        "$vectorSearch": {
            "index": 'vector_index',
            "path": "embedding",
            "queryVector": query_vector,
            "limit": k,
            "numCandidates": k * multiplier,
        },
    },
    {
        "$project": {"embedding": 0}
    }
]
```
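The pipeline is then executed against the collection holding the chunks; the collection handle below is an assumption consistent with the earlier steps:

```python
# Run the $vectorSearch pipeline on the knowledge_graph collection (handle assumed)
tag_docs = list(knowledge_graph.aggregate(agg_pipeline))
```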
This will return an array of the top three documents whose chunk vectors resemble our question, as shown below:
```python
[{'_id': ObjectId('66e2d52d00e19b410567a4e8'),
  'chunks': 'numerous definitions of social support have been examined (e.g., size, \nquality, and intensity of the support network, type of relation, frequency of\nReproduced with permission of the copyright owner. Further reproduction prohibited without permission.',
  'source': './PDF_KG/Stress and Headache.pdf',
  'pdf': {'version': '1.10.100',
          'info': {'PDFFormatVersion': '1.7',
                   'IsAcroFormPresent': False,
                   'IsXFAPresent': False,
                   'Title': 'Stress and Headache.',
                   'Author': 'Thomas Hamilton Mosley Jr',
                   'Creator': 'Appligent AppendPDF Pro 6.3',
                   'Producer': 'Prince 14.2 (www.princexml.com)',
                   'CreationDate': "D:20230906103014-07'00'",
                   'ModDate': "D:20230906103014-07'00'"},
          'metadata': {'_metadata': {'dc:format': 'application/pdf',
                                     'dc:creator': 'Thomas Hamilton Mosley Jr',
                                     'dc:title': 'Stress and Headache.',
                                     'pdf:producer': 'Prince 14.2 (www.princexml.com)',
                                     'pdf:appligent': 'AppendPDF Pro 6.3 Linux 64 bit Aug 30 2019 Library 15.0.4',
                                     'xmp:creatortool': 'Appligent AppendPDF Pro 6.3',
                                     'xmp:modifydate': '2023-09-06T10:30:14-07:00',
                                     'xmp:createdate': '2023-09-06T10:30:14-07:00',
                                     'xmp:metadatadate': '2023-09-06T10:30:14-07:00',
                                     'pdfuaid:part': '1',
                                     'xmpmm:documentid': 'uuid:15f91e58-b7e4-11b2-0a00-c04a83010000',
                                     'xmpmm:instanceid': 'uuid:15f91e59-b7e4-11b2-0a00-d0b25ff8ff7f'}},
          'totalPages': 201},
  'loc': {'pageNumber': 57, 'lines': {'from': 24, 'to': 26}},
  'tags': [{'tagName': 'Social Support:Social factor', 'score': 4.098155975341797},
           {'tagName': 'Social Support:Behavior', 'score': 2.827536106109619},
           {'tagName': 'Social Support:Factor', 'score': 2.827536106109619},
           {'tagName': 'Sleep Quality:Health aspect', 'score': 2.2594480514526367},
           {'tagName': 'Social Relationships:Condition', 'score': 1.2706201076507568},
           {'tagName': 'Social Support:Aspect', 'score': 2.827536106109619},
           {'tagName': 'Sleep Quality:Condition', 'score': 2.2594480514526367},
           {'tagName': 'Social Support:Activity', 'score': 2.827536106109619},
           {'tagName': 'Sleep Quality:Aspect', 'score': 2.2594480514526367},
           {'tagName': 'Social Support:Condition', 'score': 2.827536106109619}]},
 ...
]
```
Now that we have all the top-matched documents, we take the document with the highest vector search score and select the tag with the highest tag score within that document. In the output above, for example, we would select the tag "Social Support:Social factor," as its score of 4.09 is greater than that of any other tag for this chunk. This ensures that we have the most relevant node to which this chunk is related.
We take this node and then do a $graphLookup on the nodes_relationships collection. This generates an output like the one below:

```python
[{'_id': 'Social Support:Social factor',
  'relates_to': [{'_id': 'Inflammation:Biological process',
                  'distance': 2,
                  'relationships': ['Heart Disease:Disease'],
                  'targets': {'Heart Disease:Disease': ['CONTRIBUTES_TO']},
                  'type': 'Biological process'},
                 {'_id': 'Exercise:Activity',
                  'distance': 2,
                  'relationships': ['Physical Health:Condition',
                                    'Stress:Concept',
                                    'Stress:Condition'],
                  'targets': {'Physical Health:Condition': ['IMPROVES'],
                              'Stress:Concept': ['REDUCES'],
                              'Stress:Condition': ['DECREASES', 'REDUCES']},
                  'type': 'Activity'},
                 ...],
  'relationships': ['Stress:Concept',
                    'Mental Health:Condition',
                    'Stress:Condition',
                    'Mental Health:Health aspect'],
  'targets': {'Mental Health:Condition': ['IMPROVES'],
              'Mental Health:Health aspect': ['IMPROVES'],
              'Stress:Concept': ['REDUCES'],
              'Stress:Condition': ['DECREASES', 'REDUCES']},
  'type': 'Social factor'}]
```
If we look at the $graphLookup output above, we get various nodes that are at a distance of 2 from the "Social Support:Social factor" node. However, we cannot tell from this output which intermediary node connects "Social Support:Social factor" with, say, "Inflammation:Biological process." The "targets" field on "Inflammation:Biological process" does tell us that it CONTRIBUTES_TO "Heart Disease:Disease." So, while $graphLookup gives us the path, it does not give us the relationship names between the nodes on that path. This is where we take advantage of the "targets" field. We transform the above output and create two data structures:
The graph: This data structure contains the relationships between any two consecutive nodes as shown in the snippet below:
```python
{'Inflammation:Condition': ['Heart Disease:Disease',
                            'Diabetes:Disease'],
 'Job Satisfaction:Condition': ['Burnout:Condition',
                                'Productivity:Condition',
                                'Stress:Condition'],
 'Social Support:Condition': ['Mental Health:Condition',
                              'Stress:Condition'],
 'Diet:Behavior': ['Stress:Concept',
                   'Inflammation:Condition',
                   'Stress:Condition'],
 ...
}
```
What this holds is, for example, “Inflammation” is connected to “Heart Disease” and “Diabetes.” Similarly, “Social Support” is connected to “Mental Health” and “Stress.”
The hashmap: This holds the relationship names between pairs of nodes, as shown below:
```python
{('Inflammation:Condition', 'Heart Disease:Disease'): {'CAUSES',
                                                       'CONTRIBUTES',
                                                       'CONTRIBUTES_TO',
                                                       'INCREASES_RISK'},
 ('Inflammation:Condition', 'Diabetes:Disease'): {'CONTRIBUTES_TO'},
 ('Job Satisfaction:Condition', 'Burnout:Condition'): {'DECREASES', 'REDUCES'},
 ('Job Satisfaction:Condition', 'Productivity:Condition'): {'INCREASES'},
 ('Job Satisfaction:Condition', 'Stress:Condition'): {'INFLUENCES', 'REDUCES'},
 ('Social Support:Condition', 'Mental Health:Condition'): {'IMPROVES'},
 ('Social Support:Condition', 'Stress:Condition'): {'DECREASES', 'REDUCES'},
 ('Diet:Behavior', 'Stress:Concept'): {'INFLUENCES'},
 ('Diet:Behavior', 'Inflammation:Condition'): {'REDUCES'},
 ('Diet:Behavior', 'Stress:Condition'): {'INFLUENCES'},
 ...
}
```
So, “Inflammation” Causes/Contributes/Contributes_To/Increases_Risk of “Heart Disease.”
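These two structures can be derived from the $graphLookup output with a small fold like the following sketch (the function name is ours; it assumes the output shape shown earlier):

```python
from collections import defaultdict

def build_graph_structures(lookup_doc):
    # Fold one $graphLookup result document into (1) an adjacency dict and
    # (2) a hashmap of relationship names keyed by (parent, child) tuples.
    graph = defaultdict(list)
    rel_names = defaultdict(set)
    for node in [lookup_doc] + lookup_doc.get("relates_to", []):
        for child in node.get("relationships", []):
            graph[node["_id"]].append(child)
            rel_names[(node["_id"], child)].update(
                node.get("targets", {}).get(child, [])
            )
    return dict(graph), dict(rel_names)
```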
Now, we do a depth-first traversal of the graph and construct a relationship string. Every time we visit a node, the parent node's name and the child node's name are appended to the string, with the relationship names between them written in parentheses and separated by slashes. So in the above case, starting with "Social Support:Social factor" and traversing the graph depth-first yields the following string:
“Social Support (REDUCES / DECREASES) Stress, Stress (WORSENS) Diet, Diet (IMPROVES) Physical Health, Work Environment (AFFECTS / INFLUENCES) Job Satisfaction, Job Satisfaction (INCREASES) Productivity, Productivity (INCREASES) Employee Turnover, Stress (WORSENS) Diet, ”
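A minimal sketch of a traversal that produces strings in this format, assuming the graph dict and relationship-name hashmap built in the previous step (the function name is ours):

```python
def build_relationship_string(graph, rel_names, start, visited=None):
    # Depth-first walk that appends "Parent (REL1 / REL2) Child, " for each
    # edge encountered, mirroring the sentence format shown above.
    if visited is None:
        visited = set()
    visited.add(start)
    sentence = ""
    for child in graph.get(start, []):
        names = " / ".join(sorted(rel_names.get((start, child), set())))
        # Drop the ":Type" suffix from node ids for readability
        sentence += f"{start.split(':')[0]} ({names}) {child.split(':')[0]}, "
        if child not in visited:
            sentence += build_relationship_string(graph, rel_names, child, visited)
    return sentence
```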
Now, using the chunk of the top-scoring document from the vector search on the question vector, together with the relationship string we obtained above, we send the prompt to the LLM. The prompt that we use looks like the following:
```python
chunk = tag_docs[top_id_idx]['chunks']
chunk = chunk + f"\n Entity Relationships: {path}"
prompt = f'''
You are a doctor who specialises in mental health and psychiatry. Please use the below context to answer {question}
Please note that the below text has relationship between various entities written after the text Entity Relationships:
The relationships are written in comma separated sentences with the relationship names between the entities written in brackets. There can be multiple
entity names which are given by slashes. You can choose any of the names to answer the question. Please only answer the question based upon the relationships
provided and context given and not from your own previous knowledge. Try and link the relationships with the context, while giving the answer.
'''
# chat_completion_backoff wraps the OpenAI chat completions call with retry/backoff
response = chat_completion_backoff(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": chunk},
    ]
)
```
This gives the LLM the overarching context we obtained by maintaining the knowledge graph, together with the chunk that most relates to the question at hand. This helps ensure the LLM answers with the exact knowledge from your domain and greatly reduces hallucination. The output looks like the below:
“Social support is related to aging primarily through its effects on stress. According to the given relationships, Social Support REDUCES Stress. Since stress can negatively impact various aspects of health, including diet, which in turn IMPROVES Physical Health, the presence of social support can thus contribute positively to overall health and well-being as one ages. This reduction in stress through social support can help mitigate some of the health challenges associated with aging.”
As we can see, the LLM replies from the knowledge context provided to it and not from its own prior knowledge, reducing hallucinations and strengthening guardrails.
The depth-first traversal can also be used to generate the nodes and relationships that d3.js requires to display the graph, as we did in the tutorial MongoDB Knowledge Base RAG Visualization. The following figure shows one of the depth-first traversal branches that was generated and displayed with d3.js.

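If you want to produce such a visualization yourself, a conversion along these lines is one possibility; the {nodes, links} shape follows d3 convention, and the function and field names are ours:

```python
def to_d3(graph, rel_names):
    # Convert the adjacency dict and relationship-name hashmap into the
    # {nodes, links} shape commonly used by d3.js force layouts.
    ids = sorted(set(graph) | {c for children in graph.values() for c in children})
    nodes = [{"id": i, "group": i.split(":")[-1]} for i in ids]
    links = [
        {"source": parent,
         "target": child,
         "label": " / ".join(sorted(rel_names.get((parent, child), set())))}
        for parent, children in graph.items() for child in children
    ]
    return {"nodes": nodes, "links": links}
```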
As shown in this tutorial, naive or default RAG has significant limitations in providing contextually correct answers to a user's question. The most important reason is that we give the LLM only a limited amount of information (the text corresponding to the most relevant chunks) and rely on the LLM's reasoning capabilities to produce the best answer from that limited context. Mechanisms that capture the overarching relationships among the entities in the knowledge base and share them with the LLM during prompt augmentation, an approach known as GraphRAG, improve the accuracy of the answers.
GraphRAG provides valuable insight into integrating graph structures with retrieval-augmented generation, and it is particularly effective at managing complex data relationships. MongoDB's capabilities for representing and querying graph data, along with the graph algorithms implemented here, enhance the data retrieval process. This tutorial showed one such approach, which can be refined further. Ultimately, it aims to empower teams to leverage GraphRAG in their own projects, driving innovation and improving the efficiency of data-driven applications.
There are other mechanisms as well. For example, recursive chunking, which captures parent-child relationships between smaller chunks and their parent chunks and provides them as added context, also helps improve accuracy. We shall cover such improvements and chunking strategies in upcoming tutorials, so stay tuned.
Interested in a role on MongoDB’s Partner Presales team? We have several open roles on our teams across the globe and would love for you to transform your career with us!