
Perform Parent Document Retrieval with the LangChain Integration

You can integrate Atlas Vector Search with LangChain to perform parent document retrieval. In this tutorial, you complete the following steps:

  1. Set up the environment.

  2. Prepare the data.

  3. Instantiate the parent document retriever.

  4. Create the Atlas Vector Search index.

  5. Use the retriever in a RAG pipeline.

Work with a runnable version of this tutorial as a Python notebook.

Parent document retrieval is a retrieval technique in which large documents are split into smaller sub-documents. With this technique, you query the smaller chunks but return the full parent documents to the LLM. This can improve the responses of your RAG agents and applications by allowing more granular searches over small chunks while still giving the LLM the full context of the parent document.

Parent document retrieval with MongoDB stores parent and child documents in a single MongoDB collection, which supports efficient retrieval because you compute and index embeddings only for the child documents.

To complete this tutorial, you must have the following:

  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.

  • An environment to run interactive Python notebooks such as Colab.

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.

To set up your notebook environment:

1

Run the following command to install the required packages:

pip install --quiet --upgrade langchain langchain-community langchain-core langchain-mongodb langchain-openai pymongo pypdf
2

Run the following code, replacing the placeholders with your OpenAI API key and your Atlas cluster's connection string:

import os
os.environ["OPENAI_API_KEY"] = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
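If your database username or password contains characters that are reserved in URIs, percent-encode them before placing them in the connection string. A minimal sketch using Python's standard library (the credentials and hostname below are hypothetical placeholders, not values from this tutorial):

```python
from urllib.parse import quote_plus

# Hypothetical credentials containing URI-reserved characters
username = quote_plus("my user")        # "my+user"
password = quote_plus("p@ssw0rd/")      # "p%40ssw0rd%2F"

# Hypothetical cluster hostname for illustration
connection_string = f"mongodb+srv://{username}:{password}@sample.abcde.mongodb.net"
print(connection_string)
```

PyMongo recommends `urllib.parse.quote_plus` for escaping usernames and passwords in connection URIs.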

Paste and run the following code in your notebook to load and chunk a sample PDF that contains a recent MongoDB earnings report.

This code uses a text splitter to chunk the PDF data into smaller parent documents. It specifies the chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks) for each document.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/12881/pdf")
data = loader.load()
# Chunk into parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=20)
docs = parent_splitter.split_documents(data)
# Print a document
docs[0]
Document(metadata={'source': 'https://investors.mongodb.com/node/12881/pdf', 'page': 0, 'page_label': '1'}, page_content='MongoDB, Inc. Announces Third Quarter Fiscal 2025 Financial Results\nDecember 9, 2024\nThird Quarter Fiscal 2025 Total Revenue of $529.4 million, up 22% Year-over-Year\nContinued Strong Customer Growth with Over 52,600 Customers as of October 31, 2024\nMongoDB Atlas Revenue up 26% Year-over-Year; 68% of Total Q3 Revenue\nNEW YORK , Dec. 9, 2024 /PRNewswire/ -- MongoDB, Inc. (NASDAQ: MDB) today announced its financial results for the third quarter ended October\n31, 2024.\n\xa0\n \xa0\n"MongoDB\'s third quarter results were significantly ahead of expectations on the top and bottom line, driven by better-than-expected EA performance\nand 26% Atlas revenue growth.\xa0 We continue to see success winning new business due to the superiority of MongoDB\'s developer data platform in\naddressing a wide variety of mission-critical use cases," said Dev Ittycheria, President and Chief Executive Officer of MongoDB .\n"We continue to invest in our legacy app modernization and AI offerings as our document model and distributed architecture are exceptionally well\nsuited for customers looking to build highly-performant, modern applications.\xa0 MongoDB is in a great position to be a central pillar of the emerging AI\ntech stack and benefit from the next wave of application development in the years ahead."\nThird Quarter Fiscal 2025 Financial Highlights\nRevenue: Total revenue was $529.4 million for the third quarter of fiscal 2025, an increase of 22% year-over-year.\nSubscription revenue was $512.2 million, an increase of 22% year-over-year, and services revenue was $17.2 million, an\nincrease of 18% year-over-year.\nGross Profit: Gross profit was $394.0 million for the third quarter of fiscal 2025, representing a 74% gross margin\ncompared to 75% in the year-ago period. 
Non-GAAP gross profit was $405.7 million, representing a 77% non-GAAP gross\nmargin, consistent with a non-GAAP gross margin of 77% in the year-ago period.')

In this section, you instantiate the parent document retriever and use it to ingest data into Atlas.

MongoDBAtlasParentDocumentRetriever chunks parent documents into smaller child documents, embeds the child documents, and then ingests both parent and child documents into the same collection in Atlas. Under the hood, this retriever creates the following:

  • An instance of MongoDBAtlasVectorSearch, a vector store that handles vector search queries against the child documents.

  • An instance of MongoDBDocStore, a document store that handles storing and retrieving the parent documents.

1

The quickest way to configure MongoDBAtlasParentDocumentRetriever is to use the from_connection_string method. This code specifies the following parameters:

  • connection_string: Your Atlas connection string, used to connect to your cluster.

  • child_splitter: The text splitter used to chunk the parent documents into smaller child documents.

  • embedding_model: The embedding model used to embed the child documents.

  • database_name and collection_name: The names of the database and collection to ingest the documents into.

  • The following optional parameters to configure the MongoDBAtlasVectorSearch vector store:

    • text_key: The field in the documents that contains the text to embed.

    • relevance_score_fn: The relevance scoring function to use for vector search queries.

    • search_kwargs: How many child documents to retrieve in the initial search.

from langchain_mongodb.retrievers import MongoDBAtlasParentDocumentRetriever
from langchain_openai import OpenAIEmbeddings
# Define the embedding model to use
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Define the chunking method for the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
# Specify the database and collection name
database_name = "langchain_db"
collection_name = "parent_document"
# Create the parent document retriever
parent_doc_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(
    connection_string = ATLAS_CONNECTION_STRING,
    child_splitter = child_splitter,
    embedding_model = embedding_model,
    database_name = database_name,
    collection_name = collection_name,
    text_key = "page_content",
    relevance_score_fn = "dotProduct",
    search_kwargs = { "k": 10 },
)
2

Then, run the following code to ingest the documents into Atlas by using the retriever's add_documents method. It takes the parent documents as input and ingests both parent and child documents based on how you configured the retriever.

parent_doc_retriever.add_documents(docs)
3

After you run the sample code, you can view the documents in the Atlas UI by navigating to the langchain_db.parent_document collection in your cluster.

Both parent and child documents have a page_content field that contains the chunked text. Child documents also have an additional embedding field that contains the vector embeddings of the chunked text, and a doc_id field that corresponds to the _id of the parent document.

You can run the following queries in the Atlas UI, replacing the <id> placeholder with a valid document ID:

  • To see child documents that share the same parent document ID:

    { doc_id: "<id>" }
  • To see the parent document of those child documents:

    { _id: "<id>" }
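The relationship that these two queries traverse can be sketched with plain Python dictionaries that mirror the stored document shapes. The field values and IDs below are illustrative, and the embedding vectors are truncated:

```python
# Illustrative shape of a parent document (no embedding field)
parent = {
    "_id": "parent-1",
    "page_content": "MongoDB, Inc. Announces Third Quarter Fiscal 2025 Financial Results ...",
}

# Illustrative shapes of child documents: each carries an embedding
# and points back to its parent via doc_id
children = [
    {"_id": "child-1", "doc_id": "parent-1",
     "page_content": "MongoDB, Inc. Announces Third Quarter", "embedding": [0.01, -0.02]},
    {"_id": "child-2", "doc_id": "parent-1",
     "page_content": "Fiscal 2025 Financial Results", "embedding": [0.03, 0.04]},
]

# Equivalent of the { doc_id: "<id>" } query: children of one parent
siblings = [c for c in children if c["doc_id"] == parent["_id"]]
print(len(siblings))  # 2
```

Because only the child documents have an embedding field, only they are matched by vector search; the doc_id field is what lets the retriever hand back the full parent afterwards.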

Note

To create an Atlas Vector Search index, you must have Project Data Access Admin or higher access to the Atlas project.

To enable vector search queries on the langchain_db.parent_document collection, you must create an Atlas Vector Search index. You can use either the LangChain helper method or the PyMongo driver method. Run the following code in your notebook for your preferred method:

# Get the vector store instance from the retriever
vector_store = parent_doc_retriever.vectorstore
# Use helper method to create the vector search index
vector_store.create_vector_search_index(
    dimensions = 1536 # The dimensions of the vector embeddings to be indexed
)

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel
# Connect to your cluster
client = MongoClient(ATLAS_CONNECTION_STRING)
collection = client[database_name][collection_name]
# Create your vector search index model, then create the index
vector_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1536,
                "similarity": "dotProduct"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch"
)
collection.create_search_index(model=vector_index_model)

The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
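If you want your notebook to block until the index leaves initial sync, you can poll the index status with PyMongo's list_search_indexes method, which reports queryable: true once the index is ready. The helper below is a sketch; its name, polling interval, and timeout are our own choices, not part of the tutorial:

```python
import time

def wait_for_index_ready(collection, index_name, poll_seconds=5, timeout=300):
    """Poll list_search_indexes until the named index reports queryable=True."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable"):
            return indexes[0]
        time.sleep(poll_seconds)
    raise TimeoutError(f"Index {index_name!r} was not queryable after {timeout}s")
```

For example, calling wait_for_index_ready(collection, "vector_index") after create_search_index returns once the index can serve queries.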

Once Atlas builds the index, you can run vector search queries on your data and use the retriever in your RAG pipeline. Paste and run the following code in your notebook to implement a sample RAG pipeline that performs parent document retrieval:

1

To see the most relevant documents for a given query, paste and run the following code to perform a sample vector search query on the collection. The retriever searches for relevant child documents that are semantically similar to the string AI technology, and then returns the corresponding parent documents of those child documents.

parent_doc_retriever.invoke("AI technology")
[Document(metadata={'_id': '492a138c-1309-4791-a0d0-282d34ea1e55', 'source': 'https://investors.mongodb.com/node/12881/pdf', 'page': 1, 'page_label': '2'}, page_content='downturns and/or the effects of rising interest rates, inflation and volatility in the global economy and financial markets on our business and future\noperating results; our potential failure to meet publicly announced guidance or other expectations about our business and future operating results; our\nlimited operating history; our history of losses; failure of our platform to satisfy customer demands; the effects of increased competition; our\ninvestments in new products and our ability to introduce new features, services or enhancements; our ability to effectively expand our sales and\nmarketing organization; our ability to continue to build and maintain credibility with the developer community; our ability to add new customers or\nincrease sales to our existing customers; our ability to maintain, protect, enforce and enhance our intellectual property; the effects of social, ethical and\nregulatory issues relating to the use of new and evolving technologies, such as artificial intelligence, in our offerings or partnerships; the growth and\nexpansion of the market for database products and our ability to penetrate that market; our ability to integrate acquired businesses and technologies\nsuccessfully or achieve the expected benefits of such acquisitions; our ability to maintain the security of our software and adequately address privacy\nconcerns; our ability to manage our growth effectively and successfully recruit and retain additional highly-qualified personnel; and the price volatility of'),
Document(metadata={'_id': 'a937204a-0e85-4827-ac63-124735529d51', 'source': 'https://investors.mongodb.com/node/12881/pdf', 'page': 1, 'page_label': '2'}, page_content='that it obtained the AWS Modernization Competency designation and launched a MongoDB University course focused on\nbuilding AI applications with MongoDB and AWS. At Microsoft Ignite, MongoDB announced new technology integrations for\nAI, data analytics, and automating database deployments across on-premises, cloud, and edge environments.\nLaunched in July 2024, the MongoDB AI Applications Program (MAAP) is designed to help companies unleash the power\nof their data and to take advantage of rapidly advancing AI technologies. We recently announced that Capgemini,\nConfluent, IBM, Unstructured, and QuantumBlack, AI by McKinsey have joined the MAAP ecosystem, offering customers\nadditional integration and solution options.\nExecutive Leadership Update\nMichael Gordon, MongoDB\'s Chief Operating Officer and Chief Financial Officer, will be stepping down at the end of the Company\'s fiscal year on\nJanuary 31, 2025, and afterwards will serve as an advisor to ensure a smooth transition. The Company has commenced an executive search process\nfor a new CFO and will evaluate internal and external candidates.\xa0 Serge Tanjga, MongoDB\'s Senior Vice President of Finance, will serve as interim\nCFO starting February 1st if a permanent successor has not been named by that date.\nDev Ittycheria commented, "On behalf of everyone at MongoDB , I want to thank Michael for everything he has done to contribute to our success in his\nnearly 10 years with the company.\xa0 In Michael\'s time here, MongoDB had a successful IPO, has grown revenue nearly 50x and has successfully\nscaled the business model to generate meaningful operating leverage. 
Michael has also built out a world-class finance team that I am confident will\ndeliver a smooth transition to a new CFO in the coming months."\nMichael Gordon said, "I am incredibly proud of what we have accomplished as a team in my almost ten years with the company.\xa0 While we have')]

To learn more about vector search query examples with LangChain, see Run Vector Search Queries.

2

To create and run the RAG pipeline with the parent document retriever, paste and run the following code. This code does the following:

  • Defines a LangChain prompt template to instruct the LLM to use the retrieved documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {query} variable.

  • Constructs a chain that specifies the following:

    • The parent document retriever you configured to retrieve relevant parent documents.

    • The prompt template that you defined.

    • An LLM from OpenAI to generate a context-aware response. By default, this is the gpt-3.5-turbo model.

  • Prompts the chain with a sample query and returns the response. The generated response might vary.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
# Define a prompt template
template = """
Use the following pieces of context to answer the question at the end.
{context}
Question: {query}?
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI()
# Construct a chain to answer questions on your data
chain = (
    {"context": parent_doc_retriever, "query": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
# Prompt the chain
query = "In a list, what are MongoDB's latest AI announcements?"
answer = chain.invoke(query)
print(answer)
1. MongoDB obtained the AWS Modernization Competency designation.
2. MongoDB launched a MongoDB University course focused on building AI applications with MongoDB and AWS.
3. MongoDB announced new technology integrations for AI, data analytics, and automating database deployments across various environments.
4. MongoDB launched the MongoDB AI Applications Program (MAAP) to help companies harness the power of data and future AI technologies.
5. Capgemini, Confluent, IBM, Unstructured, and QuantumBlack joined the MAAP ecosystem to offer customers additional integration and solution options.