Get Started with the Haystack Integration

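You can integrate Atlas Vector Search with Haystack to build custom applications with LLMs and implement retrieval-augmented generation (RAG). This tutorial demonstrates how to start using Atlas Vector Search with Haystack to perform semantic search on your data and build a RAG implementation. Specifically, you perform the following actions:

Set up the environment.
Create an Atlas Vector Search index.
Store custom data in Atlas.
Implement RAG by using Atlas Vector Search to answer questions on your data.

Work with a runnable version of this tutorial as a Python notebook.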

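Background

Haystack is a framework for building custom applications with LLMs, embedding models, and vector search. By integrating Atlas Vector Search with Haystack, you can use Atlas as a vector database and use Atlas Vector Search to implement RAG by retrieving semantically similar documents from your data. To learn more about RAG, see Retrieval-Augmented Generation (RAG) with Atlas Vector Search.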

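Prerequisites

To complete this tutorial, you must have the following:

An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An environment to run interactive Python notebooks, such as Colab.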

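Set Up the Environment

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.

To set up your notebook environment:

Install and import dependencies.

Run the following command: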
```
pip install --quiet --upgrade mongodb-atlas-haystack pymongo
```

Run the following code to import the required packages:

import os
from haystack import Pipeline, Document
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

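Define environment variables.

Run the following code, replacing the placeholders with the following values:

Your OpenAI API key.
Your Atlas cluster's SRV connection string.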

os.environ["OPENAI_API_KEY"] = "<api-key>"
os.environ["MONGO_CONNECTION_STRING"] = "<connection-string>"

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
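
If your database username or password contains reserved characters such as `@`, `:`, or `/`, they must be percent-encoded before they go into the connection string. The following is a minimal sketch that uses only the Python standard library. The helper name `build_srv_uri`, the credentials, and the host are illustrative assumptions, not values from this tutorial.

```python
from urllib.parse import quote_plus


def build_srv_uri(username: str, password: str, host: str) -> str:
    """Return an SRV connection string with percent-encoded credentials.

    Hypothetical helper for illustration. `host` stands for the
    <clusterName>.<hostname>.mongodb.net part of the connection string.
    """
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Made-up credentials and host, used only to show the encoding.
uri = build_srv_uri("app_user", "p@ss:w/rd", "cluster0.example.mongodb.net")
print(uri)
```

You could then assign the resulting string to `os.environ["MONGO_CONNECTION_STRING"]` instead of typing the encoded form by hand.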

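Create the Atlas Vector Search Index

Note

To create an Atlas Vector Search index, you must have Project Data Access Admin or higher access to the Atlas project.

In this section, you create the haystack_db database and test collection to store your custom data. Then, to enable vector search queries on your data, you create an Atlas Vector Search index.

Run the following code to connect to your Atlas cluster.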

client = MongoClient(os.environ.get("MONGO_CONNECTION_STRING"))

Create the `haystack_db.test` collection.

Run the following code to create the haystack_db database and test collection.

# Create your database and collection
db_name = "haystack_db"
collection_name = "test"
database = client[db_name]
database.create_collection(collection_name)
# Define collection
collection = client[db_name][collection_name]
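
Note that PyMongo's `create_collection` raises `CollectionInvalid` when the collection already exists, so running the cell above a second time fails. One way to make the step re-runnable is to check `list_collection_names()` first. The sketch below assumes a hypothetical helper named `ensure_collection` that accepts a PyMongo `Database` object, and it assumes that no other client creates the collection between the check and the call.

```python
def ensure_collection(database, name: str):
    """Create `name` only if it is missing, then return the collection.

    Hypothetical helper for illustration; `database` is expected to be a
    PyMongo Database (anything with list_collection_names(),
    create_collection(), and item access works).
    """
    if name not in database.list_collection_names():
        database.create_collection(name)
    return database[name]
```

In the notebook, this could replace the two creation lines, for example `collection = ensure_collection(client[db_name], collection_name)`.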

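Define the Atlas Vector Search index.

Run the following code to create an index of the vectorSearch type. The embedding field contains the embeddings that you'll create using OpenAI's text-embedding-ada-002 embedding model. The index definition specifies 1536 vector dimensions and measures similarity using cosine.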

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
   definition={
      "fields": [
         {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1536,
            "similarity": "cosine"
         }
      ]
   },
   name="vector_index",
   type="vectorSearch"
)
collection.create_search_index(model=search_index_model)
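
For intuition about the `cosine` setting: cosine similarity compares the direction of two vectors and ignores their length. The sketch below computes it in plain Python on made-up 3-dimensional vectors (the real embeddings have 1536 dimensions). The function name is illustrative, and the value shown is the raw cosine, which is not necessarily the score that Atlas reports for a query.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # v scaled by 2: similarity is close to 1
orthogonal = [3.0, 0.0, -1.0]     # dot product with v is 0: similarity is 0

print(cosine_similarity(v, same_direction))
print(cosine_similarity(v, orthogonal))
```

Because scaling a vector does not change the result, cosine is a common choice when only the direction of an embedding carries meaning.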

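The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.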
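
Rather than waiting a fixed amount of time, you can poll the index status from the notebook. The following sketch assumes PyMongo 4.5 or later, where `Collection.list_search_indexes()` is available and each returned index document carries a `queryable` flag. The helper name `wait_until_queryable` and its timeout values are hypothetical.

```python
import time


def wait_until_queryable(collection, index_name, timeout_s=300.0, poll_s=5.0):
    """Poll until the named search index reports that it is queryable.

    Hypothetical helper for illustration; `collection` is expected to be a
    PyMongo Collection. Returns True once the index is queryable and False
    if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In the notebook, a call such as `wait_until_queryable(collection, "vector_index")` before the query step avoids running a search against an index that is still building.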

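Store Custom Data in Atlas

In this section, you instantiate Atlas as a vector database, also called a document store. Then, you create vector embeddings from custom data and store these documents in a collection in Atlas. Paste and run the following code snippets in your notebook.

Instantiate Atlas as a document store.

Run the following code to instantiate Atlas as a document store. This code establishes a connection to your Atlas cluster and specifies the following. By default, the document store reads the connection string from the MONGO_CONNECTION_STRING environment variable that you set earlier, which is why it does not appear in the code.

haystack_db and test as the Atlas database and collection used to store the documents.
vector_index as the index used to run semantic search queries.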

document_store = MongoDBAtlasDocumentStore(
   database_name="haystack_db",
   collection_name="test",
   vector_search_index="vector_index",
)

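Load sample data on your Atlas cluster.

This code defines a few sample documents and runs a pipeline with the following components:

An embedder from OpenAI to convert your documents into vector embeddings.
A document writer to populate your document store with the sample documents and their embeddings.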

# Create some example documents
documents = [
   Document(content="My name is Jean and I live in Paris."),
   Document(content="My name is Mark and I live in Berlin."),
   Document(content="My name is Giorgio and I live in Rome."),
]
# Initializing a document embedder to convert text content into vectorized form.
doc_embedder = OpenAIDocumentEmbedder()
# Setting up a document writer to handle the insertion of documents into the MongoDB collection.
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
# Creating a pipeline for indexing documents. The pipeline includes embedding and writing documents.
indexing_pipe = Pipeline()
indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder")
indexing_pipe.add_component(instance=doc_writer, name="doc_writer")
# Connecting the components of the pipeline for document flow.
indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents")
# Running the pipeline with the list of documents to index them in MongoDB.
indexing_pipe.run({"doc_embedder": {"documents": documents}})

Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.16it/s]
{'doc_embedder': {'meta': {'model': 'text-embedding-ada-002',
   'usage': {'prompt_tokens': 32, 'total_tokens': 32}}},
 'doc_writer': {'documents_written': 3}}
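
The writer above uses `DuplicatePolicy.SKIP`, which skips any document whose ID already exists in the store. When you don't set an ID, Haystack derives it from the document's field values, so running the indexing cell again with the same content should not create duplicates. The sketch below is a simplified stand-in for that behavior: it uses a plain dictionary as the store and a SHA-256 hash of the content as the ID, which is an assumption for illustration and not Haystack's exact scheme.

```python
import hashlib


def content_id(content: str) -> str:
    """Toy content-derived document ID (not Haystack's exact scheme)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def write_skip_duplicates(store: dict, contents: list) -> int:
    """Write documents whose ID is not in `store` yet; return the number written."""
    written = 0
    for content in contents:
        doc_id = content_id(content)
        if doc_id in store:
            continue  # same ID already stored: skip it, as a SKIP policy would
        store[doc_id] = content
        written += 1
    return written


store = {}
docs = [
    "My name is Jean and I live in Paris.",
    "My name is Mark and I live in Berlin.",
    "My name is Giorgio and I live in Rome.",
]
first_run = write_skip_duplicates(store, docs)
second_run = write_skip_duplicates(store, docs)
print(first_run, second_run)  # 3 0
```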

Tip

After running the sample code, you can view your vector embeddings in the Atlas UI by navigating to the haystack_db.test collection in your cluster.

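Answer Questions on Your Data

This section demonstrates how to implement RAG in your application with Atlas Vector Search and Haystack.

The following code defines and runs a pipeline with the following components:

The OpenAITextEmbedder embedder to create an embedding from your query.
The MongoDBAtlasEmbeddingRetriever retriever to retrieve the documents in your document store whose embeddings are similar to the query embedding.
The PromptBuilder to pass a prompt template that instructs the LLM to use the retrieved documents as the context for your prompt.
The OpenAIGenerator generator to generate a context-aware response using an LLM from OpenAI.

In this example, you prompt the LLM with the sample query Where does Mark live?. The LLM generates an accurate, context-aware response from the custom data you stored in Atlas.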

# Prompt template that passes the retrieved documents to the LLM as context.
prompt_template = """
    You are an assistant allowed to use the following context documents.\nDocuments:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    \nQuery: {{query}}
    \nAnswer:
"""
# Setting up a retrieval-augmented generation (RAG) pipeline for generating responses.
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())
# Adding a component for retrieving related documents from MongoDB based on the query embedding.
rag_pipeline.add_component(instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store,top_k=15), name="retriever")
# Building prompts based on retrieved documents to be used for generating responses.
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
# Adding a language model generator to produce the final text output.
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
# Connecting the components of the RAG pipeline to ensure proper data flow.
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
# Run the pipeline
query = "Where does Mark live?"
result = rag_pipeline.run(
  {
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query},
  })
print(result['llm']['replies'][0])

Mark lives in Berlin.
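
To see what the retriever and prompt builder contribute, the following sketch reproduces the two steps in plain Python: rank the stored documents by similarity to the query embedding, keep at most `top_k` of them, and paste their text into the prompt. The 2-dimensional "embeddings" are made up by hand, and the function names are hypothetical. The real pipeline uses OpenAI embeddings and runs the search in Atlas.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def retrieve(query_vec, docs, top_k):
    """Return up to `top_k` document texts, most similar first."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return [d["content"] for d in ranked[:top_k]]


def build_prompt(contents, query):
    """Assemble a prompt that lists the context documents before the query."""
    context = "\n".join(contents)
    return f"Documents:\n{context}\nQuery: {query}\nAnswer:"


# Hand-picked toy vectors, not real embeddings.
docs = [
    {"content": "My name is Jean and I live in Paris.", "embedding": [1.0, 0.1]},
    {"content": "My name is Mark and I live in Berlin.", "embedding": [0.1, 1.0]},
    {"content": "My name is Giorgio and I live in Rome.", "embedding": [0.7, 0.7]},
]
query_vec = [0.0, 1.0]  # stands in for the embedding of "Where does Mark live?"

context = retrieve(query_vec, docs, top_k=15)
print(build_prompt(context, "Where does Mark live?"))
```

Note that `top_k` is an upper bound: with only three documents stored, `top_k=15` passes all three to the LLM, ordered by similarity. A smaller value keeps the prompt shorter as the collection grows.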

Next Steps

MongoDB also provides the following developer resources:
