Retrieval-Augmented Generation (RAG) with Atlas Vector Search
Retrieval-augmented generation (RAG) is an architecture used to augment large language models (LLMs) with additional data so that they can generate more accurate responses. You can implement RAG in your generative AI applications by combining an LLM with a retrieval system powered by Atlas Vector Search.
Why Use RAG?
When working with LLMs, you might encounter the following limitations:
Stale data: LLMs are trained on a static dataset up to a certain point in time. This means they have a limited knowledge base and might use outdated data.
No access to local data: LLMs don't have access to local or personalized data. As a result, they can lack knowledge about specific domains.
Hallucinations: When training data is incomplete or outdated, LLMs can generate inaccurate information.
You can implement RAG to address these limitations by taking the following steps:
Ingestion: Store your custom data as vector embeddings in a vector database, such as MongoDB Atlas. This allows you to create a knowledge base of up-to-date, personalized data.
Retrieval: Retrieve semantically similar documents from the database based on the user's question by using a search solution such as Atlas Vector Search. These documents augment the LLM with additional, relevant data.
Generation: Prompt the LLM. The LLM uses the retrieved documents as context to generate a more accurate and relevant response, reducing hallucinations.
Because RAG enables tasks such as question answering and text generation, it's an effective architecture for building AI chatbots that provide personalized, domain-specific responses. To create production-ready chatbots, you must configure a server to route requests and build a user interface on top of your RAG implementation.
RAG with Atlas Vector Search
To implement RAG with Atlas Vector Search, you ingest your data into Atlas, retrieve documents with Atlas Vector Search, and generate responses with an LLM. This section describes the components of a basic, or naive, RAG implementation with Atlas Vector Search. For step-by-step instructions, see Get Started.
Ingestion
Data ingestion for RAG involves processing your custom data and storing it in a vector database to prepare it for retrieval. To create a basic ingestion pipeline with Atlas as the vector database, do the following:
Load the data.
Process (chunk) the data. Chunking involves splitting your data into smaller parts to improve performance.
Convert the data to vector embeddings.
Store the data and embeddings in Atlas. You can store embeddings as a field alongside your other data in a collection. A minimal sketch of this pipeline follows.
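The following minimal Python sketch illustrates the shape of such a pipeline. It assumes the nomic-embed-text-v1 model and the rag_db.test namespace used in the notebook examples later on this page; custom-data.txt is a hypothetical local data file, and the naive character-based chunker is a stand-in for whatever splitter you choose:

from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

# Assumption: the same open-source embedding model used in the Python examples below
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

def chunk(text, size=400, overlap=20):
    """Naive character-based chunking with overlap between consecutive chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

client = MongoClient("<connection-string>")
raw_text = open("custom-data.txt").read()                    # 1. Load the data (hypothetical file)
chunks = chunk(raw_text)                                     # 2. Process (chunk) the data
docs = [{"text": c, "embedding": model.encode(c).tolist()}   # 3. Convert the data to embeddings
        for c in chunks]
client["rag_db"]["test"].insert_many(docs)                   # 4. Store data and embeddings in Atlas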
Retrieval
Building a retrieval system involves searching for and returning the most relevant documents from your vector database to augment the LLM. To retrieve relevant documents with Atlas Vector Search, you convert the user's question into vector embeddings and run a vector search query against your data in Atlas to find documents with the most similar embeddings.
To perform basic retrieval with Atlas Vector Search, do the following:
Define an Atlas Vector Search index on the collection that contains your vector embeddings.
Choose one of the following methods to retrieve documents based on the user's question:
Use an Atlas Vector Search integration with a popular framework or service. These integrations include built-in libraries and tools that let you easily build retrieval systems with Atlas Vector Search.
Build your own retrieval system. You can define your own functions and pipelines to run Atlas Vector Search queries specific to your use case, as sketched after this section.
To learn how to build a basic retrieval system with Atlas Vector Search, see Get Started.
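If you build your own retrieval system, a minimal sketch of the retrieval function might look like the following. It reuses the client, collection, and embedding model from the ingestion sketch above, and assumes an Atlas Vector Search index named vector_index on the embedding field:

def retrieve(question, limit=5):
    """Embed the question and return the most semantically similar documents."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",                        # index defined on the collection
                "queryVector": model.encode(question).tolist(), # embed the user's question
                "path": "embedding",
                "numCandidates": 100,                           # candidates considered per query
                "limit": limit,
            }
        },
        {"$project": {"_id": 0, "text": 1}},
    ]
    return list(client["rag_db"]["test"].aggregate(pipeline))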
Generation
To generate responses, combine your retrieval system with an LLM. After you perform a vector search to retrieve relevant documents, you provide the user's question along with the relevant documents as context to the LLM so that it can generate a more accurate response.
Choose one of the following methods to connect to an LLM:
Use an Atlas Vector Search integration with a popular framework or service. These integrations include built-in libraries and tools that help you connect to an LLM with minimal setup.
Call the LLM's API. Most AI providers offer APIs for their generative models that you can use to generate responses.
Load an open-source LLM. If you don't have API keys or credits, you can use an open-source LLM by loading it locally from your application. For an example implementation, see the Build a Local RAG Implementation with Atlas Vector Search tutorial.
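To make the flow concrete, here is a minimal sketch of the generation step under the same assumptions as the sketches above; the InferenceClient call and the Mistral model mirror the Hugging Face examples later on this page, and retrieve is the function sketched in the previous section:

from huggingface_hub import InferenceClient

def generate_answer(question):
    """Prompt the LLM with the user's question plus retrieved documents as context."""
    context = " ".join(doc["text"] for doc in retrieve(question))
    prompt = (f"Answer the following question based on the given context.\n"
              f"Context: {context}\n"
              f"Question: {question}")
    llm = InferenceClient("mistralai/Mistral-7B-Instruct-v0.3", token="<access-token>")
    output = llm.chat_completion(messages=[{"role": "user", "content": prompt}], max_tokens=150)
    return output.choices[0].message.content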
Get Started
The following example demonstrates how to implement RAG with a retrieval system powered by Atlas Vector Search and open-source models from Hugging Face.
➤ Use the Select your language drop-down menu to set the language of the examples on this page.
Prerequisites
To complete this example, you must have the following:
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A Hugging Face Access Token with read access.
A terminal and code editor to run your Go project.
Go installed.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
Java Development Kit (JDK) version 8 or later.
An environment to set up and run your Java application. We recommend that you use an integrated development environment, such as IntelliJ IDEA or Eclipse IDE, to configure Maven or Gradle to build and run your project.
A Hugging Face Access Token with read access.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A Hugging Face Access Token with read access.
A terminal and code editor to run your Node.js project.
npm and Node.js installed.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A Hugging Face Access Token with read access.
An environment to run interactive Python notebooks, such as Colab.
Note
If you're using Colab, ensure that your notebook session's IP address is included in your Atlas project's access list.
Procedure
Set up the environment.
Initialize your Go project.
Run the following commands in your terminal to create a new directory named rag-mongodb and initialize your project:

mkdir rag-mongodb
cd rag-mongodb
go mod init rag-mongodb

Install and import dependencies.
Run the following command:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/mongo
go get github.com/tmc/langchaingo/llms
go get github.com/tmc/langchaingo/documentloaders
go get github.com/tmc/langchaingo/embeddings/huggingface
go get github.com/tmc/langchaingo/llms/huggingface
go get github.com/tmc/langchaingo/prompts

Create a .env file.
In your project, create a .env file to store your Atlas connection string and Hugging Face access token.

.env
HUGGINGFACEHUB_API_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Create a function to generate vector embeddings.
In this section, you create a function that does the following:
Loads the mxbai-embed-large-v1 embedding model from Hugging Face's model hub.
Creates vector embeddings from the inputted data.
Run the following command to create a directory that stores common functions, including one that you'll reuse to create embeddings:

mkdir common && cd common

Create a file named get-embeddings.go in the common directory, and paste the following code into it:

get-embeddings.go
package common

import (
    "context"
    "log"

    "github.com/tmc/langchaingo/embeddings/huggingface"
)

func GetEmbeddings(documents []string) [][]float32 {
    hf, err := huggingface.NewHuggingface(
        huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"),
        huggingface.WithTask("feature-extraction"))
    if err != nil {
        log.Fatalf("failed to connect to Hugging Face: %v", err)
    }
    embs, err := hf.EmbedDocuments(context.Background(), documents)
    if err != nil {
        log.Fatalf("failed to generate embeddings: %v", err)
    }
    return embs
}

Ingest data into Atlas.
In this section, you ingest sample data into Atlas that LLMs don't have access to. The following code uses the Go library for LangChain and the Go driver to do the following:
Create an HTML file that contains a MongoDB earnings report.
Split the data into chunks, specifying the chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks).
Create vector embeddings from the chunked data by using the GetEmbeddings function that you defined.
Store these embeddings alongside the chunked data in the rag_db.test collection in your Atlas cluster.
Navigate to the root of the rag-mongodb project directory.
Create a file named ingest-data.go in your project, and paste the following code into it:

ingest-data.go
package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
    "rag-mongodb/common" // Module that contains the embedding function

    "github.com/joho/godotenv"
    "github.com/tmc/langchaingo/documentloaders"
    "github.com/tmc/langchaingo/textsplitter"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

type DocumentToInsert struct {
    PageContent string    `bson:"pageContent"`
    Embedding   []float32 `bson:"embedding"`
}

func downloadReport(filename string) {
    _, err := os.Stat(filename)
    if err == nil {
        return
    }

    url := "https://investors.mongodb.com/node/12236"
    fmt.Println("Downloading ", url, " to ", filename)

    resp, err := http.Get(url)
    if err != nil {
        log.Fatalf("failed to connect to download the report: %v", err)
    }
    defer func() { _ = resp.Body.Close() }()

    f, err := os.Create(filename)
    if err != nil {
        return
    }
    defer func() { _ = f.Close() }()

    _, err = io.Copy(f, resp.Body)
    if err != nil {
        log.Fatalf("failed to copy the report: %v", err)
    }
}

func main() {
    ctx := context.Background()

    filename := "investor-report.html"
    downloadReport(filename)

    f, err := os.Open(filename)
    if err != nil {
        defer func() { _ = f.Close() }()
        log.Fatalf("failed to open the report: %v", err)
    }
    defer func() { _ = f.Close() }()

    html := documentloaders.NewHTML(f)
    split := textsplitter.NewRecursiveCharacter()
    split.ChunkSize = 400
    split.ChunkOverlap = 20
    docs, err := html.LoadAndSplit(context.Background(), split)
    if err != nil {
        log.Fatalf("failed to chunk the HTML into documents: %v", err)
    }
    fmt.Printf("Successfully chunked the HTML into %v documents.\n", len(docs))

    if err := godotenv.Load(); err != nil {
        log.Fatal("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("rag_db").Collection("test")

    fmt.Println("Generating embeddings.")
    var pageContents []string
    for i := range docs {
        pageContents = append(pageContents, docs[i].PageContent)
    }
    embeddings := common.GetEmbeddings(pageContents)

    docsToInsert := make([]interface{}, len(embeddings))
    for i := range embeddings {
        docsToInsert[i] = DocumentToInsert{
            PageContent: pageContents[i],
            Embedding:   embeddings[i],
        }
    }

    result, err := coll.InsertMany(ctx, docsToInsert)
    if err != nil {
        log.Fatalf("failed to insert documents: %v", err)
    }
    fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}

Run the following command to execute the code:
go run ingest-data.go
Successfully chunked the HTML into 163 documents.
Generating embeddings.
Successfully inserted 163 documents into Atlas
Use Atlas Vector Search to retrieve documents.
In this section, you set up Atlas Vector Search to retrieve documents from your vector database by completing the following steps:
Create an Atlas Vector Search index on your vector embeddings.
Create a new file named rag-vector-index.go and paste the following code. This code connects to your Atlas cluster and creates an index of type vectorSearch on the rag_db.test collection.

rag-vector-index.go
package main

import (
    "context"
    "log"
    "os"
    "time"

    "go.mongodb.org/mongo-driver/bson"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    if err := godotenv.Load(); err != nil {
        log.Fatal("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Specify the database and collection
    coll := client.Database("rag_db").Collection("test")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")

    type vectorDefinitionField struct {
        Type          string `bson:"type"`
        Path          string `bson:"path"`
        NumDimensions int    `bson:"numDimensions"`
        Similarity    string `bson:"similarity"`
    }

    type filterField struct {
        Type string `bson:"type"`
        Path string `bson:"path"`
    }

    type vectorDefinition struct {
        Fields []vectorDefinitionField `bson:"fields"`
    }

    indexModel := mongo.SearchIndexModel{
        Definition: vectorDefinition{
            Fields: []vectorDefinitionField{{
                Type:          "vector",
                Path:          "embedding",
                NumDimensions: 1024,
                Similarity:    "cosine"}},
        },
        Options: opts,
    }

    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
        log.Fatalf("failed to create the search index: %v", err)
    }

    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    log.Println("NOTE: This may take up to a minute.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
        cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
        if err != nil {
            log.Printf("failed to list search indexes: %v", err)
        }

        if !cursor.Next(ctx) {
            break
        }

        name := cursor.Current.Lookup("name").StringValue()
        queryable := cursor.Current.Lookup("queryable").Boolean()
        if name == searchIndexName && queryable {
            doc = cursor.Current
        } else {
            time.Sleep(5 * time.Second)
        }
    }

    log.Println("Name of Index Created: " + searchIndexName)
}

Run the following command to create the index:
go run rag-vector-index.go

Define a function to retrieve relevant data.
In this step, you create a retrieval function called GetQueryResults that runs a query to retrieve relevant documents. It uses the GetEmbeddings function to create embeddings from the search query. Then, it runs the query to return semantically similar documents.
To learn more, see Run Vector Search Queries.
In the common directory, create a new file called get-query-results.go, and paste the following code into it:

get-query-results.go
package common

import (
    "context"
    "log"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

type TextWithScore struct {
    PageContent string  `bson:"pageContent"`
    Score       float64 `bson:"score"`
}

func GetQueryResults(query string) []TextWithScore {
    ctx := context.Background()

    if err := godotenv.Load(); err != nil {
        log.Fatal("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Specify the database and collection
    coll := client.Database("rag_db").Collection("test")

    queryEmbedding := GetEmbeddings([]string{query})

    vectorSearchStage := bson.D{
        {"$vectorSearch", bson.D{
            {"index", "vector_index"},
            {"path", "embedding"},
            {"queryVector", queryEmbedding[0]},
            {"exact", true},
            {"limit", 5},
        }}}

    projectStage := bson.D{
        {"$project", bson.D{
            {"_id", 0},
            {"pageContent", 1},
            {"score", bson.D{{"$meta", "vectorSearchScore"}}},
        }}}

    cursor, err := coll.Aggregate(ctx, mongo.Pipeline{vectorSearchStage, projectStage})
    if err != nil {
        log.Fatalf("failed to execute the aggregation pipeline: %v", err)
    }

    var results []TextWithScore
    if err = cursor.All(context.TODO(), &results); err != nil {
        log.Fatalf("failed to unmarshal retrieved documents: %v", err)
    }

    return results
}

Test retrieving the data.
In the rag-mongodb project directory, create a new file called retrieve-documents-test.go. In this step, you check that the function you just defined returns relevant results.
Paste this code into your file:

retrieve-documents-test.go
package main

import (
    "fmt"
    "rag-mongodb/common" // Module that contains the GetQueryResults function
)

func main() {
    query := "AI Technology"
    documents := common.GetQueryResults(query)
    for _, doc := range documents {
        fmt.Printf("Text: %s \nScore: %v \n\n", doc.PageContent, doc.Score)
    }
}

Run the following command to execute the code:
go run retrieve-documents-test.go
Text: for the variety and scale of data required by AI-powered applications. We are confident MongoDB will be a substantial beneficiary of this next wave of application development."
Score: 0.835033655166626

Text: "As we look ahead, we continue to be incredibly excited by our large market opportunity, the potential to increase share, and become a standard within more of our customers. We also see a tremendous opportunity to win more legacy workloads, as AI has now become a catalyst to modernize these applications. MongoDB's document-based architecture is particularly well-suited for the variety and
Score: 0.8280757665634155

Text: to the use of new and evolving technologies, such as artificial intelligence, in our offerings or partnerships; the growth and expansion of the market for database products and our ability to penetrate that market; our ability to integrate acquired businesses and technologies successfully or achieve the expected benefits of such acquisitions; our ability to maintain the security of our software
Score: 0.8165900111198425

Text: MongoDB continues to expand its AI ecosystem with the announcement of the MongoDB AI Applications Program (MAAP), which provides customers with reference architectures, pre-built partner integrations, and professional services to help them quickly build AI-powered applications. Accenture will establish a center of excellence focused on MongoDB projects, and is the first global systems
Score: 0.8023912906646729

Text: Bendigo and Adelaide Bank partnered with MongoDB to modernize their core banking technology. With the help of MongoDB Relational Migrator and generative AI-powered modernization tools, Bendigo and Adelaide Bank decomposed an outdated consumer-servicing application into microservices and migrated off its underlying legacy relational database technology significantly faster and more easily than
Score: 0.7959681749343872
Generate responses with the LLM.
In this section, you generate responses by prompting an LLM to use the retrieved documents as context. This example uses the function you just defined to retrieve matching documents from the database, and additionally:
Accesses the Mistral 7B Instruct model from Hugging Face's model hub.
Instructs the LLM to include the user's question and the retrieved documents in the prompt.
Prompts the LLM about MongoDB's latest AI announcements.
Create a new file called generate-responses.go, and paste the following code into it:

generate-responses.go
package main

import (
    "context"
    "fmt"
    "log"
    "rag-mongodb/common" // Module that contains the GetQueryResults function
    "strings"

    "github.com/tmc/langchaingo/llms"
    "github.com/tmc/langchaingo/llms/huggingface"
    "github.com/tmc/langchaingo/prompts"
)

func main() {
    ctx := context.Background()

    query := "AI Technology"
    documents := common.GetQueryResults(query)

    var textDocuments strings.Builder
    for _, doc := range documents {
        textDocuments.WriteString(doc.PageContent)
    }

    question := "In a few sentences, what are MongoDB's latest AI announcements?"
    template := prompts.NewPromptTemplate(
        `Answer the following question based on the given context.

        Question: {{.question}}
        Context: {{.context}}`,
        []string{"question", "context"},
    )
    prompt, err := template.Format(map[string]any{
        "question": question,
        "context":  textDocuments.String(),
    })
    if err != nil {
        log.Fatalf("failed to format the prompt: %v", err)
    }

    opts := llms.CallOptions{
        Model:       "mistralai/Mistral-7B-Instruct-v0.3",
        MaxTokens:   150,
        Temperature: 0.1,
    }
    llm, err := huggingface.New(huggingface.WithModel("mistralai/Mistral-7B-Instruct-v0.3"))
    if err != nil {
        log.Fatalf("failed to initialize a Hugging Face LLM: %v", err)
    }
    completion, err := llms.GenerateFromSinglePrompt(ctx, llm, prompt, llms.WithOptions(opts))
    if err != nil {
        log.Fatalf("failed to generate a response from the prompt: %v", err)
    }

    response := strings.Split(completion, "\n\n")
    if len(response) == 2 {
        fmt.Printf("Prompt: %v\n\n", response[0])
        fmt.Printf("Response: %v\n", response[1])
    }
}

Run this command to execute the code. The generated response might vary.
go run generate-responses.go

Prompt: Answer the following question based on the given context.

Question: In a few sentences, what are MongoDB's latest AI announcements?

Context: for the variety and scale of data required by AI-powered applications. We are confident MongoDB will be a substantial beneficiary of this next wave of application development.""As we look ahead, we continue to be incredibly excited by our large market opportunity, the potential to increase share, and become a standard within more of our customers. We also see a tremendous opportunity to win more legacy workloads, as AI has now become a catalyst to modernize these applications. MongoDB's document-based architecture is particularly well-suited for the variety andto the use of new and evolving technologies, such as artificial intelligence, in our offerings or partnerships; the growth and expansion of the market for database products and our ability to penetrate that market; our ability to integrate acquired businesses and technologies successfully or achieve the expected benefits of such acquisitions; our ability to maintain the security of our softwareMongoDB continues to expand its AI ecosystem with the announcement of the MongoDB AI Applications Program (MAAP), which provides customers with reference architectures, pre-built partner integrations, and professional services to help them quickly build AI-powered applications. Accenture will establish a center of excellence focused on MongoDB projects, and is the first global systemsBendigo and Adelaide Bank partnered with MongoDB to modernize their core banking technology. With the help of MongoDB Relational Migrator and generative AI-powered modernization tools, Bendigo and Adelaide Bank decomposed an outdated consumer-servicing application into microservices and migrated off its underlying legacy relational database technology significantly faster and more easily than expected.

Response: MongoDB's latest AI announcements include the launch of the MongoDB AI Applications Program (MAAP) and a partnership with Accenture to establish a center of excellence focused on MongoDB projects. Additionally, Bendigo and Adelaide Bank have partnered with MongoDB to modernize their core banking technology using MongoDB's AI-powered modernization tools.
Create your Java project and install dependencies.
In your IDE, create a Java project using Maven or Gradle.
Depending on your package manager, add the following dependencies:
If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file, and add the Bill of Materials (BOM) to the dependencyManagement array:

pom.xml
<dependencies>
   <!-- MongoDB Java Sync Driver v5.2.0 or later -->
   <dependency>
      <groupId>org.mongodb</groupId>
      <artifactId>mongodb-driver-sync</artifactId>
      <version>[5.2.0,)</version>
   </dependency>
   <!-- Java library for Hugging Face models -->
   <dependency>
      <groupId>dev.langchain4j</groupId>
      <artifactId>langchain4j-hugging-face</artifactId>
   </dependency>
   <!-- Java library for URL Document Loader -->
   <dependency>
      <groupId>dev.langchain4j</groupId>
      <artifactId>langchain4j</artifactId>
   </dependency>
   <!-- Java library for Apache PDFBox Document Parser -->
   <dependency>
      <groupId>dev.langchain4j</groupId>
      <artifactId>langchain4j-document-parser-apache-pdfbox</artifactId>
   </dependency>
</dependencies>
<dependencyManagement>
   <dependencies>
      <!-- Bill of Materials (BOM) to manage Java library versions -->
      <dependency>
         <groupId>dev.langchain4j</groupId>
         <artifactId>langchain4j-bom</artifactId>
         <version>0.36.2</version>
         <type>pom</type>
         <scope>import</scope>
      </dependency>
   </dependencies>
</dependencyManagement>

If you are using Gradle, add the following Bill of Materials (BOM) and dependencies to the dependencies array in your project's build.gradle file:

build.gradle
dependencies {
   // Bill of Materials (BOM) to manage Java library versions
   implementation platform('dev.langchain4j:langchain4j-bom:0.36.2')
   // MongoDB Java Sync Driver v5.2.0 or later
   implementation 'org.mongodb:mongodb-driver-sync:5.2.0'
   // Java library for Hugging Face models
   implementation 'dev.langchain4j:langchain4j-hugging-face'
   // Java library for URL Document Loader
   implementation 'dev.langchain4j:langchain4j'
   // Java library for Apache PDFBox Document Parser
   implementation 'dev.langchain4j:langchain4j-document-parser-apache-pdfbox'
}

Run your package manager to install the dependencies to your project.
Set your environment variables.
Note
This example sets the variables for the project in your IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add the variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.

HUGGING_FACE_ACCESS_TOKEN=<access-token>
ATLAS_CONNECTION_STRING=<connection-string>

Update the placeholders with the following values:
Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a method to parse and split the data.
Create a file named PDFProcessor.java and paste the following code.
This code defines the following methods:
The parsePDFDocument method uses the Apache PDFBox library and the LangChain4j URL Document Loader to load and parse a PDF file at a given URL. The method returns the parsed PDF as a langchain4j Document.
The splitDocument method splits a given langchain4j Document into chunks according to the specified chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks). The method returns a list of text segments.
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.loader.UrlDocumentLoader;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
import dev.langchain4j.data.document.splitter.DocumentByCharacterSplitter;
import dev.langchain4j.data.segment.TextSegment;

import java.util.List;

public class PDFProcessor {

    /** Parses a PDF document from the specified URL, and returns a
     * langchain4j Document object.
     */
    public static Document parsePDFDocument(String url) {
        DocumentParser parser = new ApachePdfBoxDocumentParser();
        return UrlDocumentLoader.load(url, parser);
    }

    /** Splits a parsed langchain4j Document based on the specified chunking
     * parameters, and returns an array of text segments.
     */
    public static List<TextSegment> splitDocument(Document document) {
        int maxChunkSize = 400; // number of characters
        int maxChunkOverlap = 20; // number of overlapping characters between consecutive chunks
        DocumentSplitter splitter = new DocumentByCharacterSplitter(maxChunkSize, maxChunkOverlap);
        return splitter.split(document);
    }
}
Define a method to generate vector embeddings.
Create a file named EmbeddingProvider.java and paste the following code.
This code defines two methods to generate embeddings for a given input by using the mxbai-embed-large-v1 open-source embedding model:
Multiple inputs: The getEmbeddings method accepts an array of text segment inputs (List<TextSegment>), allowing you to create multiple embeddings in a single API call. The method converts the arrays of floats provided by the API to BSON arrays of doubles for storing in your Atlas cluster.
Single input: The getEmbedding method accepts a single String, which represents a query you want to make against your vector data. The method converts the array of floats provided by the API to a BSON array of doubles to use when querying your collection.
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.huggingface.HuggingFaceChatModel;
import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;

import java.util.List;

import static java.time.Duration.ofSeconds;

public class EmbeddingProvider {

    private static HuggingFaceEmbeddingModel embeddingModel;

    private static HuggingFaceEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN");
            if (accessToken == null || accessToken.isEmpty()) {
                throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty.");
            }
            embeddingModel = HuggingFaceEmbeddingModel.builder()
                    .accessToken(accessToken)
                    .modelId("mixedbread-ai/mxbai-embed-large-v1")
                    .waitForModel(true)
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }

    /**
     * Returns the Hugging Face chat model interface used by the createPrompt() method
     * to process queries and generate responses.
     */
    private static HuggingFaceChatModel chatModel;

    public static HuggingFaceChatModel getChatModel() {
        String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN");
        if (accessToken == null || accessToken.isEmpty()) {
            throw new IllegalStateException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty.");
        }
        if (chatModel == null) {
            chatModel = HuggingFaceChatModel.builder()
                    .timeout(ofSeconds(25))
                    .modelId("mistralai/Mistral-7B-Instruct-v0.3")
                    .temperature(0.1)
                    .maxNewTokens(150)
                    .accessToken(accessToken)
                    .waitForModel(true)
                    .build();
        }
        return chatModel;
    }

    /**
     * Takes an array of text segments and returns a BSON array of embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<TextSegment> texts) {
        List<TextSegment> textSegments = texts.stream()
                .toList();

        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }

    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public static BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}
Define a method to ingest data into Atlas.
Create a file named DataIngest.java and paste the following code.
This code uses the LangChain4j library and the MongoDB Java Sync Driver to ingest sample data into Atlas that LLMs don't have access to.
Specifically, this code does the following:
Connects to your Atlas cluster.
Loads and parses the MongoDB earnings report PDF file from a URL by using the parsePDFDocument method that you previously defined.
Splits the data into chunks by using the splitDocument method that you previously defined.
Creates vector embeddings from the chunked data by using the getEmbeddings method that you previously defined.
Stores the embeddings alongside the chunked data in the rag_db.test collection in your Atlas cluster.

DataIngest.java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.result.InsertManyResult;
import dev.langchain4j.data.segment.TextSegment;
import org.bson.BsonArray;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class DataIngest {

    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("rag_db");
            MongoCollection<Document> collection = database.getCollection("test");

            // parse the PDF file at the specified URL
            String url = "https://investors.mongodb.com/node/12236/pdf";
            String fileName = "mongodb_annual_report.pdf";
            System.out.println("Parsing the [" + fileName + "] file from url: " + url);
            dev.langchain4j.data.document.Document parsedDoc = PDFProcessor.parsePDFDocument(url);

            // split (or "chunk") the parsed document into text segments
            List<TextSegment> segments = PDFProcessor.splitDocument(parsedDoc);
            System.out.println(segments.size() + " text segments created successfully.");

            // create vector embeddings from the chunked data (i.e. text segments)
            System.out.println("Creating vector embeddings from the parsed data segments. This may take a few moments.");
            List<Document> documents = embedText(segments);

            // insert the embeddings into the Atlas collection
            try {
                System.out.println("Ingesting data into the " + collection.getNamespace() + " collection.");
                insertDocuments(documents, collection);
            } catch (MongoException me) {
                throw new RuntimeException("Failed to insert documents", me);
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }

    /**
     * Embeds text segments into vector embeddings using the EmbeddingProvider
     * class and returns a list of BSON documents containing the text and
     * generated embeddings.
     */
    private static List<Document> embedText(List<TextSegment> segments) {
        EmbeddingProvider embeddingProvider = new EmbeddingProvider();
        List<BsonArray> embeddings = embeddingProvider.getEmbeddings(segments);

        List<Document> documents = new ArrayList<>();
        int i = 0;
        for (TextSegment segment : segments) {
            Document doc = new Document("text", segment.text()).append("embedding", embeddings.get(i));
            documents.add(doc);
            i++;
        }
        return documents;
    }

    /**
     * Inserts a list of BSON documents into the specified MongoDB collection.
     */
    private static void insertDocuments(List<Document> documents, MongoCollection<Document> collection) {
        List<String> insertedIds = new ArrayList<>();
        InsertManyResult result = collection.insertMany(documents);
        result.getInsertedIds().values()
                .forEach(doc -> insertedIds.add(doc.toString()));
        System.out.println(insertedIds.size() + " documents inserted into the " + collection.getNamespace() + " collection successfully.");
    }
}

Generate embeddings.
Note
503 errors when calling Hugging Face models
You might occasionally encounter 503 errors when calling models from Hugging Face's model hub. To resolve this issue, retry after a short wait.
Save and run the DataIngest.java file. The output resembles the following:
Parsing the [mongodb_annual_report.pdf] file from url: https://investors.mongodb.com/node/12236/pdf
72 text segments created successfully.
Creating vector embeddings from the parsed data segments. This may take a few moments...
Ingesting data into the rag_db.test collection.
72 documents inserted into the rag_db.test collection successfully.
Use Atlas Vector Search to retrieve documents.
In this section, you set up Atlas Vector Search to retrieve documents from your vector database.
Create a file named VectorIndex.java and paste the following code.
This code creates an Atlas Vector Search index on your collection, using an index definition that indexes the embedding field as 1024-dimensional vectors with cosine similarity:
VectorIndex.java
import com.mongodb.MongoException;
import com.mongodb.client.ListSearchIndexesIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Collections;
import java.util.List;

public class VectorIndex {

    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("rag_db");
            MongoCollection<Document> collection = database.getCollection("test");

            // define the index details for the index model
            String indexName = "vector_index";
            Bson definition = new Document(
                    "fields",
                    Collections.singletonList(
                            new Document("type", "vector")
                                    .append("path", "embedding")
                                    .append("numDimensions", 1024)
                                    .append("similarity", "cosine")));
            SearchIndexModel indexModel = new SearchIndexModel(
                    indexName,
                    definition,
                    SearchIndexType.vectorSearch());

            // create the index using the defined model
            try {
                List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
                System.out.println("Successfully created vector index named: " + result);
                System.out.println("It may take up to a minute for the index to build before you can query using it.");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }

            // wait for Atlas to build the index and make it queryable
            System.out.println("Polling to confirm the index has completed building.");
            waitForIndexReady(collection, indexName);

        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }

    /**
     * Polls the collection to check whether the specified index is ready to query.
     */
    public static void waitForIndexReady(MongoCollection<Document> collection, String indexName) throws InterruptedException {
        ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();

        while (true) {
            try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
                if (!cursor.hasNext()) {
                    break;
                }
                Document current = cursor.next();
                String name = current.getString("name");
                boolean queryable = current.getBoolean("queryable");
                if (name.equals(indexName) && queryable) {
                    System.out.println(indexName + " index is ready to query");
                    return;
                } else {
                    Thread.sleep(500);
                }
            }
        }
    }
}

Create the Atlas Vector Search index.
Save and run the file. The output resembles the following:
Successfully created vector index named: [vector_index]
It may take up to a minute for the index to build before you can query using it.
Polling to confirm the index has completed building.
vector_index index is ready to query
Create code to generate responses with the LLM.
In this section, you generate responses by prompting an LLM to use the retrieved documents as context.
Create a new file named LLMPrompt.java, and paste the following code into it.
This code does the following:
Queries the rag_db.test collection for any matching documents by using the retrieveDocuments method. This method uses the getEmbedding method that you created earlier to generate an embedding from the search query, then runs the query to return semantically similar documents. To learn more, see Run Vector Search Queries.
Accesses the Mistral 7B Instruct model from Hugging Face's model hub, and creates a templated prompt by using the createPrompt method. The method instructs the LLM to include the user's question and the retrieved documents in the defined prompt.
Prompts the LLM about MongoDB's latest AI announcements, then returns the generated response.
LLMPrompt.java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.search.FieldSearchPath;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.huggingface.HuggingFaceChatModel;
import dev.langchain4j.model.input.Prompt;
import dev.langchain4j.model.input.PromptTemplate;
import org.bson.BsonArray;
import org.bson.BsonValue;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
import static java.util.Arrays.asList;

public class LLMPrompt {

    // User input: the question to answer
    static String question = "In a few sentences, what are MongoDB's latest AI announcements?";

    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("rag_db");
            MongoCollection<Document> collection = database.getCollection("test");

            // generate a response to the user question
            try {
                createPrompt(question, collection);
            } catch (Exception e) {
                throw new RuntimeException("An error occurred while generating the response: ", e);
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }

    /**
     * Returns a list of documents from the specified MongoDB collection that
     * match the user's question.
     * NOTE: Update or omit the projection stage to change the desired fields in the response
     */
    public static List<Document> retrieveDocuments(String question, MongoCollection<Document> collection) {
        try {
            // generate the query embedding to use in the vector search
            BsonArray queryEmbeddingBsonArray = EmbeddingProvider.getEmbedding(question);
            List<Double> queryEmbedding = new ArrayList<>();
            for (BsonValue value : queryEmbeddingBsonArray.stream().toList()) {
                queryEmbedding.add(value.asDouble().getValue());
            }

            // define the pipeline stages for the vector search index
            String indexName = "vector_index";
            FieldSearchPath fieldSearchPath = fieldPath("embedding");
            int limit = 5;
            List<Bson> pipeline = asList(
                    vectorSearch(
                            fieldSearchPath,
                            queryEmbedding,
                            indexName,
                            limit,
                            exactVectorSearchOptions()),
                    project(
                            fields(
                                    exclude("_id"),
                                    include("text"),
                                    metaVectorSearchScore("score"))));

            // run the query and return the matching documents
            List<Document> matchingDocuments = new ArrayList<>();
            collection.aggregate(pipeline).forEach(matchingDocuments::add);
            return matchingDocuments;
        } catch (Exception e) {
            System.err.println("Error occurred while retrieving documents: " + e.getMessage());
            return new ArrayList<>();
        }
    }

    /**
     * Creates a templated prompt from a submitted question string and any retrieved documents,
     * then generates a response using the Hugging Face chat model.
     */
    public static void createPrompt(String question, MongoCollection<Document> collection) {
        // retrieve documents matching the user's question
        List<Document> retrievedDocuments = retrieveDocuments(question, collection);
        if (retrievedDocuments.isEmpty()) {
            System.out.println("No relevant documents found. Unable to generate a response.");
            return;
        } else
            System.out.println("Generating a response from the retrieved documents. This may take a few moments.");

        // define a prompt template
        HuggingFaceChatModel huggingFaceChatModel = EmbeddingProvider.getChatModel();
        PromptTemplate promptBuilder = PromptTemplate.from("""
                Answer the following question based on the given context:

                Question: {{question}}
                Context: {{information}}
                -------
                """);

        // build the information string from the retrieved documents
        StringBuilder informationBuilder = new StringBuilder();
        for (Document doc : retrievedDocuments) {
            String text = doc.getString("text");
            informationBuilder.append(text).append("\n");
        }
        Map<String, Object> variables = new HashMap<>();
        variables.put("question", question);
        variables.put("information", informationBuilder);

        // generate and output the response from the chat model
        Prompt prompt = promptBuilder.apply(variables);
        AiMessage response = huggingFaceChatModel.generate(prompt.toUserMessage()).content();

        // extract the generated text to output a formatted response
        String responseText = response.text();
        String marker = "-------";
        int markerIndex = responseText.indexOf(marker);
        String generatedResponse;
        if (markerIndex != -1) {
            generatedResponse = responseText.substring(markerIndex + marker.length()).trim();
        } else {
            generatedResponse = responseText; // else fallback to the full response
        }

        // output the question and formatted response
        System.out.println("Question:\n " + question);
        System.out.println("Response:\n " + generatedResponse);

        // output the filled-in prompt and context information for demonstration purposes
        System.out.println("\n" + "---- Prompt Sent to LLM ----");
        System.out.println(prompt.text() + "\n");
    }
}
Generate responses with the LLM.
Save and run the file. The output resembles the following, but note that the generated response might vary.
Generating a response from the retrieved documents. This may take a few moments.
Question:
 In a few sentences, what are MongoDB's latest AI announcements?
Response:
 MongoDB's latest AI announcements include the MongoDB AI Applications Program (MAAP), which provides customers with reference architectures, pre-built partner integrations, and professional services to help them quickly build AI-powered applications. Accenture will establish a center of excellence focused on MongoDB projects. These announcements highlight MongoDB's growing focus on AI application development and its potential to modernize legacy workloads.

---- Prompt Sent to LLM ----
Answer the following question based on the given context:

Question: In a few sentences, what are MongoDB's latest AI announcements?
Context: time data. MongoDB continues to expand its AI ecosystem with the announcement of the MongoDB AI Applications Program (MAAP), which provides customers with reference architectures, pre-built partner integrations, and professional services to help them quickly build AI-powered applications. Accenture will establish a center of excellence focused on MongoDB projects, and is the first global systems i ighlights MongoDB announced a number of new products and capabilities at MongoDB.local NYC. Highlights included the preview of MongoDB 8.0—with significant performance improvements such as faster reads and updates, along with significantly faster bulk inserts and time series queries—and the general availability of Atlas Stream Processing to build sophisticated, event-driven applications with real- ble future as well as the criticality of MongoDB to artificial intelligence application development. These forward-looking statements include, but are not limited to, plans, objectives, expectations and intentions and other statements contained in this press release that are not historical facts and statements identified by words such as "anticipate," "believe," "continue," "could," "estimate," "e ve Officer of MongoDB. "As we look ahead, we continue to be incredibly excited by our large market opportunity, the potential to increase share, and become a standard within more of our customers. We also see a tremendous opportunity to win more legacy workloads, as AI has now become a catalyst to modernize these applications. MongoDB's document-based architecture is particularly well-suited for t ictable, impact on its future GAAP financial results. Conference Call Information MongoDB will host a conference call today, May 30, 2024, at 5:00 p.m. (Eastern Time) to discuss its financial results and business outlook. A live webcast of the call will be available on the "Investor Relations" page of MongoDB's website at https://investors.mongodb.com. To access the call by phone, please go to thi
Set up the environment.
Initialize your Node.js project.
Run the following commands in your terminal to create a new directory named rag-mongodb and initialize your project:

mkdir rag-mongodb
cd rag-mongodb
npm init -y

Install and import dependencies.
Run the following command:

npm install mongodb langchain @langchain/community @xenova/transformers @huggingface/inference pdf-parse

Update your package.json file.
In your project's package.json file, specify the type field as shown in the following example, then save the file.

{
  "name": "rag-mongodb",
  "type": "module",
  ...

Create a .env file.
In your project, create a .env file to store your Atlas connection string and Hugging Face access token.

HUGGING_FACE_ACCESS_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <access-token> placeholder value with your Hugging Face access token, and replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Note
Minimum Node.js version requirement
Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.
Create a function to generate vector embeddings.
In this section, you create a function that does the following:
Loads the nomic-embed-text-v1 embedding model from Hugging Face's model hub.
Creates vector embeddings from the inputted data.
Create a file named get-embeddings.js in your project, and paste the following code into it:
import { pipeline } from '@xenova/transformers';

// Function to generate embeddings for a given data source
export async function getEmbeddings(data) {
    const embedder = await pipeline(
        'feature-extraction',
        'Xenova/nomic-embed-text-v1');
    const results = await embedder(data, { pooling: 'mean', normalize: true });
    return Array.from(results.data);
}
Ingest data into Atlas.
In this section, you ingest sample data into Atlas that LLMs don't have access to. The following code uses the LangChain integration and the Node.js driver to do the following:
Load a PDF that contains a MongoDB earnings report.
Split the data into chunks, specifying the chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks).
Create vector embeddings from the chunked data by using the getEmbeddings function that you defined.
Store these embeddings alongside the chunked data in the rag_db.test collection in your Atlas cluster.
Create a file named ingest-data.js in your project, and paste the following code into it:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MongoClient } from 'mongodb';
import { getEmbeddings } from './get-embeddings.js';
import * as fs from 'fs';

async function run() {
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

    try {
        // Save online PDF as a file
        const rawData = await fetch("https://investors.mongodb.com/node/12236/pdf");
        const pdfBuffer = await rawData.arrayBuffer();
        const pdfData = Buffer.from(pdfBuffer);
        fs.writeFileSync("investor-report.pdf", pdfData);

        const loader = new PDFLoader(`investor-report.pdf`);
        const data = await loader.load();

        // Chunk the text from the PDF
        const textSplitter = new RecursiveCharacterTextSplitter({
            chunkSize: 400,
            chunkOverlap: 20,
        });
        const docs = await textSplitter.splitDocuments(data);
        console.log(`Successfully chunked the PDF into ${docs.length} documents.`);

        // Connect to your Atlas cluster
        await client.connect();
        const db = client.db("rag_db");
        const collection = db.collection("test");

        console.log("Generating embeddings and inserting documents.");
        let docCount = 0;
        await Promise.all(docs.map(async doc => {
            const embeddings = await getEmbeddings(doc.pageContent);

            // Insert the embeddings and the chunked PDF data into Atlas
            await collection.insertOne({
                document: doc,
                embedding: embeddings,
            });
            docCount += 1;
        }))
        console.log(`Successfully inserted ${docCount} documents.`);
    } catch (err) {
        console.log(err.stack);
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
Then, run the following command to execute the code:
node --env-file=.env ingest-data.js
Tip
This code takes some time to run. You can view your inserted vector embeddings by navigating to the rag_db.test collection in the Atlas UI.
Use Atlas Vector Search to retrieve documents.
In this section, you set up Atlas Vector Search to retrieve documents from your vector database by completing the following steps:
Create an Atlas Vector Search index on your vector embeddings.
Create a new file named rag-vector-index.js and paste the following code. This code connects to your Atlas cluster and creates an index of type vectorSearch on the rag_db.test collection.

import { MongoClient } from 'mongodb';

// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

async function run() {
    try {
        const database = client.db("rag_db");
        const collection = database.collection("test");

        // Define your Atlas Vector Search index
        const index = {
            name: "vector_index",
            type: "vectorSearch",
            definition: {
                "fields": [
                    {
                        "type": "vector",
                        "numDimensions": 768,
                        "path": "embedding",
                        "similarity": "cosine"
                    }
                ]
            }
        }

        // Call the method to create the index
        const result = await collection.createSearchIndex(index);
        console.log(result);
    } finally {
        await client.close();
    }
}
run().catch(console.dir);

Then, run the following command to execute the code:

node --env-file=.env rag-vector-index.js

Define a function to retrieve relevant data.
Create a new file named retrieve-documents.js.
In this step, you create a retrieval function called getQueryResults that runs a query to retrieve relevant documents. It uses the getEmbeddings function to create embeddings from the search query. Then, it runs the query to return semantically similar documents.
To learn more, see Run Vector Search Queries.
Paste this code into your file:

import { MongoClient } from 'mongodb';
import { getEmbeddings } from './get-embeddings.js';

// Function to get the results of a vector query
export async function getQueryResults(query) {
    // Connect to your Atlas cluster
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

    try {
        // Get embeddings for a query
        const queryEmbeddings = await getEmbeddings(query);
        await client.connect();

        const db = client.db("rag_db");
        const collection = db.collection("test");

        const pipeline = [
            {
                $vectorSearch: {
                    index: "vector_index",
                    queryVector: queryEmbeddings,
                    path: "embedding",
                    exact: true,
                    limit: 5
                }
            },
            {
                $project: {
                    _id: 0,
                    document: 1,
                }
            }
        ];

        // Retrieve documents from Atlas using this Vector Search query
        const result = collection.aggregate(pipeline);

        const arrayOfQueryDocs = [];
        for await (const doc of result) {
            arrayOfQueryDocs.push(doc);
        }
        return arrayOfQueryDocs;
    } catch (err) {
        console.log(err.stack);
    } finally {
        await client.close();
    }
}

Test retrieving the data.
Create a new file named retrieve-documents-test.js. In this step, you check that the function you just defined returns relevant results.
Paste this code into your file:

import { getQueryResults } from './retrieve-documents.js';

async function run() {
    try {
        const query = "AI Technology";
        const documents = await getQueryResults(query);

        documents.forEach( doc => {
            console.log(doc);
        });
    } catch (err) {
        console.log(err.stack);
    }
}
run().catch(console.dir);

Then, run the following command to execute the code:
node --env-file=.env retrieve-documents-test.js

{
  document: {
    pageContent: 'MongoDB continues to expand its AI ecosystem with the announcement of the MongoDB AI Applications Program (MAAP),',
    metadata: { source: 'investor-report.pdf', pdf: [Object], loc: [Object] },
    id: null
  }
}
{
  document: {
    pageContent: 'artificial intelligence, in our offerings or partnerships; the growth and expansion of the market for database products and our ability to penetrate that\n' +
      'market; our ability to integrate acquired businesses and technologies successfully or achieve the expected benefits of such acquisitions; our ability to',
    metadata: { source: 'investor-report.pdf', pdf: [Object], loc: [Object] },
    id: null
  }
}
{
  document: {
    pageContent: 'more of our customers. We also see a tremendous opportunity to win more legacy workloads, as AI has now become a catalyst to modernize these\n' +
      "applications. MongoDB's document-based architecture is particularly well-suited for the variety and scale of data required by AI-powered applications. \n" +
      'We are confident MongoDB will be a substantial beneficiary of this next wave of application development."',
    metadata: { source: 'investor-report.pdf', pdf: [Object], loc: [Object] },
    id: null
  }
}
{
  document: {
    pageContent: 'which provides customers with reference architectures, pre-built partner integrations, and professional services to help\n' +
      'them quickly build AI-powered applications. Accenture will establish a center of excellence focused on MongoDB projects,\n' +
      'and is the first global systems integrator to join MAAP.',
    metadata: { source: 'investor-report.pdf', pdf: [Object], loc: [Object] },
    id: null
  }
}
{
  document: {
    pageContent: 'Bendigo and Adelaide Bank partnered with MongoDB to modernize their core banking technology. With the help of\n' +
      'MongoDB Relational Migrator and generative AI-powered modernization tools, Bendigo and Adelaide Bank decomposed an\n' +
      'outdated consumer-servicing application into microservices and migrated off its underlying legacy relational database',
    metadata: { source: 'investor-report.pdf', pdf: [Object], loc: [Object] },
    id: null
  }
}
Generate responses with the LLM.
In this section, you generate responses by prompting an LLM to use the retrieved documents as context. This example uses the function you just defined to retrieve matching documents from the database, and additionally:
Accesses the Mistral 7B Instruct model from Hugging Face's model hub.
Instructs the LLM to include the user's question and the retrieved documents in the prompt.
Prompts the LLM about MongoDB's latest AI announcements.
Create a new file called generate-responses.js, and paste the following code into it:
import { getQueryResults } from './retrieve-documents.js';
import { HfInference } from '@huggingface/inference'

async function run() {
    try {
        // Specify search query and retrieve relevant documents
        const query = "AI Technology";
        const documents = await getQueryResults(query);

        // Build a string representation of the retrieved documents to use in the prompt
        let textDocuments = "";
        documents.forEach(doc => {
            textDocuments += doc.document.pageContent;
        });
        const question = "In a few sentences, what are MongoDB's latest AI announcements?";

        // Create a prompt consisting of the question and context to pass to the LLM
        const prompt = `Answer the following question based on the given context.
            Question: {${question}}
            Context: {${textDocuments}}
        `;

        // Connect to Hugging Face, using the access token from the environment file
        const hf = new HfInference(process.env.HUGGING_FACE_ACCESS_TOKEN);
        const llm = hf.endpoint(
            "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3"
        );

        // Prompt the LLM to answer the question using the
        // retrieved documents as the context
        const output = await llm.chatCompletion({
            model: "mistralai/Mistral-7B-Instruct-v0.3",
            messages: [{ role: "user", content: prompt }],
            max_tokens: 150,
        });

        // Output the LLM's response as text.
        console.log(output.choices[0].message.content);
    } catch (err) {
        console.log(err.stack);
    }
}
run().catch(console.dir);
Then, run this command to execute the code. The generated response might vary.
node --env-file=.env generate-responses.js
MongoDB's latest AI announcements include the launch of the MongoDB AI Applications Program (MAAP), which provides customers with reference architectures, pre-built partner integrations, and professional services to help them build AI-powered applications quickly. Accenture has joined MAAP as the first global systems integrator, establishing a center of excellence focused on MongoDB projects. Additionally, Bendigo and Adelaide Bank have partnered with MongoDB to modernize their core banking technology using MongoDB's Relational Migrator and generative AI-powered modernization tools.
Ingest data into Atlas.
In this section, you ingest sample data into Atlas that LLMs don't have access to. Paste and run each of the following code snippets in your notebook:
Define a function to generate vector embeddings.
Run this code to create a function that generates vector embeddings by using an open-source embedding model. Specifically, this code does the following:
Loads the nomic-embed-text-v1 embedding model from Sentence Transformers.
Creates a function named get_embedding that uses the model to generate an embedding for a given text input.
from sentence_transformers import SentenceTransformer

# Load the embedding model (https://huggingface.co/nomic-ai/nomic-embed-text-v1)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Define a function to generate embeddings
def get_embedding(data):
    """Generates vector embeddings for the given data."""
    embedding = model.encode(data)
    return embedding.tolist()

Load and split the data.
Run this code to load and split sample data by using the LangChain integration. Specifically, this code does the following:
Loads a PDF that contains a MongoDB earnings report.
Splits the data into chunks, specifying the chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks).
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/12236/pdf")
data = loader.load()

# Split the data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=20)
documents = text_splitter.split_documents(data)

Convert the data to vector embeddings.
Run this code to prepare the chunked documents for ingestion by creating a list of documents with their corresponding vector embeddings. You generate these embeddings by using the get_embedding function that you just defined.

# Prepare documents for insertion
docs_to_insert = [{
    "text": doc.page_content,
    "embedding": get_embedding(doc.page_content)
} for doc in documents]

Store the data and embeddings in Atlas.
Run this code to insert the documents containing the embeddings into the rag_db.test collection in your Atlas cluster. Before you run the code, replace <connection-string> with your Atlas connection string.

from pymongo import MongoClient

# Connect to your Atlas cluster
client = MongoClient("<connection-string>")
collection = client["rag_db"]["test"]

# Insert documents into the collection
result = collection.insert_many(docs_to_insert)

Tip
After you run the code, you can view your vector embeddings in the Atlas UI by navigating to the rag_db.test collection in your cluster.
Use Atlas Vector Search to retrieve documents.
In this section, you create a retrieval system by using Atlas Vector Search to get relevant documents from your vector database. Paste and run each of the following code snippets in your notebook:
Create an Atlas Vector Search index on your vector embeddings.
Run the following code to create the index directly from your application by using the PyMongo driver. This code also includes a polling mechanism to check whether the index is ready to use.
To learn more, see How to Index Fields for Vector Search.
from pymongo.operations import SearchIndexModel
import time

# Create your index model, then create the search index
index_name = "vector_index"
search_index_model = SearchIndexModel(
    definition = {
        "fields": [
            {
                "type": "vector",
                "numDimensions": 768,
                "path": "embedding",
                "similarity": "cosine"
            }
        ]
    },
    name = index_name,
    type = "vectorSearch"
)
collection.create_search_index(model=search_index_model)

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
predicate = None
if predicate is None:
    predicate = lambda index: index.get("queryable") is True

while True:
    indices = list(collection.list_search_indexes(index_name))
    if len(indices) and predicate(indices[0]):
        break
    time.sleep(5)
print(index_name + " is ready for querying.")

Define a function to run vector search queries.
Run this code to create a retrieval function called get_query_results that runs a basic vector search query. It uses the get_embedding function to create embeddings from the search query. Then, it runs the query to return semantically similar documents.
To learn more, see Run Vector Search Queries.
# Define a function to run vector search queries
def get_query_results(query):
    """Gets results from a vector search query."""
    query_embedding = get_embedding(query)
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "exact": True,
                "limit": 5
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1
            }
        }
    ]

    results = collection.aggregate(pipeline)

    array_of_results = []
    for doc in results:
        array_of_results.append(doc)
    return array_of_results

# Test the function with a sample query
import pprint
pprint.pprint(get_query_results("AI technology"))

[{'text': 'more of our customers. We also see a tremendous opportunity to win '
          'more legacy workloads, as AI has now become a catalyst to modernize '
          "these\napplications. MongoDB's document-based architecture is "
          'particularly well-suited for the variety and scale of data required '
          'by AI-powered applications.'},
 {'text': 'artificial intelligence, in our offerings or partnerships; the '
          'growth and expansion of the market for database products and our '
          'ability to penetrate that\nmarket; our ability to integrate '
          'acquired businesses and technologies successfully or achieve the '
          'expected benefits of such acquisitions; our ability to'},
 {'text': 'MongoDB continues to expand its AI ecosystem with the announcement '
          'of the MongoDB AI Applications Program (MAAP),'},
 {'text': 'which provides customers with reference architectures, pre-built '
          'partner integrations, and professional services to help\nthem '
          'quickly build AI-powered applications. Accenture will establish a '
          'center of excellence focused on MongoDB projects,\nand is the '
          'first global systems integrator to join MAAP.'},
 {'text': 'Bendigo and Adelaide Bank partnered with MongoDB to modernize '
          'their core banking technology. With the help of\nMongoDB '
          'Relational Migrator and generative AI-powered modernization '
          'tools, Bendigo and Adelaide Bank decomposed an\noutdated '
          'consumer-servicing application into microservices and migrated '
          'off its underlying legacy relational database'}]
Generate responses with the LLM.
In this section, you generate responses by prompting an LLM to use the retrieved documents as context.
Replace <token> in the following code with your Hugging Face access token, then run the code in your notebook. This code does the following:
Uses the get_query_results function that you defined to retrieve relevant documents from Atlas.
Creates a prompt using the user's question and the retrieved documents as context.
Accesses the Mistral 7B Instruct model from Hugging Face's model hub.
Prompts the LLM about MongoDB's latest AI announcements. The generated response might vary.
import os
from huggingface_hub import InferenceClient

# Specify search query, retrieve relevant documents, and convert to string
query = "What are MongoDB's latest AI announcements?"
context_docs = get_query_results(query)
context_string = " ".join([doc["text"] for doc in context_docs])

# Construct prompt for the LLM using the retrieved documents as the context
prompt = f"""Use the following pieces of context to answer the question at the end.
    {context_string}
    Question: {query}
"""

# Authenticate to Hugging Face and access the model
os.environ["HF_TOKEN"] = "<token>"
llm = InferenceClient(
    "mistralai/Mistral-7B-Instruct-v0.3",
    token = os.getenv("HF_TOKEN"))

# Prompt the LLM (this code varies depending on the model you use)
output = llm.chat_completion(
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150
)
print(output.choices[0].message.content)
MongoDB's latest AI announcements include the MongoDB AI Applications Program (MAAP), a program designed to help customers build AI-powered applications more efficiently. Additionally, they have announced significant performance improvements in MongoDB 8.0, featuring faster reads, updates, bulk inserts, and time series queries. Another announcement is the general availability of Atlas Stream Processing to build sophisticated, event-driven applications with real-time data.
Next Steps
For more detailed RAG tutorials, see the following resources:
To learn how to implement RAG with popular LLM frameworks and AI services, see Integrate Vector Search with AI Technologies.
To learn how to implement RAG with a local Atlas deployment and local models, see Build a Local RAG Implementation with Atlas Vector Search.
For use-case-based tutorials and interactive Python notebooks, see the Generative AI Use Cases Repository.
To start building production-ready chatbots with Atlas Vector Search, you can use the MongoDB Chatbot Framework. This framework provides a set of libraries that enable you to quickly build AI chatbot applications.
Fine-Tuning
To optimize and fine-tune your RAG applications, see How to Measure the Accuracy of Your Query Results and Improve Vector Search Performance.
You can also try using different embedding models, chunking strategies, and LLMs. To learn more, see the following resources:
Additionally, Atlas Vector Search supports advanced retrieval systems. Because you can seamlessly index vector data along with your other data in Atlas, you can fine-tune your retrieval results by pre-filtering on other fields in your collection, or by performing hybrid search to combine semantic search with full-text search results.
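As an illustration of pre-filtering, the following sketch (in Python, matching the notebook examples above) assumes your chunked documents also carry a hypothetical source metadata field. It indexes that field as a filter type alongside the vector field, then restricts the vector search to documents whose source matches:

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("<connection-string>")
collection = client["rag_db"]["test"]

# Index the hypothetical "source" field as a filter field next to the vector field
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {"type": "vector", "path": "embedding", "numDimensions": 768, "similarity": "cosine"},
            {"type": "filter", "path": "source"},
        ]
    },
    name="filtered_vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

# Pre-filtered vector search: only documents from the matching source are considered
pipeline = [
    {
        "$vectorSearch": {
            "index": "filtered_vector_index",
            "queryVector": get_embedding("AI technology"),  # reuses the notebook's function
            "path": "embedding",
            "filter": {"source": "investor-report.pdf"},
            "numCandidates": 100,
            "limit": 5,
        }
    },
    {"$project": {"_id": 0, "text": 1}},
]
results = list(collection.aggregate(pipeline))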