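How to Create Vector Embeddings
You can store vector embeddings alongside your other data in Atlas. These embeddings capture meaningful relationships in your data, which lets you perform semantic search and implement RAG with Atlas Vector Search.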
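To make the idea of "meaningful relationships" concrete, the following minimal Python sketch compares vectors by cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors are made-up values for illustration only. Real embedding models return hundreds or thousands of dimensions, and Atlas Vector Search computes similarity for you at query time.

```python
import math

def cosine_similarity(a, b):
    """Returns the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (not produced by a real model)
ship = [0.9, 0.1, 0.0]
boat = [0.8, 0.2, 0.1]
lion = [0.0, 0.2, 0.9]

# Related concepts should score higher than unrelated ones
print(cosine_similarity(ship, boat) > cosine_similarity(ship, lion))  # prints True
```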
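Get Started
Use the following tutorial to learn how to create vector embeddings and query them by using Atlas Vector Search.
For production applications, you typically write a script to generate vector embeddings. You can start with the sample code on this page and customize it for your use case.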
Prerequisites
➤ Use the Select your language drop-down menu to set the language of the examples on this page.
To complete this tutorial, you must have the following:
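An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Go project.
Go installed.
A Hugging Face access token or an OpenAI API key.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
Java Development Kit (JDK) version 8 or later.
An environment to set up and run a Java application. We recommend that you use an integrated development environment (IDE) such as IntelliJ IDEA or Eclipse IDE to configure Maven or Gradle to build and run your project.
One of the following:
A Hugging Face access token.
An OpenAI API key. You must have a paid OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Node.js project.
npm and Node.js installed.
If you use OpenAI models, you must have an OpenAI API key.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
An environment to run interactive Python notebooks, such as VS Code or Colab.
If you use OpenAI models, you must have an OpenAI API key.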
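Define an Embedding Function
In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model from OpenAI.
Note
Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the model.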
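Create a .env file to manage secrets.
In your project, create a .env file to store your Atlas connection string and Hugging Face access token.
HUGGINGFACEHUB_API_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format: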
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
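As a hedged illustration of the connection string format above, the following Python sketch assembles an SRV connection string and percent-encodes the credentials, which matters when a password contains characters such as @, :, or /. The helper name build_srv_uri and the sample user, password, and host values are hypothetical. In practice you can copy the full string from the Atlas UI instead.

```python
from urllib.parse import quote_plus

def build_srv_uri(username, password, host):
    """Builds an SRV connection string, percent-encoding the credentials."""
    if not host.endswith(".mongodb.net"):
        raise ValueError("expected an Atlas host name ending in .mongodb.net")
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"

# Hypothetical values for illustration only
uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.abcde.mongodb.net")
print(uri)
# prints mongodb+srv://app_user:p%40ss%3Aword%2F1@cluster0.abcde.mongodb.net
```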
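Define a function to generate vector embeddings.
Create a directory in your project named common to store common code that you'll use in later steps:
mkdir common && cd common
Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that generates embeddings for a given input. This function specifies:
The feature-extraction task, using the Go port of the LangChain library. To learn more, see the Tasks documentation in the LangChain JavaScript documentation.
The mxbai-embed-large-v1 embedding model.
get-embeddings.go
package common

import (
	"context"
	"log"

	"github.com/tmc/langchaingo/embeddings/huggingface"
)

// GetEmbeddings generates one embedding per input document.
func GetEmbeddings(documents []string) [][]float32 {
	// Configure the Hugging Face client with the model and task
	hf, err := huggingface.NewHuggingface(
		huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"),
		huggingface.WithTask("feature-extraction"))
	if err != nil {
		log.Fatalf("failed to connect to Hugging Face: %v", err)
	}

	embs, err := hf.EmbedDocuments(context.Background(), documents)
	if err != nil {
		log.Fatalf("failed to generate embeddings: %v", err)
	}

	return embs
}
Note
503 errors when calling Hugging Face models
You might occasionally get 503 errors when you call models on the Hugging Face model hub. To resolve this issue, wait a short time and then retry.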
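The retry advice above can be sketched as a small backoff loop. This Python example is an illustration under stated assumptions, not part of the tutorial's Go code: call_with_retry, ServiceUnavailable, and flaky_embed are hypothetical names, and the simulated call stands in for a real model request that fails twice with a 503 before succeeding.

```python
import time

def call_with_retry(fn, is_retryable, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Calls fn(), retrying with exponential backoff while is_retryable(error) is true."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:
            # Give up on the last attempt or on errors that are not transient
            if attempt == max_attempts or not is_retryable(err):
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))

class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a client error that carries HTTP status 503."""
    status = 503

attempts = {"count": 0}

def flaky_embed():
    # Simulated model call: fails twice with 503, then succeeds
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ServiceUnavailable("model is loading")
    return [0.1, 0.2, 0.3]

result = call_with_retry(
    flaky_embed,
    is_retryable=lambda err: getattr(err, "status", None) == 503,
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(result, attempts["count"])  # prints [0.1, 0.2, 0.3] 3
```

Capping the number of attempts keeps a persistent outage from blocking the script indefinitely.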
Return to the main project root directory.
cd ../
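Create a .env file to manage secrets.
In your project, create a .env file to store your Atlas connection string and OpenAI API key.
OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format: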
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a function to generate vector embeddings.
Create a directory in your project named common to store code that you'll use in multiple steps:
mkdir common && cd common
Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that uses OpenAI's text-embedding-3-small model to generate embeddings for a given input.
get-embeddings.go
package common

import (
	"context"
	"log"

	"github.com/milosgajdos/go-embeddings/openai"
)

// GetEmbeddings generates one embedding per input document.
func GetEmbeddings(docs []string) [][]float64 {
	c := openai.NewClient()

	embReq := &openai.EmbeddingRequest{
		Input:          docs,
		Model:          openai.TextSmallV3,
		EncodingFormat: openai.EncodingFloat,
	}

	embs, err := c.Embed(context.Background(), embReq)
	if err != nil {
		log.Fatalf("failed to connect to OpenAI: %v", err)
	}

	// Collect the raw vectors from the response
	var vectors [][]float64
	for _, emb := range embs {
		vectors = append(vectors, emb.Vector)
	}

	return vectors
}
Return to the main project root directory.
cd ../
Create your Java project and install dependencies.
In your IDE, create a Java project by using Maven or Gradle.
Add the following dependencies, depending on your package manager:
If you are using Maven, add the following dependencies to the dependencies element in your project's pom.xml file:
pom.xml
<dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongodb-driver-sync</artifactId>
        <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with Hugging Face models -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-hugging-face</artifactId>
        <version>0.35.0</version>
    </dependency>
</dependencies>
If you are using Gradle, add the following to the dependencies block in your project's build.gradle file:
build.gradle
dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with Hugging Face models
    implementation 'dev.langchain4j:langchain4j-hugging-face:0.35.0'
}
Run your package manager to install the dependencies to your project.
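Set your environment variables.
Note
This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, a CI/CD pipeline, or a secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.
HUGGING_FACE_ACCESS_TOKEN=<access-token>
ATLAS_CONNECTION_STRING=<connection-string>
Update the placeholders with the following values:
Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: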
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
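Define a method to generate vector embeddings.
Create a file named EmbeddingProvider.java and paste the following code.
This code defines two methods that generate embeddings for a given input by using the mxbai-embed-large-v1 open-source embedding model:
Multiple inputs: The getEmbeddings method accepts a list of text inputs (List<String>), which lets you create multiple embeddings in a single API call. The method converts the arrays of floats that the API returns to BSON arrays of doubles for storage in your Atlas cluster.
Single input: The getEmbedding method accepts a single String, which represents a query that you want to run against your vector data. The method converts the array of floats that the API returns to a BSON array of doubles to use when you query your collection.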
import dev.langchain4j.data.embedding.Embedding; import dev.langchain4j.data.segment.TextSegment; import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel; import dev.langchain4j.model.output.Response; import org.bson.BsonArray; import org.bson.BsonDouble; import java.util.List; import static java.time.Duration.ofSeconds; public class EmbeddingProvider { private static HuggingFaceEmbeddingModel embeddingModel; private static HuggingFaceEmbeddingModel getEmbeddingModel() { if (embeddingModel == null) { String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN"); if (accessToken == null || accessToken.isEmpty()) { throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty."); } embeddingModel = HuggingFaceEmbeddingModel.builder() .accessToken(accessToken) .modelId("mixedbread-ai/mxbai-embed-large-v1") .waitForModel(true) .timeout(ofSeconds(60)) .build(); } return embeddingModel; } /** * Takes an array of strings and returns a BSON array of embeddings to * store in the database. */ public List<BsonArray> getEmbeddings(List<String> texts) { List<TextSegment> textSegments = texts.stream() .map(TextSegment::from) .toList(); Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments); return response.content().stream() .map(e -> new BsonArray( e.vectorAsList().stream() .map(BsonDouble::new) .toList())) .toList(); } /** * Takes a single string and returns a BSON array embedding to * use in a vector query. */ public BsonArray getEmbedding(String text) { Response<Embedding> response = getEmbeddingModel().embed(text); return new BsonArray( response.content().vectorAsList().stream() .map(BsonDouble::new) .toList()); } }
Create your Java project and install dependencies.
In your IDE, create a Java project by using Maven or Gradle.
Add the following dependencies, depending on your package manager:
If you are using Maven, add the following dependencies to the dependencies element in your project's pom.xml file:
pom.xml
<dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongodb-driver-sync</artifactId>
        <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with OpenAI models -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-open-ai</artifactId>
        <version>0.35.0</version>
    </dependency>
</dependencies>
If you are using Gradle, add the following to the dependencies block in your project's build.gradle file:
build.gradle
dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with OpenAI models
    implementation 'dev.langchain4j:langchain4j-open-ai:0.35.0'
}
Run your package manager to install the dependencies to your project.
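Set your environment variables.
Note
This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, a CI/CD pipeline, or a secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.
OPEN_AI_API_KEY=<api-key>
ATLAS_CONNECTION_STRING=<connection-string>
Update the placeholders with the following values:
Replace the <api-key> placeholder value with your OpenAI API key.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: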
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
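Define a method to generate vector embeddings.
Create a file named EmbeddingProvider.java and paste the following code.
This code defines two methods that generate embeddings for a given input by using the text-embedding-3-small OpenAI embedding model:
Multiple inputs: The getEmbeddings method accepts a list of text inputs (List<String>), which lets you create multiple embeddings in a single API call. The method converts the arrays of floats that the API returns to BSON arrays of doubles for storage in your Atlas cluster.
Single input: The getEmbedding method accepts a single String, which represents a query that you want to run against your vector data. The method converts the array of floats that the API returns to a BSON array of doubles to use when you query your collection.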
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;

import java.util.List;

import static java.time.Duration.ofSeconds;

public class EmbeddingProvider {

    private static OpenAiEmbeddingModel embeddingModel;

    // Lazily creates the model client and caches it for later calls
    private static OpenAiEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String apiKey = System.getenv("OPEN_AI_API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                throw new IllegalStateException("OPEN_AI_API_KEY env variable is not set or is empty.");
            }
            // Assign the field so that the client is built only once
            embeddingModel = OpenAiEmbeddingModel.builder()
                    .apiKey(apiKey)
                    .modelName("text-embedding-3-small")
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }

    /**
     * Takes a list of strings and returns a list of BSON array embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<String> texts) {
        List<TextSegment> textSegments = texts.stream()
                .map(TextSegment::from)
                .toList();

        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }

    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}
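Update your package.json file.
Configure your project to use ES modules by adding "type": "module" to your package.json file and then saving it.
{
  "type": "module",
  // other fields...
}
Create a .env file.
In your project, create a .env file to store your Atlas connection string.
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: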
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
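Note
Minimum Node.js version requirements
Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.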
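Define a function to generate vector embeddings.
Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding that generates an embedding for a given input. This function specifies:
The feature-extraction task from Hugging Face's transformers.js library. To learn more, see Tasks.
The nomic-embed-text-v1 embedding model.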
import { pipeline } from '@xenova/transformers'; // Function to generate embeddings for a given data source export async function getEmbedding(data) { const embedder = await pipeline( 'feature-extraction', 'Xenova/nomic-embed-text-v1'); const results = await embedder(data, { pooling: 'mean', normalize: true }); return Array.from(results.data); }
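Update your package.json file.
Configure your project to use ES modules by adding "type": "module" to your package.json file and then saving it.
{
  "type": "module",
  // other fields...
}
Create a .env file.
In your project, create a .env file to store your Atlas connection string and OpenAI API key.
OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format: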
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
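Note
Minimum Node.js version requirements
Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.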
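Define a function to generate vector embeddings.
Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding that uses OpenAI's text-embedding-3-small model to generate an embedding for a given input.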
import OpenAI from 'openai'; // Setup OpenAI configuration const openai = new OpenAI({apiKey: process.env.OPENAI_API_KEY}); // Function to get the embeddings using the OpenAI API export async function getEmbedding(text) { const results = await openai.embeddings.create({ model: "text-embedding-3-small", input: text, encoding_format: "float", }); return results.data[0].embedding; }
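Define a function to generate vector embeddings.
Paste and run the following code in your notebook to create a function that generates vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:
Loads the nomic-embed-text-v1 embedding model.
Creates a function named get_embedding that uses the model to generate an embedding for a given text input.
Tests the function by generating a single embedding for the string foo.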
from sentence_transformers import SentenceTransformer

# Load the embedding model (https://huggingface.co/nomic-ai/nomic-embed-text-v1)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Define a function to generate embeddings
def get_embedding(data):
    """Generates vector embeddings for the given data."""
    embedding = model.encode(data)
    return embedding.tolist()

# Generate an embedding
get_embedding("foo")
[-0.029808253049850464, 0.03841473162174225, -0.02561120130121708, -0.06707508116960526, 0.03867151960730553, ... ]
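Define a function to generate vector embeddings.
Paste and run the following code in your notebook to create a function that generates vector embeddings by using a proprietary embedding model from OpenAI. Replace <api-key> with your OpenAI API key. This code does the following:
Specifies the text-embedding-3-small embedding model.
Creates a function named get_embedding that calls the model's API to generate an embedding for a given text input.
Tests the function by generating a single embedding for the string foo.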
import os from openai import OpenAI # Specify your OpenAI API key and embedding model os.environ["OPENAI_API_KEY"] = "<api-key>" model = "text-embedding-3-small" openai_client = OpenAI() # Define a function to generate embeddings def get_embedding(text): """Generates vector embeddings for the given text.""" embedding = openai_client.embeddings.create(input = [text], model=model).data[0].embedding return embedding # Generate an embedding get_embedding("foo")
[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]
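The get_embedding functions above embed one text per call. When you have many texts, grouping them into batches can reduce the number of model or API calls. The following sketch shows one way to do that. The names embed_in_batches and fake_embed_batch are hypothetical, and fake_embed_batch is only a stand-in for a real model call so that the example runs without credentials.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embeds texts in fixed-size batches and returns vectors in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_vectors = embed_batch(batch)
        # Guard against a model call that drops or duplicates results
        if len(batch_vectors) != len(batch):
            raise ValueError("embedding function returned an unexpected number of vectors")
        vectors.extend(batch_vectors)
    return vectors

def fake_embed_batch(batch):
    # Hypothetical stand-in for a real model call: one 2-dimensional vector per text
    return [[float(len(text)), 0.0] for text in batch]

print(embed_in_batches(["foo", "quux", "a"], fake_embed_batch, batch_size=2))
# prints [[3.0, 0.0], [4.0, 0.0], [1.0, 0.0]]
```

To use this with a real model, embed_batch could wrap a single call that accepts a list of inputs. The right batch size depends on the model's input limits, so treat the default of 100 as an assumption to tune.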
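Create Embeddings from Data
In this section, you create vector embeddings from your data by using the function that you defined, and then you store these embeddings in a collection in Atlas.
Select a tab based on whether you want to create embeddings from new data or from existing data that you already have in Atlas.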
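Create a file named create-embeddings.go and paste the following code.
Use the following code to generate embeddings from new data.
Specifically, this code uses the GetEmbeddings function that you defined and the MongoDB Go Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.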
package main

import (
	"context"
	"fmt"
	"log"
	"my-embeddings-project/common"
	"os"

	"github.com/joho/godotenv"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// Sample texts to embed
var data = []string{
	"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
	"The Lion King: Lion cub and future king Simba searches for his identity",
	"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}

type TextWithEmbedding struct {
	Text      string
	Embedding []float32
}

func main() {
	ctx := context.Background()

	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found")
	}

	// Connect to your Atlas cluster
	uri := os.Getenv("ATLAS_CONNECTION_STRING")
	if uri == "" {
		log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
	}
	clientOptions := options.Client().ApplyURI(uri)
	client, err := mongo.Connect(ctx, clientOptions)
	if err != nil {
		log.Fatalf("failed to connect to the server: %v", err)
	}
	defer func() { _ = client.Disconnect(ctx) }()

	// Set the namespace
	coll := client.Database("sample_db").Collection("embeddings")

	embeddings := common.GetEmbeddings(data)

	// Pair each text with its embedding (avoid shadowing the built-in string type)
	docsToInsert := make([]interface{}, len(data))
	for i, movie := range data {
		docsToInsert[i] = TextWithEmbedding{
			Text:      movie,
			Embedding: embeddings[i],
		}
	}

	result, err := coll.InsertMany(ctx, docsToInsert)
	if err != nil {
		log.Fatalf("failed to insert documents: %v", err)
	}
	fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
package main import ( "context" "fmt" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) var data = []string{ "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "The Lion King: Lion cub and future king Simba searches for his identity", "Avatar: A marine is dispatched to the moon Pandora on a unique mission", } type TextWithEmbedding struct { Text string Embedding []float64 } func main() { ctx := context.Background() if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_db").Collection("embeddings") embeddings := common.GetEmbeddings(data) docsToInsert := make([]interface{}, len(data)) for i, movie := range data { docsToInsert[i] = TextWithEmbedding{ Text: movie, Embedding: embeddings[i], } } result, err := coll.InsertMany(ctx, docsToInsert) if err != nil { log.Fatalf("failed to insert documents: %v", err) } fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs)) }
Save and run the file.
go run create-embeddings.go
Successfully inserted 3 documents into Atlas
You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.
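Before inserting, it can help to confirm that every text has exactly one embedding and that all vectors have the dimension you expect, because a vector index is defined for a fixed number of dimensions. The following sketch pairs texts with embeddings into documents shaped like the ones inserted above. The function name build_documents, the expected_dimensions parameter, and the toy two-dimensional vectors are assumptions for illustration.

```python
def build_documents(texts, embeddings, expected_dimensions=None):
    """Pairs each text with its embedding, validating counts and dimensions."""
    if len(texts) != len(embeddings):
        raise ValueError(f"got {len(texts)} texts but {len(embeddings)} embeddings")
    documents = []
    for text, embedding in zip(texts, embeddings):
        if expected_dimensions is not None and len(embedding) != expected_dimensions:
            raise ValueError(
                f"expected {expected_dimensions} dimensions, got {len(embedding)}")
        # Store plain floats so the values serialize as BSON doubles
        documents.append({"text": text, "embedding": [float(x) for x in embedding]})
    return documents

# Hypothetical toy vectors (real models return far more dimensions)
docs = build_documents(
    ["Titanic", "Avatar"],
    [[0.1, 0.2], [0.3, 0.4]],
    expected_dimensions=2,
)
print(docs[0])  # prints {'text': 'Titanic', 'embedding': [0.1, 0.2]}
```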
Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.
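Create a file named create-embeddings.go and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates embeddings from each document's summary field by using the GetEmbeddings function that you defined.
Updates each document with a new embeddings field that contains the embedding value by using the MongoDB Go Driver.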
package main import ( "context" "log" "my-embeddings-project/common" "os" "github.com/joho/godotenv" "go.mongodb.org/mongo-driver/bson" "go.mongodb.org/mongo-driver/mongo" "go.mongodb.org/mongo-driver/mongo/options" ) func main() { ctx := context.Background() if err := godotenv.Load(); err != nil { log.Println("no .env file found") } // Connect to your Atlas cluster uri := os.Getenv("ATLAS_CONNECTION_STRING") if uri == "" { log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.") } clientOptions := options.Client().ApplyURI(uri) client, err := mongo.Connect(ctx, clientOptions) if err != nil { log.Fatalf("failed to connect to the server: %v", err) } defer func() { _ = client.Disconnect(ctx) }() // Set the namespace coll := client.Database("sample_airbnb").Collection("listingsAndReviews") filter := bson.D{ {"$and", bson.A{ bson.D{ {"$and", bson.A{ bson.D{{"summary", bson.D{{"$exists", true}}}}, bson.D{{"summary", bson.D{{"$ne", ""}}}}, }, }}, bson.D{{"embeddings", bson.D{{"$exists", false}}}}, }}, } opts := options.Find().SetLimit(50) cursor, err := coll.Find(ctx, filter, opts) if err != nil { log.Fatalf("failed to retrieve documents: %v", err) } var listings []common.Listing if err = cursor.All(ctx, &listings); err != nil { log.Fatalf("failed to unmarshal retrieved documents to Listing object: %v", err) } var summaries []string for _, listing := range listings { summaries = append(summaries, listing.Summary) } log.Println("Generating embeddings.") embeddings := common.GetEmbeddings(summaries) docsToUpdate := make([]mongo.WriteModel, len(listings)) for i := range listings { docsToUpdate[i] = mongo.NewUpdateOneModel(). SetFilter(bson.D{{"_id", listings[i].ID}}). 
SetUpdate(bson.D{{"$set", bson.D{{"embeddings", embeddings[i]}}}}) } bulkWriteOptions := options.BulkWrite().SetOrdered(false) result, err := coll.BulkWrite(context.Background(), docsToUpdate, bulkWriteOptions) if err != nil { log.Fatalf("failed to write embeddings to existing documents: %v", err) } log.Printf("Successfully added embeddings to %v documents", result.ModifiedCount) }
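The filter-then-update pattern in the code above can be sketched independently of any driver. The following Python example builds the query filter and the per-document update specifications as plain dictionaries. The function names are hypothetical, and the single-condition filter is a compact equivalent of the nested $and filter, not the tutorial's exact query.

```python
def missing_embeddings_filter(text_field="summary", vector_field="embeddings"):
    """Matches documents with a non-empty text field and no stored embedding."""
    return {
        text_field: {"$exists": True, "$nin": [None, ""]},
        vector_field: {"$exists": False},
    }

def build_update_specs(ids, embeddings, vector_field="embeddings"):
    """Builds one {filter, update} pair per document, keyed by _id."""
    if len(ids) != len(embeddings):
        raise ValueError("each document id needs exactly one embedding")
    return [
        {"filter": {"_id": doc_id}, "update": {"$set": {vector_field: embedding}}}
        for doc_id, embedding in zip(ids, embeddings)
    ]

# Hypothetical ids and toy vectors for illustration
specs = build_update_specs(["10006546", "10009999"], [[0.1, 0.2], [0.3, 0.4]])
print(missing_embeddings_filter())
print(specs[0])
```

Filtering on the missing embeddings field makes the script safe to rerun, because documents that already have embeddings are skipped. With a driver such as PyMongo, each specification could be turned into an UpdateOne operation and sent in a single unordered bulk write.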
Create a file that contains Go models for the collection.
To simplify marshalling and unmarshalling Go objects to and from BSON, create a file that contains models for the documents in this collection.
Move into the common directory.
cd common
Create a file named models.go and paste the following code into it. If your GetEmbeddings function returns [][]float64, as the OpenAI version on this page does, declare the Embeddings field of the Listing struct as []float64 instead of []float32.
models.go
package common

import (
	"time"

	"go.mongodb.org/mongo-driver/bson/primitive"
)

type Image struct {
	ThumbnailURL string `bson:"thumbnail_url"`
	MediumURL string `bson:"medium_url"`
	PictureURL string `bson:"picture_url"`
	XLPictureURL string `bson:"xl_picture_url"`
}

type Host struct {
	ID string `bson:"host_id"`
	URL string `bson:"host_url"`
	Name string `bson:"host_name"`
	Location string `bson:"host_location"`
	About string `bson:"host_about"`
	ThumbnailURL string `bson:"host_thumbnail_url"`
	PictureURL string `bson:"host_picture_url"`
	Neighborhood string `bson:"host_neighborhood"`
	IsSuperhost bool `bson:"host_is_superhost"`
	HasProfilePic bool `bson:"host_has_profile_pic"`
	IdentityVerified bool `bson:"host_identity_verified"`
	ListingsCount int32 `bson:"host_listings_count"`
	TotalListingsCount int32 `bson:"host_total_listings_count"`
	Verifications []string `bson:"host_verifications"`
}

type Location struct {
	Type string `bson:"type"`
	Coordinates []float64 `bson:"coordinates"`
	IsLocationExact bool `bson:"is_location_exact"`
}

type Address struct {
	Street string `bson:"street"`
	Suburb string `bson:"suburb"`
	GovernmentArea string `bson:"government_area"`
	Market string `bson:"market"`
	Country string `bson:"country"`
	CountryCode string `bson:"country_code"`
	Location Location `bson:"location"`
}

type Availability struct {
	Thirty int32 `bson:"availability_30"`
	Sixty int32 `bson:"availability_60"`
	Ninety int32 `bson:"availability_90"`
	ThreeSixtyFive int32 `bson:"availability_365"`
}

type ReviewScores struct {
	Accuracy int32 `bson:"review_scores_accuracy"`
	Cleanliness int32 `bson:"review_scores_cleanliness"`
	CheckIn int32 `bson:"review_scores_checkin"`
	Communication int32 `bson:"review_scores_communication"`
	Location int32 `bson:"review_scores_location"`
	Value int32 `bson:"review_scores_value"`
	Rating int32 `bson:"review_scores_rating"`
}

type Review struct {
	ID string `bson:"_id"`
	Date time.Time `bson:"date,omitempty"`
	ListingId string `bson:"listing_id"`
	ReviewerId string `bson:"reviewer_id"`
	ReviewerName string `bson:"reviewer_name"`
	Comments string `bson:"comments"`
}

type Listing struct {
	ID string `bson:"_id"`
	ListingURL string `bson:"listing_url"`
	Name string `bson:"name"`
	Summary string `bson:"summary"`
	Space string `bson:"space"`
	Description string `bson:"description"`
	NeighborhoodOverview string `bson:"neighborhood_overview"`
	Notes string `bson:"notes"`
	Transit string `bson:"transit"`
	Access string `bson:"access"`
	Interaction string `bson:"interaction"`
	HouseRules string `bson:"house_rules"`
	PropertyType string `bson:"property_type"`
	RoomType string `bson:"room_type"`
	BedType string `bson:"bed_type"`
	MinimumNights string `bson:"minimum_nights"`
	MaximumNights string `bson:"maximum_nights"`
	CancellationPolicy string `bson:"cancellation_policy"`
	LastScraped time.Time `bson:"last_scraped,omitempty"`
	CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"`
	FirstReview time.Time `bson:"first_review,omitempty"`
	LastReview time.Time `bson:"last_review,omitempty"`
	Accommodates int32 `bson:"accommodates"`
	Bedrooms int32 `bson:"bedrooms"`
	Beds int32 `bson:"beds"`
	NumberOfReviews int32 `bson:"number_of_reviews"`
	Bathrooms primitive.Decimal128 `bson:"bathrooms"`
	Amenities []string `bson:"amenities"`
	Price primitive.Decimal128 `bson:"price"`
	WeeklyPrice primitive.Decimal128 `bson:"weekly_price"`
	MonthlyPrice primitive.Decimal128 `bson:"monthly_price"`
	CleaningFee primitive.Decimal128 `bson:"cleaning_fee"`
	ExtraPeople primitive.Decimal128 `bson:"extra_people"`
	GuestsIncluded primitive.Decimal128 `bson:"guests_included"`
	Image Image `bson:"images"`
	Host Host `bson:"host"`
	Address Address `bson:"address"`
	Availability Availability `bson:"availability"`
	ReviewScores ReviewScores `bson:"review_scores"`
	Reviews []Review `bson:"reviews"`
	// Use []float64 here if your embedding function returns float64 values
	Embeddings []float32 `bson:"embeddings,omitempty"`
}
Return to the project root directory.
cd ../
Generate embeddings.
go run create-embeddings.go
2024/10/10 09:58:03 Generating embeddings.
2024/10/10 09:58:12 Successfully added embeddings to 50 documents
You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI.
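Define the code to generate embeddings from new data.
Create a file named CreateEmbeddings.java and paste the following code.
This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:
Connect to your Atlas cluster.
Get the array of sample texts.
Generate embeddings from each text by using the getEmbeddings method that you defined previously.
Ingest the embeddings into the sample_db.embeddings collection in Atlas.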
import com.mongodb.MongoException; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoDatabase; import com.mongodb.client.result.InsertManyResult; import org.bson.BsonArray; import org.bson.Document; import java.util.ArrayList; import java.util.Arrays; import java.util.List; public class CreateEmbeddings { static List<String> data = Arrays.asList( "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "The Lion King: Lion cub and future king Simba searches for his identity", "Avatar: A marine is dispatched to the moon Pandora on a unique mission" ); public static void main(String[] args){ String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_db"); MongoCollection<Document> collection = database.getCollection("embeddings"); System.out.println("Creating embeddings for " + data.size() + " documents"); EmbeddingProvider embeddingProvider = new EmbeddingProvider(); // generate embeddings for new inputted data List<BsonArray> embeddings = embeddingProvider.getEmbeddings(data); List<Document> documents = new ArrayList<>(); int i = 0; for (String text : data) { Document doc = new Document("text", text).append("embedding", embeddings.get(i)); documents.add(doc); i++; } // insert the embeddings into the Atlas collection List<String> insertedIds = new ArrayList<>(); try { InsertManyResult result = collection.insertMany(documents); result.getInsertedIds().values() .forEach(doc -> insertedIds.add(doc.toString())); System.out.println("Inserted " + insertedIds.size() + " documents with the following ids to " + collection.getNamespace() + " collection: \n " + 
insertedIds); } catch (MongoException me) { throw new RuntimeException("Failed to insert documents", me); } } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB ", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } }
Generate the embeddings.
Save and run the file. The output resembles the following:
Creating embeddings for 3 documents Inserted 3 documents with the following ids to sample_db.embeddings collection: [BsonObjectId{value=6735ff620d88451041f6dd40}, BsonObjectId{value=6735ff620d88451041f6dd41}, BsonObjectId{value=6735ff620d88451041f6dd42}]
You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.
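Define the code to generate embeddings from an existing collection in Atlas.
Create a file named CreateEmbeddings.java and paste the following code.
This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:
Connect to your Atlas cluster.
Get a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generate embeddings from each document's summary field by using the getEmbeddings method that you defined previously.
Update each document with a new embeddings field that contains the embedding value.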
import com.mongodb.MongoException; import com.mongodb.bulk.BulkWriteResult; import com.mongodb.client.MongoClient; import com.mongodb.client.MongoClients; import com.mongodb.client.MongoCollection; import com.mongodb.client.MongoCursor; import com.mongodb.client.MongoDatabase; import com.mongodb.client.model.BulkWriteOptions; import com.mongodb.client.model.Filters; import com.mongodb.client.model.UpdateOneModel; import com.mongodb.client.model.Updates; import com.mongodb.client.model.WriteModel; import org.bson.BsonArray; import org.bson.Document; import org.bson.conversions.Bson; import java.util.ArrayList; import java.util.List; public class CreateEmbeddings { public static void main(String[] args){ String uri = System.getenv("ATLAS_CONNECTION_STRING"); if (uri == null || uri.isEmpty()) { throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty."); } // establish connection and set namespace try (MongoClient mongoClient = MongoClients.create(uri)) { MongoDatabase database = mongoClient.getDatabase("sample_airbnb"); MongoCollection<Document> collection = database.getCollection("listingsAndReviews"); Bson filterCriteria = Filters.and( Filters.and(Filters.exists("summary"), Filters.ne("summary", null), Filters.ne("summary", "")), Filters.exists("embeddings", false)); try (MongoCursor<Document> cursor = collection.find(filterCriteria).limit(50).iterator()) { List<String> summaries = new ArrayList<>(); List<String> documentIds = new ArrayList<>(); int i = 0; while (cursor.hasNext()) { Document document = cursor.next(); String summary = document.getString("summary"); String id = document.get("_id").toString(); summaries.add(summary); documentIds.add(id); i++; } System.out.println("Generating embeddings for " + summaries.size() + " documents."); System.out.println("This operation may take up to several minutes."); EmbeddingProvider embeddingProvider = new EmbeddingProvider(); List<BsonArray> embeddings = 
embeddingProvider.getEmbeddings(summaries); List<WriteModel<Document>> updateDocuments = new ArrayList<>(); for (int j = 0; j < summaries.size(); j++) { UpdateOneModel<Document> updateDoc = new UpdateOneModel<>( Filters.eq("_id", documentIds.get(j)), Updates.set("embeddings", embeddings.get(j))); updateDocuments.add(updateDoc); } int updatedDocsCount = 0; try { BulkWriteOptions options = new BulkWriteOptions().ordered(false); BulkWriteResult result = collection.bulkWrite(updateDocuments, options); updatedDocsCount = result.getModifiedCount(); } catch (MongoException me) { throw new RuntimeException("Failed to insert documents", me); } System.out.println("Added embeddings successfully to " + updatedDocsCount + " documents."); } } catch (MongoException me) { throw new RuntimeException("Failed to connect to MongoDB", me); } catch (Exception e) { throw new RuntimeException("Operation failed: ", e); } } }
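Generate the embeddings.
Save and run the file. The output resembles the following:
Generating embeddings for 50 documents.
This operation may take up to several minutes.
Added embeddings successfully to 50 documents.
You can also view the vector embeddings in the Atlas UI by navigating to the sample_airbnb.listingsAndReviews
collection in your cluster.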
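Create a file named create-embeddings.js
and paste the following code.
Use the following code to generate embeddings from new data.
Specifically, this code uses the getEmbedding
function that you defined and the MongoDB Node.js driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings
collection in Atlas.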
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';

// Data to embed
const data = [
  "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
  "The Lion King: Lion cub and future king Simba searches for his identity",
  "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

async function run() {
  // Connect to your Atlas cluster
  const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
  try {
    await client.connect();
    const db = client.db("sample_db");
    const collection = db.collection("embeddings");

    await Promise.all(data.map(async text => {
      // Check if the document already exists
      const existingDoc = await collection.findOne({ text: text });
      if (!existingDoc) {
        // Generate an embedding only for new text, by using the function that you defined
        const embedding = await getEmbedding(text);
        // Ingest data and embedding into Atlas
        await collection.insertOne({ text: text, embedding: embedding });
        console.log(embedding);
      }
    }));
  } catch (err) {
    console.log(err.stack);
  } finally {
    await client.close();
  }
}
run().catch(console.dir);
Save and run the file:
node --env-file=.env create-embeddings.js
The output resembles one of the following, depending on the model you use:
[ -0.04323853924870491, -0.008460805751383305, 0.012494648806750774, -0.013014335185289383, ... ]
[ -0.017400473356246948, 0.04922063276171684, -0.002836339408531785, -0.030395228415727615, ... ]
[ -0.016950927674770355, 0.013881809078156948, -0.022074559703469276, -0.02838018536567688, ... ]
[ 0.031927742, -0.014192767, -0.021851597, 0.045498233, -0.0077904654, ... ]
[ -0.01664538, 0.013198251, 0.048684783, 0.014485021, -0.018121032, ... ]
[ 0.030449908, 0.046782598, 0.02126599, 0.025799986, -0.015830345, ... ]
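Note
For readability, the number of dimensions in the output is truncated.
You can also view the vector embeddings in the Atlas UI by navigating to the sample_db.embeddings
collection in your cluster.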
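The ingest pattern described above (look up the text, and only embed and insert it when it is missing) can be sketched without a database. This is a minimal illustration, not the tutorial's code: `ingest_new_texts`, the in-memory `store`, and `fake_embed` are hypothetical stand-ins for the collection and for the embedding function that you defined.

```python
from typing import Callable, Dict, List


def ingest_new_texts(
    texts: List[str],
    store: Dict[str, List[float]],
    embed: Callable[[str], List[float]],
) -> int:
    """Embed and store only the texts that are not already in `store`."""
    inserted = 0
    for text in texts:
        if text in store:
            # Already ingested: skip the (potentially costly) embedding call
            continue
        store[text] = embed(text)
        inserted += 1
    return inserted


def fake_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a real embedding model
    return [float(len(text)), float(text.count(" "))]


store: Dict[str, List[float]] = {}
data = ["first sample text", "second sample text", "third sample text"]

first_run = ingest_new_texts(data, store, fake_embed)
second_run = ingest_new_texts(data, store, fake_embed)
print(first_run, second_run)
```

Running the ingest twice inserts documents only the first time, which is why re-running the script does not create duplicates.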
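Note
This example uses the sample_airbnb.listingsAndReviews
collection from the sample data, but you can adapt the code to work with any collection in your cluster.
Create a file named create-embeddings.js
and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates an embedding from each document's summary field by using the getEmbedding function that you defined.
Updates each document with a new embedding field that contains the embedding value by using the MongoDB Node.js driver.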
import { MongoClient } from 'mongodb'; import { getEmbedding } from './get-embeddings.js'; // Connect to your Atlas cluster const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING); async function run() { try { await client.connect(); const db = client.db("sample_airbnb"); const collection = db.collection("listingsAndReviews"); // Filter to exclude null or empty summary fields const filter = { "summary": { "$nin": [ null, "" ] } }; // Get a subset of documents from the collection const documents = await collection.find(filter).limit(50).toArray(); // Create embeddings from a field in the collection let updatedDocCount = 0; console.log("Generating embeddings for documents..."); await Promise.all(documents.map(async doc => { // Generate an embedding by using the function that you defined const embedding = await getEmbedding(doc.summary); // Update the document with a new embedding field await collection.updateOne({ "_id": doc._id }, { "$set": { "embedding": embedding } } ); updatedDocCount += 1; })); console.log("Count of documents updated: " + updatedDocCount); } catch (err) { console.log(err.stack); } finally { await client.close(); } } run().catch(console.dir);
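Save and run the file.
node --env-file=.env create-embeddings.js
Generating embeddings for documents...
Count of documents updated: 50
You can view the generated vector embeddings in the Atlas UI by navigating to the sample_airbnb.listingsAndReviews collection and expanding the fields in a document.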
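The script above processes a subset of 50 documents. If you adapt it to a larger collection, you might prefer to work through the documents in fixed-size batches, for example to stay within an embedding provider's rate limits. The sketch below is an assumption-laden illustration of that idea only: `batched` is a hypothetical helper and the batch size of 20 is arbitrary.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Stand-in for the summary fields of 50 documents
summaries = [f"summary {i}" for i in range(50)]
batches = list(batched(summaries, 20))
print([len(batch) for batch in batches])
```

Each batch could then be passed to the embedding function and written back before the next batch starts.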
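Paste the following code into your notebook.
Use the following code to generate embeddings from new data.
Specifically, this code uses the get_embedding
function that you defined and the MongoDB PyMongo driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings
collection.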
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]

# Sample data
data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

# Ingest data into Atlas
inserted_doc_count = 0
for text in data:
    embedding = get_embedding(text)
    collection.insert_one({ "text": text, "embedding": embedding })
    inserted_doc_count += 1

print(f"Inserted {inserted_doc_count} documents.")
Inserted 3 documents.
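Specify the connection string.
Replace <connection-string>
with the SRV connection string for your Atlas cluster.
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net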
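If your database username or password contains reserved characters such as @ or /, they must be percent-encoded before you place them in the connection string. The following sketch shows one way to build a string in the format above with the Python standard library. The function name `build_srv_uri`, the credentials, and the host `cluster0.example.mongodb.net` are hypothetical placeholders, not values from this tutorial.

```python
from urllib.parse import quote_plus, urlsplit


def build_srv_uri(username: str, password: str, host: str) -> str:
    """Build an SRV connection string, percent-encoding the credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Hypothetical credentials and host, for illustration only
uri = build_srv_uri("app_user", "p@ss/word", "cluster0.example.mongodb.net")
parts = urlsplit(uri)
print(uri)
print(parts.scheme, parts.hostname)
```

Parsing the result back with `urlsplit` is a quick sanity check that the host was not corrupted by an unencoded character in the password.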
Run the code.
You can verify the vector embeddings by viewing them in the Atlas UI: navigate to the sample_db.embeddings
collection in your cluster.
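Note
This example uses the sample_airbnb.listingsAndReviews
collection from the sample data, but you can adapt the code to work with any collection in your cluster.
Paste the following code into your notebook.
Use the following code to generate embeddings from a field in an existing collection. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates an embedding from each document's summary field by using the get_embedding function that you defined.
Updates each document with a new embedding field that contains the embedding value by using the MongoDB PyMongo driver.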
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]

# Filter to exclude null or empty summary fields
filter = { "summary": {"$nin": [ None, "" ]} }

# Get a subset of documents in the collection
documents = collection.find(filter).limit(50)

# Update each document with a new embedding field
updated_doc_count = 0
for doc in documents:
    embedding = get_embedding(doc["summary"])
    collection.update_one(
        { "_id": doc["_id"] },
        { "$set": { "embedding": embedding } }
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents.")
Updated 50 documents.
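Specify the connection string.
Replace <connection-string>
with the SRV connection string for your Atlas cluster.
Note
Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Run the code.
You can view the generated vector embeddings in the Atlas UI by navigating to the sample_airbnb.listingsAndReviews collection and expanding the fields in a document.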
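Create Embeddings for Queries
In this section, you index the vector embeddings in your collection and create an embedding to use in a sample vector search query.
When you run the query, Atlas Vector Search returns the documents whose embeddings are closest in distance to the embedding from your vector search query. Closeness in distance indicates that the documents are similar in meaning.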
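To make "closest in distance" concrete, the sketch below ranks two toy document vectors against a query vector by Euclidean distance, one example of a distance measure. It assumes the normalization that the Atlas Vector Search scoring documentation describes for the euclidean similarity, score = 1 / (1 + distance). The two-dimensional vectors and the names `near` and `far` are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_score(a: List[float], b: List[float]) -> float:
    # Assumed normalization: smaller distance gives a score closer to 1
    return 1.0 / (1.0 + euclidean(a, b))


query = [1.0, 0.0]
docs: Dict[str, List[float]] = {
    "near": [0.9, 0.1],   # points in almost the same direction as the query
    "far": [-1.0, 0.0],   # points in the opposite direction
}

# Highest score first, as a vector search would return results
ranked = sorted(docs, key=lambda name: euclidean_score(query, docs[name]), reverse=True)
print(ranked)
```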
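Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_db.embeddings
collection that specifies the embedding
field as the vector type and the similarity measure as dotProduct.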
Create a file named create-index.go and paste the following code.
create-index.go
package main

import (
    "context"
    "log"
    "os"
    "time"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")

    type vectorDefinitionField struct {
        Type          string `bson:"type"`
        Path          string `bson:"path"`
        NumDimensions int    `bson:"numDimensions"`
        Similarity    string `bson:"similarity"`
    }
    type vectorDefinition struct {
        Fields []vectorDefinitionField `bson:"fields"`
    }

    indexModel := mongo.SearchIndexModel{
        Definition: vectorDefinition{
            Fields: []vectorDefinitionField{{
                Type:          "vector",
                Path:          "embedding",
                NumDimensions: <dimensions>,
                Similarity:    "dotProduct"}},
        },
        Options: opts,
    }

    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
        log.Fatalf("failed to create the search index: %v", err)
    }

    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
        cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
        if err != nil {
            // Stop here rather than discarding the error
            log.Fatalf("failed to list search indexes: %v", err)
        }
        if !cursor.Next(ctx) {
            break
        }
        name := cursor.Current.Lookup("name").StringValue()
        queryable := cursor.Current.Lookup("queryable").Boolean()
        if name == searchIndexName && queryable {
            doc = cursor.Current
        } else {
            time.Sleep(5 * time.Second)
        }
    }

    log.Println("Name of Index Created: " + searchIndexName)
}
Replace the <dimensions> placeholder value with 1024 if you use the open-source model, or with 1536 if you use the OpenAI model.
Save the file, then run the following command:
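go run create-index.go
2024/10/09 17:38:51 Creating the index.
2024/10/09 17:38:52 Polling to confirm successful index creation.
2024/10/09 17:39:22 Name of Index Created: vector_index
Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.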
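Because the index is not queryable until the build finishes, scripts typically poll its status before running a query. The sketch below shows the general shape of such a loop with a timeout, so that it cannot spin forever. It is a generic illustration rather than driver code: `wait_until_queryable` and the `is_queryable` callable are hypothetical, and in practice the callable would wrap your driver's call for listing search indexes.

```python
import time
from typing import Callable


def wait_until_queryable(
    is_queryable: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll `is_queryable` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_queryable():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Simulated status checks: not ready twice, then ready
answers = iter([False, False, True])
ready = wait_until_queryable(lambda: next(answers), timeout_s=1.0, interval_s=0.0)
print(ready)
```

Returning False on timeout lets the caller decide whether to retry or to report the failure.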
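Create an embedding for the vector search query and run the query.
Create a file named vector-query.go and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environment differences can cause slight variations in the embeddings.
To learn more, see Run Vector Search Queries.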
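The aggregation pipeline described above has the same shape in every driver. As a language-neutral view of it, the sketch below builds the two stages as plain Python dictionaries, using only the fields that appear in this tutorial's queries (queryVector, index, path, exact, limit, and the vectorSearchScore metadata). The helper name `build_pipeline` and the three-element query vector are hypothetical.

```python
from typing import Any, Dict, List


def build_pipeline(
    query_vector: List[float],
    index: str = "vector_index",
    path: str = "embedding",
    limit: int = 5,
    project_field: str = "text",
) -> List[Dict[str, Any]]:
    """Return a $vectorSearch stage followed by a $project stage."""
    return [
        {
            "$vectorSearch": {
                "queryVector": query_vector,
                "index": index,
                "path": path,
                "exact": True,   # exact (ENN) search rather than approximate
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                project_field: 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# A toy vector; a real query vector comes from your embedding function
pipeline = build_pipeline([0.1, 0.2, 0.3])
print(pipeline[0]["$vectorSearch"]["index"])
```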
vector-query.go
package main

import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

type TextAndScore struct {
    Text  string  `bson:"text"`
    Score float32 `bson:"score"`
}

func main() {
    ctx := context.Background()

    // Load environment variables from the .env file
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")

    query := "ocean tragedy"
    queryEmbedding := common.GetEmbeddings([]string{query})

    pipeline := mongo.Pipeline{
        bson.D{
            {"$vectorSearch", bson.D{
                {"queryVector", queryEmbedding[0]},
                {"index", "vector_index"},
                {"path", "embedding"},
                {"exact", true},
                {"limit", 5},
            }},
        },
        bson.D{
            {"$project", bson.D{
                {"_id", 0},
                {"text", 1},
                {"score", bson.D{
                    {"$meta", "vectorSearchScore"},
                }},
            }},
        },
    }

    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
        log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()

    var matchingDocs []TextAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
        log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
        fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
    }
}
A second listing of this file is identical except that it declares the Score field as float64 instead of float32.
Save the file, then run the following command:
go run vector-query.go
The output resembles one of the following, depending on the model you use:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.0042472864
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.0031167597
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.0024476869
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.4552372694015503
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.4050072133541107
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.35942140221595764
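Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_airbnb.listingsAndReviews
collection that specifies the embeddings
field as the vector type and the similarity measure as dotProduct.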
Create a file named create-index.go and paste the following code.
create-index.go
package main

import (
    "context"
    "log"
    "os"
    "time"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")

    type vectorDefinitionField struct {
        Type          string `bson:"type"`
        Path          string `bson:"path"`
        NumDimensions int    `bson:"numDimensions"`
        Similarity    string `bson:"similarity"`
    }
    type vectorDefinition struct {
        Fields []vectorDefinitionField `bson:"fields"`
    }

    indexModel := mongo.SearchIndexModel{
        Definition: vectorDefinition{
            Fields: []vectorDefinitionField{{
                Type:          "vector",
                Path:          "embeddings",
                NumDimensions: <dimensions>,
                Similarity:    "dotProduct"}},
        },
        Options: opts,
    }

    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
        log.Fatalf("failed to create the search index: %v", err)
    }

    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
        cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
        if err != nil {
            // Stop here rather than discarding the error
            log.Fatalf("failed to list search indexes: %v", err)
        }
        if !cursor.Next(ctx) {
            break
        }
        name := cursor.Current.Lookup("name").StringValue()
        queryable := cursor.Current.Lookup("queryable").Boolean()
        if name == searchIndexName && queryable {
            doc = cursor.Current
        } else {
            time.Sleep(5 * time.Second)
        }
    }

    log.Println("Name of Index Created: " + searchIndexName)
}
Replace the <dimensions> placeholder value with 1024 if you use the open-source model, or with 1536 if you use the OpenAI model.
Save the file, then run the following command:
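go run create-index.go
2024/10/10 10:03:12 Creating the index.
2024/10/10 10:03:13 Polling to confirm successful index creation.
2024/10/10 10:03:44 Name of Index Created: vector_index
Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.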
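Create an embedding for the vector search query and run the query.
Create a file named vector-query.go and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environment differences can cause slight variations in the embeddings.
To learn more, see Run Vector Search Queries.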
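An ENN (exact nearest neighbor) search compares the query vector with every indexed vector instead of using an approximate index structure. The brute-force sketch below illustrates that idea on toy data with a dot-product similarity. It is a conceptual model only and says nothing about how Atlas implements the search; `exact_top_k`, the document names, and the two-dimensional vectors are hypothetical.

```python
import heapq
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query: List[float], docs: Dict[str, List[float]], k: int) -> List[str]:
    """Score every document against the query and keep the k best."""
    return heapq.nlargest(k, docs, key=lambda name: dot(query, docs[name]))


docs = {
    "beach": [0.9, 0.1],
    "city": [0.1, 0.9],
    "coast": [0.8, 0.3],
}
top = exact_top_k([1.0, 0.0], docs, 2)
print(top)
```

Exhaustive comparison scales linearly with the number of vectors, which is acceptable for the 50 documents in this tutorial.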
vector-query.go
package main

import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

type SummaryAndScore struct {
    Summary string  `bson:"summary"`
    Score   float32 `bson:"score"`
}

func main() {
    ctx := context.Background()

    // Load environment variables from the .env file
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")

    query := "beach house"
    queryEmbedding := common.GetEmbeddings([]string{query})

    pipeline := mongo.Pipeline{
        bson.D{
            {"$vectorSearch", bson.D{
                {"queryVector", queryEmbedding[0]},
                {"index", "vector_index"},
                {"path", "embeddings"},
                {"exact", true},
                {"limit", 5},
            }},
        },
        bson.D{
            {"$project", bson.D{
                {"_id", 0},
                {"summary", 1},
                {"score", bson.D{
                    {"$meta", "vectorSearchScore"},
                }},
            }},
        },
    }

    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
        log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()

    var matchingDocs []SummaryAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
        log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
        fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
    }
}
A second listing of this file is identical except that it declares the Score field as float64 instead of float32.
Save the file, then run the following command:
go run vector-query.go Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living. Score: 0.0045180833 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.004480799 Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway. Score: 0.0042421296 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.004227752 Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. Score: 0.0042201905 go run vector-query.go Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. 
Score: 0.4832950830459595 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.48093676567077637 Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!! Score: 0.4629695415496826 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.45800843834877014 Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2. Score: 0.45398443937301636
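Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_db.embeddings
collection that specifies the embedding
field as the vector type and the similarity measure as dotProduct.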
Create a file named CreateIndex.java and paste the following code:
CreateIndex.java
import com.mongodb.MongoException;
import com.mongodb.client.ListSearchIndexesIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Collections;
import java.util.List;

public class CreateIndex {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");

            // define the index details
            String indexName = "vector_index";
            int dimensionsHuggingFaceModel = 1024;
            int dimensionsOpenAiModel = 1536;

            Bson definition = new Document(
                    "fields",
                    Collections.singletonList(
                            new Document("type", "vector")
                                    .append("path", "embedding")
                                    .append("numDimensions", <dimensions>) // replace with var for the model used
                                    .append("similarity", "dotProduct")));

            // define the index model using the specified details
            SearchIndexModel indexModel = new SearchIndexModel(
                    indexName,
                    definition,
                    SearchIndexType.vectorSearch());

            // Create the index using the model
            try {
                List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
                System.out.println("Successfully created a vector index named: " + result);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }

            // Wait for Atlas to build the index and make it queryable
            System.out.println("Polling to confirm the index has completed building.");
            System.out.println("It may take up to a minute for the index to build before you can query using it.");

            ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
            Document doc = null;
            while (doc == null) {
                try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
                    if (!cursor.hasNext()) {
                        break;
                    }
                    Document current = cursor.next();
                    String name = current.getString("name");
                    boolean queryable = current.getBoolean("queryable");
                    if (name.equals(indexName) && queryable) {
                        doc = current;
                    } else {
                        Thread.sleep(500);
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            System.out.println(indexName + " index is ready to query");

        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}
Replace the <dimensions> placeholder value with the appropriate variable for the model you use:
dimensionsHuggingFaceModel: 1024 dimensions ("mixedbread-ai/mxbai-embed-large-v1" model)
dimensionsOpenAiModel: 1536 dimensions ("text-embedding-3-small" model)
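The number of dimensions in the index definition has to match the length of the vectors your model produces. A quick check before you create the index can catch a mismatch early. The sketch below uses the two model names and dimension counts listed above; the `MODEL_DIMENSIONS` table and `check_dimensions` function are hypothetical helpers, and the zero-filled vector stands in for a real embedding.

```python
from typing import Dict, List

# Dimension counts for the two models named in this tutorial
MODEL_DIMENSIONS: Dict[str, int] = {
    "mixedbread-ai/mxbai-embed-large-v1": 1024,
    "text-embedding-3-small": 1536,
}


def check_dimensions(model: str, embedding: List[float]) -> int:
    """Return the expected dimension count, or raise if the vector length differs."""
    expected = MODEL_DIMENSIONS[model]
    if len(embedding) != expected:
        raise ValueError(
            f"{model} should produce {expected} dimensions, got {len(embedding)}"
        )
    return expected


# Stand-in vector with the right length for the OpenAI model
dims = check_dimensions("text-embedding-3-small", [0.0] * 1536)
print(dims)
```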
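Save and run the file. The output resembles the following:
Successfully created a vector index named: [vector_index]
Polling to confirm the index has completed building.
It may take up to a minute for the index to build before you can query using it.
vector_index index is ready to query
Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.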
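Create an embedding for the vector search query and run the query.
Create a file named VectorQuery.java and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environment differences can cause slight variations in the embeddings.
To learn more, see Run Vector Search Queries.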
VectorQuery.java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.search.FieldSearchPath;
import org.bson.BsonArray;
import org.bson.BsonValue;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.ArrayList;
import java.util.List;

import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
import static java.util.Arrays.asList;

public class VectorQuery {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");

            // define $vectorSearch query options
            String query = "ocean tragedy";
            EmbeddingProvider embeddingProvider = new EmbeddingProvider();
            BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
            List<Double> embedding = new ArrayList<>();
            // BsonArray is a List<BsonValue>, so iterate it directly (also works on JDK 8)
            for (BsonValue value : embeddingBsonArray) {
                embedding.add(value.asDouble().getValue());
            }

            // define $vectorSearch pipeline
            String indexName = "vector_index";
            FieldSearchPath fieldSearchPath = fieldPath("embedding");
            int limit = 5;
            List<Bson> pipeline = asList(
                    vectorSearch(
                            fieldSearchPath,
                            embedding,
                            indexName,
                            limit,
                            exactVectorSearchOptions()
                    ),
                    project(
                            fields(exclude("_id"), include("text"),
                                    metaVectorSearchScore("score"))));

            // run query and print results
            List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
            if (results.isEmpty()) {
                System.out.println("No results found.");
            } else {
                results.forEach(doc -> {
                    System.out.println("Text: " + doc.getString("text"));
                    System.out.println("Score: " + doc.getDouble("score"));
                });
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}
Save and run the file. The output resembles one of the following, depending on the model you use:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.004247286356985569
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.003116759704425931
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.002447686856612563
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.45522359013557434
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.4049977660179138
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.35942474007606506
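Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_airbnb.listingsAndReviews
collection that specifies the embeddings
field as the vector type and the similarity measure as dotProduct.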
Create a file named CreateIndex.java and paste the following code:

CreateIndex.java

    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.ListSearchIndexesIterable;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.model.SearchIndexModel;
    import com.mongodb.client.model.SearchIndexType;
    import org.bson.Document;
    import org.bson.conversions.Bson;

    import java.util.Collections;
    import java.util.List;

    public class CreateIndex {
        public static void main(String[] args) {
            String uri = System.getenv("ATLAS_CONNECTION_STRING");
            if (uri == null || uri.isEmpty()) {
                throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
            }

            // establish connection and set namespace
            try (MongoClient mongoClient = MongoClients.create(uri)) {
                MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
                MongoCollection<Document> collection = database.getCollection("listingsAndReviews");

                // define the index details
                String indexName = "vector_index";
                int dimensionsHuggingFaceModel = 1024;
                int dimensionsOpenAiModel = 1536;

                Bson definition = new Document(
                        "fields",
                        Collections.singletonList(
                                new Document("type", "vector")
                                        .append("path", "embeddings")
                                        .append("numDimensions", <dimensions>) // replace with var for the model used
                                        .append("similarity", "dotProduct")));

                // define the index model using the specified details
                SearchIndexModel indexModel = new SearchIndexModel(
                        indexName,
                        definition,
                        SearchIndexType.vectorSearch());

                // create the index using the model
                try {
                    List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
                    System.out.println("Successfully created a vector index named: " + result);
                    System.out.println("It may take up to a minute for the index to build before you can query using it.");
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }

                // wait for Atlas to build the index and make it queryable
                System.out.println("Polling to confirm the index has completed building.");
                ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
                Document doc = null;
                while (doc == null) {
                    try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
                        if (!cursor.hasNext()) {
                            break;
                        }
                        // checks the first search index returned for the collection
                        Document current = cursor.next();
                        String name = current.getString("name");
                        boolean queryable = current.getBoolean("queryable");
                        if (name.equals(indexName) && queryable) {
                            doc = current;
                        } else {
                            Thread.sleep(500);
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
                System.out.println(indexName + " index is ready to query");
            } catch (MongoException me) {
                throw new RuntimeException("Failed to connect to MongoDB ", me);
            } catch (Exception e) {
                throw new RuntimeException("Operation failed: ", e);
            }
        }
    }

Replace the <dimensions> placeholder value with the variable for the model you use:
dimensionsHuggingFaceModel: 1024 dimensions (open-source)
dimensionsOpenAiModel: 1536 dimensions
Save and run the file. The output resembles the following:

    Successfully created a vector index named: [vector_index]
    It may take up to a minute for the index to build before you can query using it.
    Polling to confirm the index has completed building.
    vector_index index is ready to query
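Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.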
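The index definition used in this step has a small, fixed shape. The following Python sketch builds the same structure and validates its inputs before anything is sent to Atlas. The helper name build_vector_index_definition is hypothetical and is not part of any MongoDB driver; the field names and the three similarity values mirror the definition shown on this page.

```python
from typing import Any, Dict

# Similarity functions accepted for a vector field in an Atlas Vector Search index.
SUPPORTED_SIMILARITIES = {"euclidean", "cosine", "dotProduct"}


def build_vector_index_definition(
    path: str, num_dimensions: int, similarity: str = "dotProduct"
) -> Dict[str, Any]:
    """Return an index definition shaped like the one in this tutorial.

    Hypothetical helper for illustration; it only builds a dictionary.
    """
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")
    if similarity not in SUPPORTED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }
```

For example, build_vector_index_definition("embeddings", 1024) corresponds to the open-source model on this page, and passing 1536 corresponds to the OpenAI model.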
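Create embeddings for vector search queries and run a query.
Create a file named VectorQuery.java and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN (exact nearest neighbor) search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environmental differences can introduce slight variations in your embeddings.
To learn more, see Run Vector Search Queries.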
VectorQuery.java

    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.search.FieldSearchPath;
    import org.bson.BsonArray;
    import org.bson.BsonValue;
    import org.bson.Document;
    import org.bson.conversions.Bson;

    import java.util.ArrayList;
    import java.util.List;

    import static com.mongodb.client.model.Aggregates.project;
    import static com.mongodb.client.model.Aggregates.vectorSearch;
    import static com.mongodb.client.model.Projections.exclude;
    import static com.mongodb.client.model.Projections.fields;
    import static com.mongodb.client.model.Projections.include;
    import static com.mongodb.client.model.Projections.metaVectorSearchScore;
    import static com.mongodb.client.model.search.SearchPath.fieldPath;
    import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
    import static java.util.Arrays.asList;

    public class VectorQuery {
        public static void main(String[] args) {
            String uri = System.getenv("ATLAS_CONNECTION_STRING");
            if (uri == null || uri.isEmpty()) {
                throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
            }

            // establish connection and set namespace
            try (MongoClient mongoClient = MongoClients.create(uri)) {
                MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
                MongoCollection<Document> collection = database.getCollection("listingsAndReviews");

                // define the query and get the embedding
                String query = "beach house";
                EmbeddingProvider embeddingProvider = new EmbeddingProvider();
                BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
                List<Double> embedding = new ArrayList<>();
                // BsonArray is a List<BsonValue>, so iterate it directly (works on JDK 8)
                for (BsonValue value : embeddingBsonArray) {
                    embedding.add(value.asDouble().getValue());
                }

                // define $vectorSearch pipeline
                String indexName = "vector_index";
                FieldSearchPath fieldSearchPath = fieldPath("embeddings");
                int limit = 5;
                List<Bson> pipeline = asList(
                        vectorSearch(
                                fieldSearchPath,
                                embedding,
                                indexName,
                                limit,
                                exactVectorSearchOptions()),
                        project(
                                fields(exclude("_id"), include("summary"),
                                        metaVectorSearchScore("score"))));

                // run query and print results
                List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
                if (results.isEmpty()) {
                    System.out.println("No results found.");
                } else {
                    results.forEach(doc -> {
                        System.out.println("Summary: " + doc.getString("summary"));
                        System.out.println("Score: " + doc.getDouble("score"));
                    });
                }
            } catch (MongoException me) {
                throw new RuntimeException("Failed to connect to MongoDB ", me);
            } catch (Exception e) {
                throw new RuntimeException("Operation failed: ", e);
            }
        }
    }

Save and run the file. The output resembles one of the following, depending on the model you use:
    Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.
    Score: 0.004518083296716213
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.0044807991944253445
    Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.
    Score: 0.004242129623889923
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.004227751865983009
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.004220190457999706

    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.4832950830459595
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.48092085123062134
    Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!
    Score: 0.4629460275173187
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.4581468403339386
    Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.
    Score: 0.45398443937301636
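An ENN (exact nearest neighbor) search compares the query vector against every indexed vector and keeps the closest matches. The following Python sketch illustrates that idea in memory with plain lists and a dot-product similarity. It is a conceptual illustration only, not how Atlas implements $vectorSearch, and the function names dot and exact_search are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def exact_search(
    query: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    limit: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the top `limit`.

    Every vector is compared, so the result is exact; the cost grows
    linearly with the number of documents.
    """
    scored = ((dot(query, vector), text) for text, vector in docs)
    return [(text, score) for score, text in heapq.nlargest(limit, scored)]
```

Because every vector is scored, the ranking is exact, which is the trade-off the exact search option makes against the speed of an approximate search.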
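Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and dotProduct as the similarity measure.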
Create a file named create-index.js and paste the following code.

create-index.js

    import { MongoClient } from 'mongodb';

    // connect to your Atlas deployment
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

    async function run() {
        try {
            const database = client.db("sample_db");
            const collection = database.collection("embeddings");

            // define your Atlas Vector Search index
            const index = {
                name: "vector_index",
                type: "vectorSearch",
                definition: {
                    "fields": [
                        {
                            "type": "vector",
                            "path": "embedding",
                            "similarity": "dotProduct",
                            "numDimensions": <dimensions>
                        }
                    ]
                }
            }

            // run the helper method
            const result = await collection.createSearchIndex(index);
            console.log(result);
        } finally {
            await client.close();
        }
    }
    run().catch(console.dir);

Replace the <dimensions> placeholder value with 768 if you use the open-source model, or with 1536 if you use the OpenAI model. Save the file, and then run the following command:
node --env-file=.env create-index.js
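Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.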
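Because the index is not queryable until the build finishes, a script that creates the index and then queries it right away can return no results. A minimal polling sketch in Python follows. The function wait_until_queryable and its list_indexes parameter are hypothetical names introduced here; the sketch assumes that each index document exposes name and queryable fields, as the Java polling example on this page reads them.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[Mapping]],
    index_name: str,
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the named index reports queryable, or until the timeout.

    `list_indexes` is any zero-argument callable that returns the current
    search index documents (an assumption of this sketch).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in list_indexes():
            if index.get("name") == index_name and index.get("queryable"):
                return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

With PyMongo, the callable could plausibly be lambda: collection.list_search_indexes(); treat that wiring as an assumption to verify against your driver version.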
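Create embeddings for vector search queries and run a query.
Create a file named vector-query.js and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN (exact nearest neighbor) search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environmental differences can introduce slight variations in your embeddings.
To learn more, see Run Vector Search Queries.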
vector-query.js

    import { MongoClient } from 'mongodb';
    import { getEmbedding } from './get-embeddings.js';

    // MongoDB connection URI and options
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

    async function run() {
        try {
            // Connect to the MongoDB client
            await client.connect();

            // Specify the database and collection
            const database = client.db("sample_db");
            const collection = database.collection("embeddings");

            // Generate embedding for the search query
            const queryEmbedding = await getEmbedding("ocean tragedy");

            // Define the sample vector search pipeline
            const pipeline = [
                {
                    $vectorSearch: {
                        index: "vector_index",
                        queryVector: queryEmbedding,
                        path: "embedding",
                        exact: true,
                        limit: 5
                    }
                },
                {
                    $project: {
                        _id: 0,
                        text: 1,
                        score: {
                            $meta: "vectorSearchScore"
                        }
                    }
                }
            ];

            // run pipeline
            const result = collection.aggregate(pipeline);

            // print results
            for await (const doc of result) {
                console.dir(JSON.stringify(doc));
            }
        } finally {
            await client.close();
        }
    }
    run().catch(console.dir);

Save the file, and then run the following command:
    node --env-file=.env vector-query.js

    '{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.5103757977485657}'
    '{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4616812467575073}'
    '{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.4115804433822632}'

    node --env-file=.env vector-query.js

    {"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
    {"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
    {"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
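Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embedding field as the vector type and dotProduct as the similarity measure.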
Create a file named create-index.js and paste the following code.

create-index.js

    import { MongoClient } from 'mongodb';

    // connect to your Atlas deployment
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

    async function run() {
        try {
            const database = client.db("sample_airbnb");
            const collection = database.collection("listingsAndReviews");

            // Define your Atlas Vector Search index
            const index = {
                name: "vector_index",
                type: "vectorSearch",
                definition: {
                    "fields": [
                        {
                            "type": "vector",
                            "path": "embedding",
                            "similarity": "dotProduct",
                            "numDimensions": <dimensions>
                        }
                    ]
                }
            }

            // Call the method to create the index
            const result = await collection.createSearchIndex(index);
            console.log(result);
        } finally {
            await client.close();
        }
    }
    run().catch(console.dir);

Replace the <dimensions> placeholder value with 768 if you use the open-source model, or with 1536 if you use the OpenAI model. Save the file, and then run the following command:
node --env-file=.env create-index.js
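Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.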
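Create embeddings for vector search queries and run a query.
Create a file named vector-query.js and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN (exact nearest neighbor) search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environmental differences can introduce slight variations in your embeddings.
To learn more, see Run Vector Search Queries.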
vector-query.js

    import { MongoClient } from 'mongodb';
    import { getEmbedding } from './get-embeddings.js';

    // MongoDB connection URI and options
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

    async function run() {
        try {
            // Connect to the MongoDB client
            await client.connect();

            // Specify the database and collection
            const database = client.db("sample_airbnb");
            const collection = database.collection("listingsAndReviews");

            // Generate embedding for the search query
            const queryEmbedding = await getEmbedding("beach house");

            // Define the sample vector search pipeline
            const pipeline = [
                {
                    $vectorSearch: {
                        index: "vector_index",
                        queryVector: queryEmbedding,
                        path: "embedding",
                        exact: true,
                        limit: 5
                    }
                },
                {
                    $project: {
                        _id: 0,
                        summary: 1,
                        score: {
                            $meta: "vectorSearchScore"
                        }
                    }
                }
            ];

            // run pipeline
            const result = collection.aggregate(pipeline);

            // print results
            for await (const doc of result) {
                console.dir(JSON.stringify(doc));
            }
        } finally {
            await client.close();
        }
    }
    run().catch(console.dir);

Save the file, and then run the following command:
    node --env-file=.env vector-query.js

    '{"summary":"Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.","score":0.5334879159927368}'
    '{"summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":0.5240535736083984}'
    '{"summary":"The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.","score":0.5232879519462585}'
    '{"summary":"Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.","score":0.5186381340026855}'
    '{"summary":"A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.","score":0.5078228116035461}'

    node --env-file=.env vector-query.js

    {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
    {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
    {"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
    {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
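The queries on this page set exact to true, which requests an ENN search. An approximate (ANN) search omits exact and supplies numCandidates instead. The Python sketch below builds either form of the pipeline as plain dictionaries so the difference is visible side by side. The helper name build_vector_search_pipeline is hypothetical, and the validation rules reflect an assumption about the $vectorSearch stage (numCandidates applies only to ANN and should be at least limit) that you should confirm against the Atlas documentation.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    path: str,
    index: str,
    limit: int,
    project_field: str,
    exact: bool = True,
    num_candidates: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Build a two-stage pipeline: $vectorSearch followed by $project."""
    stage: Dict[str, Any] = {
        "index": index,
        "queryVector": list(query_vector),
        "path": path,
        "limit": limit,
    }
    if exact:
        if num_candidates is not None:
            raise ValueError("omit num_candidates when exact is True")
        stage["exact"] = True  # ENN: every indexed vector is compared
    else:
        if num_candidates is None or num_candidates < limit:
            raise ValueError("ANN search needs num_candidates >= limit")
        stage["numCandidates"] = num_candidates  # ANN: candidate pool size
    return [
        {"$vectorSearch": stage},
        {
            "$project": {
                "_id": 0,
                project_field: 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

The returned list can be passed to an aggregate call in any driver that accepts a pipeline of documents.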
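Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.
This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 768 as the number of dimensions.

    from pymongo.operations import SearchIndexModel

    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
      definition = {
        "fields": [
          {
            "type": "vector",
            "path": "embedding",
            "similarity": "dotProduct",
            "numDimensions": 768
          }
        ]
      },
      name="vector_index",
      type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)

Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.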
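To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.
This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 1536 as the number of dimensions.

    from pymongo.operations import SearchIndexModel

    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
      definition = {
        "fields": [
          {
            "type": "vector",
            "path": "embedding",
            "similarity": "dotProduct",
            "numDimensions": 1536
          }
        ]
      },
      name="vector_index",
      type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)

Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.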
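The index above uses the dotProduct similarity function. As far as the Atlas documentation describes it, dotProduct expects vectors normalized to unit length at both index time and query time; many embedding models already return unit-length vectors, but that is worth checking for the model you use. The following sketch shows a normalization step in plain Python. The helper name normalize is hypothetical.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]
```

If your model does not guarantee unit-length output, applying the same step to stored embeddings and to query embeddings keeps the two comparable.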
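Create embeddings for vector search queries and run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN (exact nearest neighbor) search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environmental differences can introduce slight variations in your embeddings.
To learn more, see Run Vector Search Queries.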
    # Generate embedding for the search query
    query_embedding = get_embedding("ocean tragedy")

    # Sample vector search pipeline
    pipeline = [
       {
          "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "exact": True,
                "limit": 5
          }
       },
       {
          "$project": {
             "_id": 0,
             "text": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)

    # Print results
    for i in results:
       print(i)

    {"text": "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "score": 0.5166476964950562}
    {"text": "Avatar: A marine is dispatched to the moon Pandora on a unique mission", "score": 0.4587385058403015}
    {"text": "The Lion King: Lion cub and future king Simba searches for his identity", "score": 0.41374602913856506}
    # Generate embedding for the search query
    query_embedding = get_embedding("ocean tragedy")

    # Sample vector search pipeline
    pipeline = [
       {
          "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "exact": True,
                "limit": 5
          }
       },
       {
          "$project": {
             "_id": 0,
             "text": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)

    # Print results
    for i in results:
       print(i)

    {"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
    {"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
    {"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
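Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.
This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 768 as the number of dimensions.

    from pymongo.operations import SearchIndexModel

    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
      definition = {
        "fields": [
          {
            "type": "vector",
            "path": "embedding",
            "similarity": "dotProduct",
            "numDimensions": 768
          }
        ]
      },
      name="vector_index",
      type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)

Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.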
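To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.
This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 1536 as the number of dimensions.

    from pymongo.operations import SearchIndexModel

    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
      definition = {
        "fields": [
          {
            "type": "vector",
            "path": "embedding",
            "similarity": "dotProduct",
            "numDimensions": 1536
          }
        ]
      },
      name="vector_index",
      type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)

Run the code.
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.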
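Create embeddings for vector search queries and run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field in your aggregation pipeline.
Runs a sample vector search query. The query uses the $vectorSearch stage to perform an ENN (exact nearest neighbor) search on your embeddings. It returns semantically similar documents in order of relevance, along with their vector search scores.
Note
Your output might vary because environmental differences can introduce slight variations in your embeddings.
To learn more, see Run Vector Search Queries.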
    # Generate embedding for the search query
    query_embedding = get_embedding("beach house")

    # Sample vector search pipeline
    pipeline = [
       {
          "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "exact": True,
                "limit": 5
          }
       },
       {
          "$project": {
             "_id": 0,
             "summary": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)

    # Print results
    for i in results:
       print(i)

    {"summary": "Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.", "score": 0.5372996926307678}
    {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.5297179818153381}
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.5234108567237854}
    {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.5171462893486023}
    {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.5095183253288269}
    # Generate embedding for the search query
    query_embedding = get_embedding("beach house")

    # Sample vector search pipeline
    pipeline = [
       {
          "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "exact": True,
                "limit": 5
          }
       },
       {
          "$project": {
             "_id": 0,
             "summary": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)

    # Print results
    for i in results:
       print(i)

    {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
    {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
    {"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
    {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
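The score field in these results is a normalized value, not the raw similarity. The sketch below assumes the normalization described in the Atlas Vector Search documentation: (1 + s) / 2 for cosine and dotProduct, and 1 / (1 + d) for euclidean distance. The function names are hypothetical, and the sketch makes no claim about which similarity function produced the sample outputs on this page.

```python
def score_from_dot_product(similarity: float) -> float:
    """Map a cosine or dot-product similarity in [-1, 1] to a score in [0, 1].

    Assumes the (1 + s) / 2 normalization described for Atlas Vector Search.
    """
    return (1.0 + similarity) / 2.0


def score_from_euclidean(distance: float) -> float:
    """Map a non-negative euclidean distance to a score in (0, 1].

    Assumes the 1 / (1 + d) normalization described for Atlas Vector Search.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)
```

Under these assumptions a higher score always means a closer match, regardless of the similarity function chosen in the index.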
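Considerations
Consider the following factors when you create vector embeddings:
Choose a method to create embeddings
To create vector embeddings, you must use an embedding model. Embedding models are algorithms that convert data into embeddings. You can choose one of the following methods to connect to an embedding model and create vector embeddings:
Method | Description |
---|---|
Load an open-source model | If you don't have an API key for a proprietary embedding model, load an open-source embedding model locally from your application. |
Use a proprietary model | Most AI providers offer APIs for their proprietary embedding models that you can use to create vector embeddings. |
Leverage an integration | You can integrate Atlas Vector Search with open-source frameworks and AI services to quickly connect to both open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search. To learn more, see Integrate Vector Search with AI Technologies. |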
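Choose an embedding model
The embedding model you choose affects your query results and determines the number of dimensions that you specify in your Atlas Vector Search index. Each model offers different advantages depending on your data and use case.
For a list of popular embedding models, see the Massive Text Embedding Benchmark (MTEB). This list provides insights into a variety of open-source and proprietary text embedding models, and lets you filter models by use case, model type, and specific model metrics.
When you choose an embedding model for Atlas Vector Search, consider the following metrics: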
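Validate your embeddings
Consider the following strategies to ensure that your embeddings are correct and optimal:
Test your functions and scripts.
Generating embeddings takes time and computational resources. Before you create embeddings from a large dataset or collection, test that your embedding function or script works as expected on a small subset of your data.
Create embeddings in batches.
If you generate embeddings from a large dataset or from a collection with many documents, create them in batches to avoid memory issues and optimize performance.
Evaluate performance.
Run test queries to check whether your search results are relevant and accurately ranked.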
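The batching strategy above can be sketched in a few lines of Python. The embed_batch parameter stands in for whichever embedding function you defined earlier and is an assumption of this sketch, as are the helper names batched and embed_in_batches. A default batch size of 32 is an arbitrary starting point, not a recommendation from the source.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed `texts` one batch at a time, preserving input order.

    `embed_batch` is a placeholder for your own embedding function.
    """
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise RuntimeError("embedding function returned an unexpected number of vectors")
        embeddings.extend(vectors)
    return embeddings
```

Running the same function first on a handful of texts, for example embed_in_batches(texts[:5], embed_batch), is a cheap way to apply the small-subset test before processing a full collection.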
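Troubleshooting
If you experience issues with your embeddings, consider the following strategies:
Verify your environment.
Check that the necessary dependencies are installed and up to date. Conflicting library versions can cause unexpected behavior. Create a new environment and install only the required packages to ensure that no conflicts exist.
Note
If you use Colab, ensure that your notebook session's IP address is included in your Atlas project's access list.
Monitor memory usage.
If you experience performance issues, check your RAM, CPU, and disk usage to identify potential bottlenecks. For hosted environments such as Colab or Jupyter Notebooks, ensure that your instance is provisioned with sufficient resources, and upgrade the instance if necessary.
Ensure consistent dimensions.
Verify that the Atlas Vector Search index definition matches the dimensions of the embeddings stored in Atlas, and that your query embeddings match the dimensions of the indexed embeddings. Otherwise, you might encounter errors when you run vector search queries.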
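A dimension mismatch is easy to check before you run a query. The following Python sketch compares the numDimensions value from your index definition with the length of one stored embedding and one query embedding. The function name check_dimensions is hypothetical, and how you fetch the sample stored embedding is left to your own code.

```python
from typing import List, Sequence


def check_dimensions(
    index_dimensions: int,
    stored_embedding: Sequence[float],
    query_embedding: Sequence[float],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems: List[str] = []
    if len(stored_embedding) != index_dimensions:
        problems.append(
            f"stored embedding has {len(stored_embedding)} dimensions, "
            f"index expects {index_dimensions}"
        )
    if len(query_embedding) != index_dimensions:
        problems.append(
            f"query embedding has {len(query_embedding)} dimensions, "
            f"index expects {index_dimensions}"
        )
    return problems
```

For the models on this page, index_dimensions would be 768 or 1024 for the open-source models and 1536 for the OpenAI model, depending on the language tab you followed.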
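Next steps
Now that you've learned how to create embeddings and query them with Atlas Vector Search, you can start building generative AI applications by implementing retrieval-augmented generation (RAG).
You can also convert your embeddings to BSON vectors for efficient storage and ingestion of vectors in Atlas. To learn more, see How to Ingest Pre-Quantized Vectors.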