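How to Create Vector Embeddings
You can store vector embeddings alongside your other data in Atlas. These embeddings capture meaningful relationships in your data, so you can perform semantic search and implement RAG with Atlas Vector Search.
Get Started
Use the following tutorial to learn how to create vector embeddings and query them by using Atlas Vector Search. Specifically, you define a function that generates embeddings and then use it to create embeddings from your data.
For production applications, you typically write a script to generate vector embeddings. You can start with the sample code on this page and customize it for your use case.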
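Prerequisites
To complete this tutorial, you must have the following:
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your C# project.
.NET 8.0 or later installed.
A Hugging Face access token or OpenAI API key.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Go project.
Go installed.
A Hugging Face access token or OpenAI API key.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
Java Development Kit (JDK) version 8 or later.
An environment to set up and run a Java application. We recommend that you use an integrated development environment such as IntelliJ IDEA or Eclipse IDE to configure Maven or Gradle to build and run your project.
One of the following:
An OpenAI API key. You must have a paid OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
A terminal and code editor to run your Node.js project.
npm and Node.js installed.
If you're using OpenAI models, you must have an OpenAI API key.
An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.
An environment to run interactive Python notebooks, such as VS Code or Colab.
If you're using OpenAI models, you must have an OpenAI API key.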
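Define an Embedding Function
Set your environment variables.
Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and Hugging Face access token available to your project.
export HUGGINGFACE_ACCESS_TOKEN="<access-token>"
export ATLAS_CONNECTION_STRING="<connection-string>"
Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format: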
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
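The connection string format above can be assembled programmatically. The following Python sketch is only an illustration: the helper name `build_srv_connection_string` and the sample user, password, and host values are hypothetical and do not come from this tutorial. It percent-encodes the credentials, which matters when a password contains characters such as `@`, `:`, or `/` that would otherwise be read as URI delimiters.

```python
from urllib.parse import quote_plus


def build_srv_connection_string(username: str, password: str, host: str) -> str:
    """Return a mongodb+srv:// URI with percent-encoded credentials.

    `host` is the <clusterName>.<hostname>.mongodb.net part of the URI.
    """
    if not host.endswith(".mongodb.net"):
        raise ValueError("expected an Atlas host ending in .mongodb.net")
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Hypothetical credentials and host, used only to show the encoding.
uri = build_srv_connection_string("app_user", "p@ss:word/1", "cluster0.abcde.mongodb.net")
print(uri)
```

Keeping the encoding in one helper avoids hand-editing the URI whenever a password is rotated.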
Define a function to generate vector embeddings.
Create a new class in a same-named file named AIService.cs and paste the following code. This code defines an async Task named GetEmbeddingsAsync that generates an array of embeddings for an array of given string inputs. This function uses the mxbai-embed-large-v1 embedding model.
namespace MyCompany.Embeddings;

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

public class AIService
{
    private static readonly string? HuggingFaceAccessToken = Environment.GetEnvironmentVariable("HUGGINGFACE_ACCESS_TOKEN");
    private static readonly HttpClient Client = new HttpClient();

    public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
    {
        const string modelName = "mixedbread-ai/mxbai-embed-large-v1";
        const string url = $"https://api-inference.huggingface.co/models/{modelName}";
        Client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", HuggingFaceAccessToken);

        // Send all input texts in a single request.
        var data = new { inputs = texts };
        var dataJson = JsonSerializer.Serialize(data);
        var content = new StringContent(dataJson, null, "application/json");
        var response = await Client.PostAsync(url, content);
        response.EnsureSuccessStatusCode();

        var responseString = await response.Content.ReadAsStringAsync();
        var embeddings = JsonSerializer.Deserialize<float[][]>(responseString);
        if (embeddings is null)
        {
            throw new ApplicationException("Failed to deserialize embeddings response to an array of floats.");
        }

        Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
        var embeddingCount = embeddings.Length;
        foreach (var index in Enumerable.Range(0, embeddingCount))
        {
            // Pair each embedding with the text used to generate it.
            documentData[texts[index]] = embeddings[index];
        }
        return documentData;
    }
}
Note
503 errors when calling Hugging Face models
You might occasionally get 503 errors when calling Hugging Face model hub models. To resolve this issue, retry after a short delay.
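The retry advice in this note can be sketched as a small backoff loop. This Python example is a minimal illustration under stated assumptions: `ServiceUnavailable`, `call_with_retry`, and `fake_embedding_request` are hypothetical names, and the simulated request stands in for a real HTTP call so that the example runs without network access.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error standing in for an HTTP 503 response."""


def call_with_retry(request_fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call request_fn, retrying with exponential backoff on ServiceUnavailable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise  # Give up after the final attempt.
            # Wait 0.5 s, 1 s, 2 s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated endpoint: fails twice with a 503, then returns an embedding.
attempts = {"count": 0}


def fake_embedding_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ServiceUnavailable("503: model is loading")
    return [0.1, 0.2, 0.3]


# Record the delays instead of actually sleeping so the demo runs instantly.
delays = []
result = call_with_retry(fake_embedding_request, sleep=delays.append)
print(result, delays)
```

In a real client you would raise or detect the 503 from the HTTP response and pass the actual request function in place of the simulated one.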
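Set your environment variables.
Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and OpenAI API key available to your project.
export OPENAI_API_KEY="<api-key>"
export ATLAS_CONNECTION_STRING="<connection-string>"
Replace the <api-key> placeholder value with your OpenAI API key.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format: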
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a function to generate vector embeddings.
Create a new class in a same-named file named AIService.cs and paste the following code. This code defines an async Task named GetEmbeddingsAsync that generates an array of embeddings for an array of given string inputs. This function uses OpenAI's text-embedding-3-small model to generate an embedding for each input.
namespace MyCompany.Embeddings;

using OpenAI.Embeddings;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class AIService
{
    private static readonly string? OpenAIApiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
    private static readonly string EmbeddingModelName = "text-embedding-3-small";

    public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
    {
        EmbeddingClient embeddingClient = new(model: EmbeddingModelName, apiKey: OpenAIApiKey);
        Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
        try
        {
            var result = await embeddingClient.GenerateEmbeddingsAsync(texts);
            var embeddingCount = result.Value.Count;
            foreach (var index in Enumerable.Range(0, embeddingCount))
            {
                // Pair each embedding with the text used to generate it.
                documentData[texts[index]] = result.Value[index].ToFloats().ToArray();
            }
        }
        catch (Exception e)
        {
            // Keep the original exception as the inner exception.
            throw new ApplicationException(e.Message, e);
        }
        return documentData;
    }
}
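In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as one from OpenAI.
Note
Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the model.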
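Create a .env file to manage secrets.
In your project, create a .env file to store your Atlas connection string and Hugging Face access token.
HUGGINGFACEHUB_API_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.
Your connection string should use the following format: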
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
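Define a function to generate vector embeddings.
Create a directory named common in your project to store common code that you'll use in later steps:
mkdir common && cd common
Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings to generate embeddings for a given input. This function specifies:
The feature-extraction task, using the Go port of the LangChain library. To learn more, see the Tasks documentation in the LangChain JavaScript documentation.
The mxbai-embed-large-v1 embedding model.
get-embeddings.go
package common

import (
    "context"
    "log"

    "github.com/tmc/langchaingo/embeddings/huggingface"
)

// GetEmbeddings returns one float32 vector per input document.
func GetEmbeddings(documents []string) [][]float32 {
    hf, err := huggingface.NewHuggingface(
        huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"),
        huggingface.WithTask("feature-extraction"))
    if err != nil {
        log.Fatalf("failed to connect to Hugging Face: %v", err)
    }
    embs, err := hf.EmbedDocuments(context.Background(), documents)
    if err != nil {
        log.Fatalf("failed to generate embeddings: %v", err)
    }
    return embs
}
Note
503 errors when calling Hugging Face models
You might occasionally get 503 errors when calling Hugging Face model hub models. To resolve this issue, retry after a short delay.
Move back into the main project root directory.
cd ../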
Create a .env file to manage secrets.
In your project, create a .env file to store your connection string and OpenAI API key.
OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a function to generate vector embeddings.
Create a directory named common in your project to store code that you'll use in multiple steps:
mkdir common && cd common
Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that uses OpenAI's text-embedding-3-small model to generate an embedding for a given input.
get-embeddings.go
package common

import (
    "context"
    "log"

    "github.com/milosgajdos/go-embeddings/openai"
)

// GetEmbeddings returns one float64 vector per input document.
func GetEmbeddings(docs []string) [][]float64 {
    c := openai.NewClient()
    embReq := &openai.EmbeddingRequest{
        Input:          docs,
        Model:          openai.TextSmallV3,
        EncodingFormat: openai.EncodingFloat,
    }
    embs, err := c.Embed(context.Background(), embReq)
    if err != nil {
        log.Fatalf("failed to connect to OpenAI: %v", err)
    }
    var vectors [][]float64
    for _, emb := range embs {
        vectors = append(vectors, emb.Vector)
    }
    return vectors
}
Move back into the main project root directory.
cd ../
Create your Java project and install dependencies.
In your IDE, create a Java project by using Maven or Gradle.
Add the following dependencies, depending on your package manager:
If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file:
pom.xml
<dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongodb-driver-sync</artifactId>
        <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with Hugging Face models -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-hugging-face</artifactId>
        <version>0.35.0</version>
    </dependency>
</dependencies>
If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:
build.gradle
dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with Hugging Face models
    implementation 'dev.langchain4j:langchain4j-hugging-face:0.35.0'
}
Run your package manager to install the dependencies to your project.
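Set your environment variables.
Note
This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.
HUGGING_FACE_ACCESS_TOKEN=<access-token>
ATLAS_CONNECTION_STRING=<connection-string>
Update the placeholders with the following values:
Replace the <access-token> placeholder value with your Hugging Face access token.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: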
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a method to generate vector embeddings.
Create a file named EmbeddingProvider.java and paste the following code.
This code defines two methods to generate embeddings for a given input by using the mxbai-embed-large-v1 open-source embedding model:
Multiple inputs: The getEmbeddings method accepts an array of text inputs (List<String>), allowing you to create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.
Single input: The getEmbedding method accepts a single String, which represents a query that you want to run against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;

import java.util.List;

import static java.time.Duration.ofSeconds;

public class EmbeddingProvider {

    private static HuggingFaceEmbeddingModel embeddingModel;

    // Lazily create the model client and reuse it on later calls.
    private static HuggingFaceEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN");
            if (accessToken == null || accessToken.isEmpty()) {
                throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty.");
            }
            embeddingModel = HuggingFaceEmbeddingModel.builder()
                    .accessToken(accessToken)
                    .modelId("mixedbread-ai/mxbai-embed-large-v1")
                    .waitForModel(true)
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }

    /**
     * Takes an array of strings and returns a BSON array of embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<String> texts) {
        // Note: Stream.toList() requires Java 16 or later.
        List<TextSegment> textSegments = texts.stream()
                .map(TextSegment::from)
                .toList();
        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }

    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}
Create your Java project and install dependencies.
In your IDE, create a Java project by using Maven or Gradle.
Add the following dependencies, depending on your package manager:
If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file:
pom.xml
<dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongodb-driver-sync</artifactId>
        <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with OpenAI models -->
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-open-ai</artifactId>
        <version>0.35.0</version>
    </dependency>
</dependencies>
If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:
build.gradle
dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with OpenAI models
    implementation 'dev.langchain4j:langchain4j-open-ai:0.35.0'
}
Run your package manager to install the dependencies to your project.
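Set your environment variables.
Note
This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.
In your IDE, create a new configuration template and add the following variables to your project:
If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.
If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.
To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.
OPEN_AI_API_KEY=<api-key>
ATLAS_CONNECTION_STRING=<connection-string>
Update the placeholders with the following values:
Replace the <api-key> placeholder value with your OpenAI API key.
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: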
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Define a method to generate vector embeddings.
Create a file named EmbeddingProvider.java and paste the following code.
This code defines two methods to generate embeddings for a given input by using the text-embedding-3-small OpenAI embedding model:
Multiple inputs: The getEmbeddings method accepts an array of text inputs (List<String>), allowing you to create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.
Single input: The getEmbedding method accepts a single String, which represents a query that you want to run against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;

import java.util.List;

import static java.time.Duration.ofSeconds;

public class EmbeddingProvider {

    private static OpenAiEmbeddingModel embeddingModel;

    // Lazily create the model client and reuse it on later calls.
    private static OpenAiEmbeddingModel getEmbeddingModel() {
        if (embeddingModel == null) {
            String apiKey = System.getenv("OPEN_AI_API_KEY");
            if (apiKey == null || apiKey.isEmpty()) {
                throw new IllegalStateException("OPEN_AI_API_KEY env variable is not set or is empty.");
            }
            // Assign the client to the static field so that it is cached.
            embeddingModel = OpenAiEmbeddingModel.builder()
                    .apiKey(apiKey)
                    .modelName("text-embedding-3-small")
                    .timeout(ofSeconds(60))
                    .build();
        }
        return embeddingModel;
    }

    /**
     * Takes an array of strings and returns a BSON array of embeddings to
     * store in the database.
     */
    public List<BsonArray> getEmbeddings(List<String> texts) {
        // Note: Stream.toList() requires Java 16 or later.
        List<TextSegment> textSegments = texts.stream()
                .map(TextSegment::from)
                .toList();
        Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
        return response.content().stream()
                .map(e -> new BsonArray(
                        e.vectorAsList().stream()
                                .map(BsonDouble::new)
                                .toList()))
                .toList();
    }

    /**
     * Takes a single string and returns a BSON array embedding to
     * use in a vector query.
     */
    public BsonArray getEmbedding(String text) {
        Response<Embedding> response = getEmbeddingModel().embed(text);
        return new BsonArray(
                response.content().vectorAsList().stream()
                        .map(BsonDouble::new)
                        .toList());
    }
}
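Update your package.json file.
Configure your project to use ES modules by adding "type": "module" to your package.json file and then saving it.
{
  "type": "module",
  // other fields...
}
Create a .env file.
In your project, create a .env file to store your Atlas connection string.
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: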
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
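Note
Minimum Node.js version requirements
Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.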
Define a function to generate vector embeddings.
Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding to generate an embedding for a given input. This function specifies:
The feature-extraction task from Hugging Face's transformers.js library. To learn more, see Tasks.
The nomic-embed-text-v1 embedding model.
import { pipeline } from '@xenova/transformers';

// Function to generate embeddings for a given data source
export async function getEmbedding(data) {
    const embedder = await pipeline(
        'feature-extraction',
        'Xenova/nomic-embed-text-v1');
    const results = await embedder(data, { pooling: 'mean', normalize: true });
    return Array.from(results.data);
}
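Update your package.json file.
Configure your project to use ES modules by adding "type": "module" to your package.json file and then saving it.
{
  "type": "module",
  // other fields...
}
Create a .env file.
In your project, create a .env file to store your Atlas connection string and OpenAI API key.
OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"
Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format: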
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
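Note
Minimum Node.js version requirements
Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.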
Define a function to generate vector embeddings.
Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding that uses OpenAI's text-embedding-3-small model to generate an embedding for a given input.
import OpenAI from 'openai';

// Setup OpenAI configuration
const openai = new OpenAI({apiKey: process.env.OPENAI_API_KEY});

// Function to get the embeddings using the OpenAI API
export async function getEmbedding(text) {
    const results = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: text,
        encoding_format: "float",
    });
    return results.data[0].embedding;
}
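In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model from Nomic or a proprietary model from OpenAI.
The open-source example also includes a function that converts your embeddings to BSON binData vectors for efficient processing. Only certain embedding models support byte vector outputs. For embedding models that don't, such as models from OpenAI, you can enable automatic quantization when you create the Atlas Vector Search index.
Note
Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the model.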
Define a function to generate vector embeddings.
Paste and run the following code in your notebook to create a function that generates vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:
Loads the nomic-embed-text-v1 embedding model.
Creates a function named get_embedding that uses the model to generate float32 (the default precision), int8, or int1 embeddings for a given text input.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Load the embedding model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Define a function to generate embeddings in multiple precisions
def get_embedding(data, precision="float32"):
    return model.encode(data, precision=precision)
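Define a function to convert vector embeddings.
Paste and run the following code in your notebook to create a function that converts vector embeddings by using the PyMongo driver. The function, named generate_bson_vector, converts full-fidelity embeddings to the BSON float32, int8, and int1 vector subtypes so that vector data can be processed efficiently.
from bson.binary import Binary

# Generate BSON vector using `BinaryVectorDtype`
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)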
Define a function to create documents with embeddings.
Paste and run the following code in your notebook to create a function named create_docs_with_bson_vector_embeddings that creates documents with embeddings.
from bson.binary import Binary

# Function to create documents with BSON vector embeddings
def create_docs_with_bson_vector_embeddings(bson_float32, bson_int8, bson_int1, data):
    docs = []
    for i, (bson_f32_emb, bson_int8_emb, bson_int1_emb, text) in enumerate(zip(bson_float32, bson_int8, bson_int1, data)):
        doc = {
            "_id": i,
            "data": text,
            "BSON-Float32-Embedding": bson_f32_emb,
            "BSON-Int8-Embedding": bson_int8_emb,
            "BSON-Int1-Embedding": bson_int1_emb,
        }
        docs.append(doc)
    return docs
Test the function to generate embeddings.
Paste and run the following code in your notebook to test the function that generates embeddings by using Nomic AI.
This code generates float32, int8, and int1 embeddings for the strings foo and bar.
# Example generating embeddings for the strings "foo" and "bar"
data = ["foo", "bar"]
float32_embeddings = get_embedding(data, "float32")
int8_embeddings = get_embedding(data, "int8")
int1_embeddings = get_embedding(data, "ubinary")

print("Float32 Embedding:", float32_embeddings)
print("Int8 Embedding:", int8_embeddings)
print("Int1 Embedding (binary representation):", int1_embeddings)
Float32 Embedding: [
 [-0.02980827 0.03841474 -0.02561123 ... -0.0532876 -0.0335409 -0.02591543]
 [-0.02748881 0.03717749 -0.03104552 ... 0.02413219 -0.02402252 0.02810651]
]
Int8 Embedding: [
 [-128 127 127 ... -128 -128 -128]
 [ 126 -128 -128 ... 127 126 127]
]
Int1 Embedding (binary representation): [
 [ 77 30 4 131 15 123 146 ... 159 142 205 23 119 120]
 [ 79 82 208 180 45 79 209 ... 158 100 141 189 166 173]
]
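The int1 values in the output above are small unsigned integers because each one holds eight one-bit dimensions packed into a single byte. The following pure-Python sketch illustrates that packing. It assumes that binary quantization keeps one bit per dimension (1 when the value is greater than 0) and packs bits most significant first, which is how I understand the ubinary precision to behave; the text above does not confirm this. The name pack_sign_bits and the sample vector are hypothetical.

```python
def pack_sign_bits(vector):
    """Pack one bit per dimension (1 if value > 0) into bytes, most significant bit first."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift earlier bits left and append the bit for this dimension.
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


# Hypothetical 16-dimension vector: packs into 2 bytes.
sample = [0.3, -0.1, 0.0, 0.7, 0.2, -0.5, -0.9, 0.4,
          -0.2, -0.3, -0.4, -0.5, 0.1, 0.1, 0.1, 0.1]
print(pack_sign_bits(sample))  # [153, 15]
```

Under this scheme, a vector with 768 dimensions would pack into 96 bytes, which is why int1 vectors are much smaller than their float32 counterparts.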
Test the function to convert embeddings to BSON vectors.
Paste and run the following code in your notebook to test the function that converts embeddings to BSON vectors by using the PyMongo driver.
This code converts the float32, int8, and int1 embeddings for the strings foo and bar to BSON vectors.
from bson.binary import Binary, BinaryVectorDtype

bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))

# Print the embeddings
print(f"The converted bson_float32_new_embedding is: {bson_float32_embeddings}")
print(f"The converted bson_int8_new_embedding is: {bson_int8_embeddings}")
print(f"The converted bson_int1_new_embedding is: {bson_int1_embeddings}")
The converted bson_float32_new_embedding is: [Binary(b'\'\x00x0\xf4\ ... x9bL\xd4\xbc', 9), Binary(b'\'\x007 ... \x9e?\xe6<', 9)]
The converted bson_int8_new_embedding is: [Binary(b'\x03\x00\x80\x7f\ ... x80\x80', 9), Binary(b'\x03\x00~\x80 ... \x7f', 9)]
The converted bson_int1_new_embedding is: [Binary(b'\x10\x00M\x1e\ ... 7wx', 9), Binary(b'\x10\x00OR\ ... \xa6\xad', 9)]
Define a function to generate vector embeddings.
Paste and run the following code in your notebook to create a function that generates vector embeddings by using a proprietary embedding model from OpenAI. Replace <api-key> with your OpenAI API key. This code does the following:
Specifies the text-embedding-3-small embedding model.
Creates a function named get_embedding that calls the model's API to generate an embedding for a given text input.
Tests the function by generating a single embedding for the string foo.
import os
from openai import OpenAI

# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()

# Define a function to generate embeddings
def get_embedding(text):
    """Generates vector embeddings for the given text."""
    embedding = openai_client.embeddings.create(input = [text], model=model).data[0].embedding
    return embedding

# Generate an embedding
get_embedding("foo")
[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]
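Create Embeddings from Data
In this section, you create vector embeddings from your data by using the function that you defined, and then you store these embeddings in a collection in Atlas.
Select a tab based on whether you want to create embeddings from new data or from existing data that you already have in Atlas.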
Define a DataService class.
Create a new class in a same-named file named DataService.cs and paste the following code. This code defines an async Task named AddDocumentsAsync that adds documents to Atlas. This function uses the Collection.InsertManyAsync() C# Driver method to insert a list of the BsonDocument type. Each document contains:
A text field that contains the movie summary.
An embedding field that contains the array of floats from generating the vector embedding.
namespace MyCompany.Embeddings;

using MongoDB.Driver;
using MongoDB.Bson;

public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");

    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
        var documents = new List<BsonDocument>();
        // Build one document per (text, embedding) pair.
        foreach (KeyValuePair<string, float[]> entry in embeddings)
        {
            var document = new BsonDocument
            {
                { "text", entry.Key },
                { "embedding", new BsonArray(entry.Value) }
            };
            documents.Add(document);
        }
        await Collection.InsertManyAsync(documents);
        Console.WriteLine($"Successfully inserted {embeddings.Count} documents into Atlas");
        documents.Clear();
    }
}
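The document shape described above (a text field plus an embedding field) can be sketched in Python without a database connection. This is an illustration only: build_documents is a hypothetical helper, and the short vectors are made-up stand-ins for real embeddings. Unlike a dictionary keyed by text, pairing two parallel lists keeps duplicate texts as separate documents, and the length check catches a response that returns fewer embeddings than inputs.

```python
def build_documents(texts, embeddings):
    """Pair each text with its embedding and return insert-ready documents."""
    if len(texts) != len(embeddings):
        raise ValueError(
            f"got {len(texts)} texts but {len(embeddings)} embeddings")
    return [
        # Convert values to plain floats so they serialize as BSON doubles.
        {"text": text, "embedding": [float(value) for value in vector]}
        for text, vector in zip(texts, embeddings)
    ]


# Made-up three-dimension vectors standing in for real model output.
texts = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
documents = build_documents(texts, vectors)
print(documents[0]["text"], len(documents[0]["embedding"]))
```

A list built this way could then be passed to a driver's bulk insert method.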
Update the Program.cs file in your project.
Use the following code to generate embeddings from new data and store them in Atlas.
Specifically, this code uses the GetEmbeddingsAsync function that you defined to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.
using MyCompany.Embeddings;

var aiService = new AIService();
var texts = new string[]
{
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
};
var embeddings = await aiService.GetEmbeddingsAsync(texts);

var dataService = new DataService();
await dataService.AddDocumentsAsync(embeddings);
Compile and run your project.
dotnet run MyCompany.Embeddings.csproj
Successfully inserted 3 documents into Atlas
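You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.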
Define a DataService class.
Create a new class in a same-named file named DataService.cs and paste the following code. This code connects to your Atlas cluster and creates two functions that do the following:
The GetDocuments method gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field and no embeddings field yet.
The AddEmbeddings async Task creates a new embeddings field on documents in the sample_airbnb.listingsAndReviews collection whose summary matches the summary of one of the documents retrieved by the GetDocuments method.
namespace MyCompany.Embeddings;

using MongoDB.Driver;
using MongoDB.Bson;

public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");

    public List<BsonDocument>? GetDocuments()
    {
        // Select documents with a non-empty summary and no embeddings field.
        var filter = Builders<BsonDocument>.Filter.And(
            Builders<BsonDocument>.Filter.And(
                Builders<BsonDocument>.Filter.Exists("summary", true),
                Builders<BsonDocument>.Filter.Ne("summary", "")
            ),
            Builders<BsonDocument>.Filter.Exists("embeddings", false)
        );
        return Collection.Find(filter).Limit(50).ToList();
    }

    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
        var listWrites = new List<WriteModel<BsonDocument>>();
        foreach (var kvp in embeddings)
        {
            // Match each document by the summary text used to generate its embedding.
            var filterForUpdate = Builders<BsonDocument>.Filter.Eq("summary", kvp.Key);
            var updateDefinition = Builders<BsonDocument>.Update.Set("embeddings", kvp.Value);
            listWrites.Add(new UpdateOneModel<BsonDocument>(filterForUpdate, updateDefinition));
        }
        var result = await Collection.BulkWriteAsync(listWrites);
        listWrites.Clear();
        return result.ModifiedCount;
    }
}
Update the Program.cs file in your project.
Use the following code to generate embeddings from an existing collection in Atlas.
Specifically, this code retrieves documents from the sample_airbnb.listingsAndReviews collection, uses the GetEmbeddingsAsync function that you defined to generate an embedding from each document's summary field, and then adds the embeddings to the matching documents.
using MyCompany.Embeddings;

var dataService = new DataService();
var documents = dataService.GetDocuments();
if (documents != null)
{
    Console.WriteLine("Generating embeddings.");
    var aiService = new AIService();
    var summaries = new List<string>();
    foreach (var document in documents)
    {
        var summary = document.GetValue("summary").ToString();
        if (summary != null)
        {
            summaries.Add(summary);
        }
    }
    try
    {
        if (summaries.Count > 0)
        {
            var embeddings = await aiService.GetEmbeddingsAsync(summaries.ToArray());
            try
            {
                var updatedCount = await dataService.AddEmbeddings(embeddings);
                Console.WriteLine($"{updatedCount} documents updated successfully.");
            }
            catch (Exception e)
            {
                Console.WriteLine($"Error adding embeddings to MongoDB: {e.Message}");
            }
        }
    }
    catch (Exception e)
    {
        Console.WriteLine($"Error creating embeddings for summaries: {e.Message}");
    }
}
else
{
    Console.WriteLine("No documents found");
}
Create a file named create-embeddings.go and paste the following code.
Use the following code to generate embeddings from new data and store them in Atlas.
Specifically, this code uses the GetEmbeddings function that you defined and the MongoDB Go Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.
package main

import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

var data = []string{
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}

// Embedding is []float32 to match the GetEmbeddings version that returns [][]float32.
type TextWithEmbedding struct {
    Text      string
    Embedding []float32
}

func main() {
    ctx := context.Background()
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")

    embeddings := common.GetEmbeddings(data)
    docsToInsert := make([]interface{}, len(data))
    for i, text := range data {
        docsToInsert[i] = TextWithEmbedding{
            Text:      text,
            Embedding: embeddings[i],
        }
    }

    result, err := coll.InsertMany(ctx, docsToInsert)
    if err != nil {
        log.Fatalf("failed to insert documents: %v", err)
    }
    fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
package main

import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

var data = []string{
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}

// Embedding is []float64 to match the GetEmbeddings version that returns [][]float64.
type TextWithEmbedding struct {
    Text      string
    Embedding []float64
}

func main() {
    ctx := context.Background()
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")

    embeddings := common.GetEmbeddings(data)
    docsToInsert := make([]interface{}, len(data))
    for i, movie := range data {
        docsToInsert[i] = TextWithEmbedding{
            Text:      movie,
            Embedding: embeddings[i],
        }
    }

    result, err := coll.InsertMany(ctx, docsToInsert)
    if err != nil {
        log.Fatalf("failed to insert documents: %v", err)
    }
    fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
Save and run the file.
go run create-embeddings.go
Successfully inserted 3 documents into Atlas
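You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.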
Create a file named create-embeddings.go and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates embeddings from each document's summary field by using the GetEmbeddings function that you defined.
Updates each document with a new embeddings field that contains the embedding value by using the MongoDB Go Driver.
package main

import (
    "context"
    "log"
    "my-embeddings-project/common"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")

    // Select documents with a non-empty summary and no embeddings field.
    filter := bson.D{
        {"$and",
            bson.A{
                bson.D{
                    {"$and",
                        bson.A{
                            bson.D{{"summary", bson.D{{"$exists", true}}}},
                            bson.D{{"summary", bson.D{{"$ne", ""}}}},
                        },
                    }},
                bson.D{{"embeddings", bson.D{{"$exists", false}}}},
            }},
    }
    opts := options.Find().SetLimit(50)

    cursor, err := coll.Find(ctx, filter, opts)
    if err != nil {
        log.Fatalf("failed to retrieve documents: %v", err)
    }

    var listings []common.Listing
    if err = cursor.All(ctx, &listings); err != nil {
        log.Fatalf("failed to unmarshal retrieved documents to Listing object: %v", err)
    }

    var summaries []string
    for _, listing := range listings {
        summaries = append(summaries, listing.Summary)
    }

    log.Println("Generating embeddings.")
    embeddings := common.GetEmbeddings(summaries)

    // Build one update per listing, matched by _id.
    docsToUpdate := make([]mongo.WriteModel, len(listings))
    for i := range listings {
        docsToUpdate[i] = mongo.NewUpdateOneModel().
            SetFilter(bson.D{{"_id", listings[i].ID}}).
            SetUpdate(bson.D{{"$set", bson.D{{"embeddings", embeddings[i]}}}})
    }

    bulkWriteOptions := options.BulkWrite().SetOrdered(false)
    result, err := coll.BulkWrite(ctx, docsToUpdate, bulkWriteOptions)
    if err != nil {
        log.Fatalf("failed to write embeddings to existing documents: %v", err)
    }
    log.Printf("Successfully added embeddings to %v documents", result.ModifiedCount)
}
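The selection step above (documents that have a non-empty summary and no embeddings field yet) can be illustrated in Python. In this sketch, FLAT_FILTER is a flattened query document that relies on MongoDB's implicit AND between conditions, so it should select the same documents as the nested $and form. The needs_embedding function and the sample listings are hypothetical; the function mimics only these three conditions in memory and is not a general query matcher.

```python
# Flattened form of the query: conditions in one document are ANDed implicitly.
FLAT_FILTER = {
    "summary": {"$exists": True, "$ne": ""},
    "embeddings": {"$exists": False},
}


def needs_embedding(doc):
    """Return True if doc has a non-empty summary and no embeddings field."""
    has_summary = "summary" in doc and doc["summary"] != ""
    return has_summary and "embeddings" not in doc


# Hypothetical listings covering each case the filter distinguishes.
listings = [
    {"_id": "1", "summary": "Cozy loft near the harbor"},
    {"_id": "2", "summary": ""},
    {"_id": "3"},
    {"_id": "4", "summary": "Quiet garden flat", "embeddings": [0.1, 0.2]},
]
pending = [doc["_id"] for doc in listings if needs_embedding(doc)]
print(pending)  # ['1']
```

Excluding documents that already have an embeddings field means that rerunning the script processes the next batch instead of regenerating the same embeddings.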
Create a file that contains Go models for the collection.
To simplify marshalling and unmarshalling Go objects to and from BSON, create a file that contains models for the documents in this collection.
Move into the common directory.
cd common
Create a file named models.go and paste the following code into it:
models.go
package common

import (
    "time"

    "go.mongodb.org/mongo-driver/bson/primitive"
)

type Image struct {
    ThumbnailURL string `bson:"thumbnail_url"`
    MediumURL    string `bson:"medium_url"`
    PictureURL   string `bson:"picture_url"`
    XLPictureURL string `bson:"xl_picture_url"`
}

type Host struct {
    ID                 string   `bson:"host_id"`
    URL                string   `bson:"host_url"`
    Name               string   `bson:"host_name"`
    Location           string   `bson:"host_location"`
    About              string   `bson:"host_about"`
    ThumbnailURL       string   `bson:"host_thumbnail_url"`
    PictureURL         string   `bson:"host_picture_url"`
    Neighborhood       string   `bson:"host_neighborhood"`
    IsSuperhost        bool     `bson:"host_is_superhost"`
    HasProfilePic      bool     `bson:"host_has_profile_pic"`
    IdentityVerified   bool     `bson:"host_identity_verified"`
    ListingsCount      int32    `bson:"host_listings_count"`
    TotalListingsCount int32    `bson:"host_total_listings_count"`
    Verifications      []string `bson:"host_verifications"`
}

type Location struct {
    Type            string    `bson:"type"`
    Coordinates     []float64 `bson:"coordinates"`
    IsLocationExact bool      `bson:"is_location_exact"`
}

type Address struct {
    Street         string   `bson:"street"`
    Suburb         string   `bson:"suburb"`
    GovernmentArea string   `bson:"government_area"`
    Market         string   `bson:"market"`
    Country        string   `bson:"country"`
    CountryCode    string   `bson:"country_code"`
    Location       Location `bson:"location"`
}

type Availability struct {
    Thirty         int32 `bson:"availability_30"`
    Sixty          int32 `bson:"availability_60"`
    Ninety         int32 `bson:"availability_90"`
    ThreeSixtyFive int32 `bson:"availability_365"`
}

type ReviewScores struct {
    Accuracy      int32 `bson:"review_scores_accuracy"`
    Cleanliness   int32 `bson:"review_scores_cleanliness"`
    CheckIn       int32 `bson:"review_scores_checkin"`
    Communication int32 `bson:"review_scores_communication"`
    Location      int32 `bson:"review_scores_location"`
    Value         int32 `bson:"review_scores_value"`
    Rating        int32 `bson:"review_scores_rating"`
}

type Review struct {
    ID           string    `bson:"_id"`
    Date         time.Time `bson:"date,omitempty"`
    ListingId    string    `bson:"listing_id"`
    ReviewerId   string    `bson:"reviewer_id"`
    ReviewerName string    `bson:"reviewer_name"`
    Comments     string    `bson:"comments"`
}

type Listing struct {
    ID                   string               `bson:"_id"`
    ListingURL           string               `bson:"listing_url"`
    Name                 string               `bson:"name"`
    Summary              string               `bson:"summary"`
    Space                string               `bson:"space"`
    Description          string               `bson:"description"`
    NeighborhoodOverview string               `bson:"neighborhood_overview"`
    Notes                string               `bson:"notes"`
    Transit              string               `bson:"transit"`
    Access               string               `bson:"access"`
    Interaction          string               `bson:"interaction"`
    HouseRules           string               `bson:"house_rules"`
    PropertyType         string               `bson:"property_type"`
    RoomType             string               `bson:"room_type"`
    BedType              string               `bson:"bed_type"`
    MinimumNights        string               `bson:"minimum_nights"`
    MaximumNights        string               `bson:"maximum_nights"`
    CancellationPolicy   string               `bson:"cancellation_policy"`
    LastScraped          time.Time            `bson:"last_scraped,omitempty"`
    CalendarLastScraped  time.Time            `bson:"calendar_last_scraped,omitempty"`
    FirstReview          time.Time            `bson:"first_review,omitempty"`
    LastReview           time.Time            `bson:"last_review,omitempty"`
    Accommodates         int32                `bson:"accommodates"`
    Bedrooms             int32                `bson:"bedrooms"`
    Beds                 int32                `bson:"beds"`
    NumberOfReviews      int32                `bson:"number_of_reviews"`
    Bathrooms            primitive.Decimal128 `bson:"bathrooms"`
    Amenities            []string             `bson:"amenities"`
    Price                primitive.Decimal128 `bson:"price"`
    WeeklyPrice          primitive.Decimal128 `bson:"weekly_price"`
    MonthlyPrice         primitive.Decimal128 `bson:"monthly_price"`
    CleaningFee          primitive.Decimal128 `bson:"cleaning_fee"`
    ExtraPeople          primitive.Decimal128 `bson:"extra_people"`
    GuestsIncluded       primitive.Decimal128 `bson:"guests_included"`
    Image                Image                `bson:"images"`
    Host                 Host                 `bson:"host"`
    Address              Address              `bson:"address"`
    Availability         Availability         `bson:"availability"`
    ReviewScores         ReviewScores         `bson:"review_scores"`
    Reviews              []Review             `bson:"reviews"`
    Embeddings           []float32            `bson:"embeddings,omitempty"`
}
models.go
package common

import (
    "time"

    "go.mongodb.org/mongo-driver/bson/primitive"
)

type Image struct {
    ThumbnailURL string
`bson:"thumbnail_url"` MediumURL string `bson:"medium_url"` PictureURL string `bson:"picture_url"` XLPictureURL string `bson:"xl_picture_url"` } type Host struct { ID string `bson:"host_id"` URL string `bson:"host_url"` Name string `bson:"host_name"` Location string `bson:"host_location"` About string `bson:"host_about"` ThumbnailURL string `bson:"host_thumbnail_url"` PictureURL string `bson:"host_picture_url"` Neighborhood string `bson:"host_neighborhood"` IsSuperhost bool `bson:"host_is_superhost"` HasProfilePic bool `bson:"host_has_profile_pic"` IdentityVerified bool `bson:"host_identity_verified"` ListingsCount int32 `bson:"host_listings_count"` TotalListingsCount int32 `bson:"host_total_listings_count"` Verifications []string `bson:"host_verifications"` } type Location struct { Type string `bson:"type"` Coordinates []float64 `bson:"coordinates"` IsLocationExact bool `bson:"is_location_exact"` } type Address struct { Street string `bson:"street"` Suburb string `bson:"suburb"` GovernmentArea string `bson:"government_area"` Market string `bson:"market"` Country string `bson:"Country"` CountryCode string `bson:"country_code"` Location Location `bson:"location"` } type Availability struct { Thirty int32 `bson:"availability_30"` Sixty int32 `bson:"availability_60"` Ninety int32 `bson:"availability_90"` ThreeSixtyFive int32 `bson:"availability_365"` } type ReviewScores struct { Accuracy int32 `bson:"review_scores_accuracy"` Cleanliness int32 `bson:"review_scores_cleanliness"` CheckIn int32 `bson:"review_scores_checkin"` Communication int32 `bson:"review_scores_communication"` Location int32 `bson:"review_scores_location"` Value int32 `bson:"review_scores_value"` Rating int32 `bson:"review_scores_rating"` } type Review struct { ID string `bson:"_id"` Date time.Time `bson:"date,omitempty"` ListingId string `bson:"listing_id"` ReviewerId string `bson:"reviewer_id"` ReviewerName string `bson:"reviewer_name"` Comments string `bson:"comments"` } type Listing struct { ID 
string `bson:"_id"` ListingURL string `bson:"listing_url"` Name string `bson:"name"` Summary string `bson:"summary"` Space string `bson:"space"` Description string `bson:"description"` NeighborhoodOverview string `bson:"neighborhood_overview"` Notes string `bson:"notes"` Transit string `bson:"transit"` Access string `bson:"access"` Interaction string `bson:"interaction"` HouseRules string `bson:"house_rules"` PropertyType string `bson:"property_type"` RoomType string `bson:"room_type"` BedType string `bson:"bed_type"` MinimumNights string `bson:"minimum_nights"` MaximumNights string `bson:"maximum_nights"` CancellationPolicy string `bson:"cancellation_policy"` LastScraped time.Time `bson:"last_scraped,omitempty"` CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"` FirstReview time.Time `bson:"first_review,omitempty"` LastReview time.Time `bson:"last_review,omitempty"` Accommodates int32 `bson:"accommodates"` Bedrooms int32 `bson:"bedrooms"` Beds int32 `bson:"beds"` NumberOfReviews int32 `bson:"number_of_reviews"` Bathrooms primitive.Decimal128 `bson:"bathrooms"` Amenities []string `bson:"amenities"` Price primitive.Decimal128 `bson:"price"` WeeklyPrice primitive.Decimal128 `bson:"weekly_price"` MonthlyPrice primitive.Decimal128 `bson:"monthly_price"` CleaningFee primitive.Decimal128 `bson:"cleaning_fee"` ExtraPeople primitive.Decimal128 `bson:"extra_people"` GuestsIncluded primitive.Decimal128 `bson:"guests_included"` Image Image `bson:"images"` Host Host `bson:"host"` Address Address `bson:"address"` Availability Availability `bson:"availability"` ReviewScores ReviewScores `bson:"review_scores"` Reviews []Review `bson:"reviews"` Embeddings []float64 `bson:"embeddings,omitempty"` } 返回到项目根目录。
cd ../
Generate the embeddings.
go run create-embeddings.go
2024/10/10 09:58:03 Generating embeddings.
2024/10/10 09:58:12 Successfully added embeddings to 50 documents
You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI.
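Define the code to generate embeddings from sample text and ingest them into Atlas.
Create a file named CreateEmbeddings.java and paste the following code.
This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:
Connect to your Atlas cluster.
Get the array of sample texts.
Generate an embedding from each text by using the getEmbeddings method that you defined previously.
Ingest the embeddings into the sample_db.embeddings collection in Atlas.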
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.result.InsertManyResult;
import org.bson.BsonArray;
import org.bson.Document;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CreateEmbeddings {

    static List<String> data = Arrays.asList(
        "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
        "The Lion King: Lion cub and future king Simba searches for his identity",
        "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
    );

    public static void main(String[] args){
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");

            System.out.println("Creating embeddings for " + data.size() + " documents");
            EmbeddingProvider embeddingProvider = new EmbeddingProvider();

            // generate embeddings for new inputted data
            List<BsonArray> embeddings = embeddingProvider.getEmbeddings(data);

            List<Document> documents = new ArrayList<>();
            int i = 0;
            for (String text : data) {
                Document doc = new Document("text", text).append("embedding", embeddings.get(i));
                documents.add(doc);
                i++;
            }

            // insert the embeddings into the Atlas collection
            List<String> insertedIds = new ArrayList<>();
            try {
                InsertManyResult result = collection.insertMany(documents);
                result.getInsertedIds().values()
                    .forEach(doc -> insertedIds.add(doc.toString()));
                System.out.println("Inserted " + insertedIds.size() + " documents with the following ids to " + collection.getNamespace() + " collection: \n " + insertedIds);
            } catch (MongoException me) {
                throw new RuntimeException("Failed to insert documents", me);
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}
Generate the embeddings.
Save and run the file. The output resembles the following:
Creating embeddings for 3 documents
Inserted 3 documents with the following ids to sample_db.embeddings collection:
 [BsonObjectId{value=6735ff620d88451041f6dd40}, BsonObjectId{value=6735ff620d88451041f6dd41}, BsonObjectId{value=6735ff620d88451041f6dd42}]
You can also view the vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.
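Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.
Define the code to generate embeddings from an existing collection in Atlas.
Create a file named CreateEmbeddings.java and paste the following code.
This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:
Connect to your Atlas cluster.
Get a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field. The code limits the subset to 50 documents that do not yet have an embeddings field.
Generate an embedding from each document's summary field by using the getEmbeddings method that you defined previously.
Update each document with a new embeddings field that contains the embedding value.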
import com.mongodb.MongoException;
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonArray;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.ArrayList;
import java.util.List;

public class CreateEmbeddings {

    public static void main(String[] args){
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
            MongoCollection<Document> collection = database.getCollection("listingsAndReviews");

            Bson filterCriteria = Filters.and(
                Filters.and(Filters.exists("summary"),
                    Filters.ne("summary", null),
                    Filters.ne("summary", "")),
                Filters.exists("embeddings", false));

            try (MongoCursor<Document> cursor = collection.find(filterCriteria).limit(50).iterator()) {
                List<String> summaries = new ArrayList<>();
                List<String> documentIds = new ArrayList<>();
                int i = 0;
                while (cursor.hasNext()) {
                    Document document = cursor.next();
                    String summary = document.getString("summary");
                    String id = document.get("_id").toString();
                    summaries.add(summary);
                    documentIds.add(id);
                    i++;
                }

                System.out.println("Generating embeddings for " + summaries.size() + " documents.");
                System.out.println("This operation may take up to several minutes.");

                EmbeddingProvider embeddingProvider = new EmbeddingProvider();
                List<BsonArray> embeddings = embeddingProvider.getEmbeddings(summaries);

                List<WriteModel<Document>> updateDocuments = new ArrayList<>();
                for (int j = 0; j < summaries.size(); j++) {
                    UpdateOneModel<Document> updateDoc = new UpdateOneModel<>(
                        Filters.eq("_id", documentIds.get(j)),
                        Updates.set("embeddings", embeddings.get(j)));
                    updateDocuments.add(updateDoc);
                }

                int updatedDocsCount = 0;
                try {
                    BulkWriteOptions options = new BulkWriteOptions().ordered(false);
                    BulkWriteResult result = collection.bulkWrite(updateDocuments, options);
                    updatedDocsCount = result.getModifiedCount();
                } catch (MongoException me) {
                    throw new RuntimeException("Failed to insert documents", me);
                }
                System.out.println("Added embeddings successfully to " + updatedDocsCount + " documents.");
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}
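Generate the embeddings.
Save and run the file. The output resembles the following:
Generating embeddings for 50 documents.
This operation may take up to several minutes.
Added embeddings successfully to 50 documents.
You can also view the vector embeddings in the Atlas UI by navigating to the sample_airbnb.listingsAndReviews collection in your cluster.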
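The bulk update that the Java code assembles can be easier to follow when it is written out as plain filter and update documents. The Python sketch below builds those pairs without touching a database. The helper name build_updates is hypothetical, and the ids and vectors are made-up stand-ins rather than values from the sample data.

```python
def build_updates(document_ids, embeddings):
    """Return one (filter, update) pair per document id.

    Each pair matches a document by _id and sets its embeddings field,
    mirroring Filters.eq("_id", ...) and Updates.set("embeddings", ...).
    """
    if len(document_ids) != len(embeddings):
        raise ValueError("each document id needs exactly one embedding")
    return [
        ({"_id": doc_id}, {"$set": {"embeddings": vector}})
        for doc_id, vector in zip(document_ids, embeddings)
    ]


# Made-up ids and short vectors, for illustration only.
pairs = build_updates(["10006546", "10009999"], [[0.1, 0.2], [0.3, 0.4]])
print(pairs[0])
```

The length check is there because pairing by position silently attaches vectors to the wrong documents if the embedding call returns fewer results than the number of summaries sent.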
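Create a file named create-embeddings.js and paste the following code.
Use the following code to generate embeddings from new data.
Specifically, this code uses the getEmbedding function that you defined and the MongoDB Node.js Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.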
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';

// Data to embed
const data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

async function run() {

    // Connect to your Atlas cluster
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

    try {
        await client.connect();
        const db = client.db("sample_db");
        const collection = db.collection("embeddings");

        await Promise.all(data.map(async text => {
            // Check if the document already exists
            const existingDoc = await collection.findOne({ text: text });

            // Generate an embedding by using the function that you defined
            const embedding = await getEmbedding(text);

            // Ingest data and embedding into Atlas
            if (!existingDoc) {
                await collection.insertOne({
                    text: text,
                    embedding: embedding
                });
                console.log(embedding);
            }
        }));
    } catch (err) {
        console.log(err.stack);
    }
    finally {
        await client.close();
    }
}
run().catch(console.dir);
Save and run the file.
node --env-file=.env create-embeddings.js
[ -0.04323853924870491, -0.008460805751383305, 0.012494648806750774, -0.013014335185289383, ... ]
[ -0.017400473356246948, 0.04922063276171684, -0.002836339408531785, -0.030395228415727615, ... ]
[ -0.016950927674770355, 0.013881809078156948, -0.022074559703469276, -0.02838018536567688, ... ]
node --env-file=.env create-embeddings.js
[ 0.031927742, -0.014192767, -0.021851597, 0.045498233, -0.0077904654, ... ]
[ -0.01664538, 0.013198251, 0.048684783, 0.014485021, -0.018121032, ... ]
[ 0.030449908, 0.046782598, 0.02126599, 0.025799986, -0.015830345, ... ]
Note
The number of dimensions in the output has been truncated for readability.
You can also view the vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.
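Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.
Create a file named create-embeddings.js and paste the following code.
Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates an embedding from each document's summary field by using the getEmbedding function that you defined.
Updates each document with a new embedding field that contains the embedding value by using the MongoDB Node.js Driver.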
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';

// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

async function run() {
    try {
        await client.connect();
        const db = client.db("sample_airbnb");
        const collection = db.collection("listingsAndReviews");

        // Filter to exclude null or empty summary fields
        const filter = { "summary": { "$nin": [ null, "" ] } };

        // Get a subset of documents from the collection
        const documents = await collection.find(filter).limit(50).toArray();

        // Create embeddings from a field in the collection
        let updatedDocCount = 0;
        console.log("Generating embeddings for documents...");
        await Promise.all(documents.map(async doc => {
            // Generate an embedding by using the function that you defined
            const embedding = await getEmbedding(doc.summary);

            // Update the document with a new embedding field
            await collection.updateOne({ "_id": doc._id },
                {
                    "$set": {
                        "embedding": embedding
                    }
                }
            );
            updatedDocCount += 1;
        }));
        console.log("Count of documents updated: " + updatedDocCount);
    } catch (err) {
        console.log(err.stack);
    }
    finally {
        await client.close();
    }
}
run().catch(console.dir);
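Save and run the file.
node --env-file=.env create-embeddings.js
Generating embeddings for documents...
Count of documents updated: 50
You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI and expanding the fields in a document.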
Load the sample data.
Paste and run the following code in your notebook:
# Sample data
sentences = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission",
    "Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.",
    "The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.",
    "Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.",
    "Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.",
    "The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.",
    "Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.",
    "The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.",
    "Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.",
    "The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.",
    "Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.",
    "The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.",
    "E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.",
    "Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.",
    "Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.",
    "Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.",
    "Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.",
    "Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory."
]
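Generate embeddings for the data.
Use the following code to generate embeddings from new data.
Specifically, this code uses the get_embedding function that you defined and the PyMongo driver to generate float32, int8, and ubinary embeddings from an array of sample texts.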
import pymongo
from sentence_transformers.quantization import quantize_embeddings

float32_embeddings = get_embedding(sentences, precision="float32")
int8_embeddings = get_embedding(sentences, precision="int8")
int1_embeddings = get_embedding(sentences, precision="ubinary")

# Print stored embeddings
print("Generated embeddings stored in different variables:")
for i, text in enumerate(sentences):
    print(f"\nText: {text}")
    print(f"Float32 Embedding: {float32_embeddings[i][:3]}... (truncated)")
    print(f"Int8 Embedding: {int8_embeddings[i][:3]}... (truncated)")
    print(f"Ubinary Embedding: {int1_embeddings[i][:3]}... (truncated)")
Generated embeddings stored in different variables: Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built Float32 Embedding: [-0.01089042 0.05926645 -0.00291325]... (truncated) Int8 Embedding: [-15 127 56]... (truncated) Ubinary Embedding: [ 77 30 209]... (truncated) Text: The Lion King: Lion cub and future king Simba searches for his identity Float32 Embedding: [-0.05607051 -0.01360618 0.00523855]... (truncated) Int8 Embedding: [-128 -109 110]... (truncated) Ubinary Embedding: [ 37 18 151]... (truncated) Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission Float32 Embedding: [-0.0275258 0.01144342 -0.02360895]... (truncated) Int8 Embedding: [-57 -28 -79]... (truncated) Ubinary Embedding: [ 76 16 144]... (truncated) Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious. Float32 Embedding: [-0.01759741 0.03254957 -0.02090798]... (truncated) Int8 Embedding: [-32 40 -61]... (truncated) Ubinary Embedding: [ 77 27 176]... (truncated) Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son. Float32 Embedding: [ 0.00503172 0.04311579 -0.00074904]... (truncated) Int8 Embedding: [23 74 70]... (truncated) Ubinary Embedding: [215 26 145]... (truncated) Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life. Float32 Embedding: [0.02349479 0.05669326 0.00458773]... (truncated) Int8 Embedding: [ 69 118 105]... (truncated) Ubinary Embedding: [237 154 159]... (truncated) Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry. Float32 Embedding: [-0.03294644 0.02671233 -0.01864981]... (truncated) Int8 Embedding: [-70 21 -47]... (truncated) Ubinary Embedding: [ 77 90 146]... 
(truncated) Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers. Float32 Embedding: [-0.02489671 0.02847196 -0.00290637]... (truncated) Int8 Embedding: [-50 27 56]... (truncated) Ubinary Embedding: [ 95 154 129]... (truncated) Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire. Float32 Embedding: [-0.01235448 0.01524397 -0.01063425]... (truncated) Int8 Embedding: [-19 -15 5]... (truncated) Ubinary Embedding: [ 68 26 210]... (truncated) Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover. Float32 Embedding: [ 0.04665203 0.01392298 -0.01743002]... (truncated) Int8 Embedding: [127 -20 -39]... (truncated) Ubinary Embedding: [207 88 208]... (truncated) Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past. Float32 Embedding: [0.00929601 0.04206405 0.00701248]... (truncated) Int8 Embedding: [ 34 71 121]... (truncated) Ubinary Embedding: [228 90 130]... (truncated) Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy. Float32 Embedding: [-0.01451324 -0.00897367 0.0077793 ]... (truncated) Int8 Embedding: [-24 -94 127]... (truncated) Ubinary Embedding: [ 57 150 32]... (truncated) Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love. Float32 Embedding: [-0.01458643 0.03639758 -0.02587282]... (truncated) Int8 Embedding: [-25 52 -94]... (truncated) Ubinary Embedding: [ 78 218 216]... (truncated) Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer. Float32 Embedding: [-0.00205381 -0.00039482 -0.01630799]... (truncated) Int8 Embedding: [ 6 -66 -31]... (truncated) Ubinary Embedding: [ 9 82 154]... (truncated) Text: E.T. 
the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home. Float32 Embedding: [ 0.01105334 0.00776658 -0.03092942]... (truncated) Int8 Embedding: [ 38 -40 -128]... (truncated) Ubinary Embedding: [205 24 146]... (truncated) Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action. Float32 Embedding: [ 0.00266668 -0.01926583 -0.00727963]... (truncated) Int8 Embedding: [ 17 -128 27]... (truncated) Ubinary Embedding: [148 82 194]... (truncated) Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family. Float32 Embedding: [-0.00031873 -0.01352339 -0.02882693]... (truncated) Int8 Embedding: [ 10 -109 -114]... (truncated) Ubinary Embedding: [ 12 26 144]... (truncated) Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion. Float32 Embedding: [ 0.00957429 0.01855557 -0.02353773]... (truncated) Int8 Embedding: [ 34 -5 -79]... (truncated) Ubinary Embedding: [212 18 144]... (truncated) Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude. Float32 Embedding: [-0.01787405 0.03672816 -0.00972007]... (truncated) Int8 Embedding: [-33 53 11]... (truncated) Ubinary Embedding: [ 68 154 145]... (truncated) Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory. Float32 Embedding: [-0.03515214 -0.00503571 0.00183181]... (truncated) Int8 Embedding: [-76 -81 87]... (truncated) Ubinary Embedding: [ 35 222 152]... (truncated)
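The Ubinary values printed above fall between 0 and 255 because each one stores eight one-bit dimensions packed into a single byte. The sketch below illustrates that packing in plain Python. It assumes the convention used by numpy.packbits (a value greater than zero becomes bit 1, and bits are packed most significant bit first); whether the ubinary precision applies exactly this rule is an assumption here. The helper name pack_bits is hypothetical and is not part of the get_embedding function from this tutorial.

```python
def pack_bits(values):
    """Pack a sequence of floats into byte values, one bit per float.

    Assumption: a value greater than zero maps to bit 1, anything else
    to bit 0, and bits are packed most-significant-bit first.
    """
    if len(values) % 8 != 0:
        raise ValueError("number of dimensions must be a multiple of 8")
    packed = []
    for start in range(0, len(values), 8):
        byte = 0
        for value in values[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


# Eight float dimensions collapse into a single byte value.
example = [0.1, -0.2, 0.3, 0.4, -0.5, -0.6, 0.7, 0.8]
print(pack_bits(example))  # [179]
```

Under this scheme a vector with 1,024 float dimensions would pack into 128 byte values, which is why the packed representation is so much smaller than the float32 one.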
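Generate BSON vectors from your embeddings.
Use the following code to convert the generated vector embeddings to BSON vectors. The code uses the PyMongo driver.
Specifically, this code converts the generated embeddings to BSON vectors of the float32, int8, and bit-packed int1 types, and then prints the resulting float32, int8, and int1 vectors.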
from bson.binary import Binary, BinaryVectorDtype

bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []

# Convert each embedding to BSON
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))

# Print the embeddings
for idx, text in enumerate(sentences):
    print(f"\nText: {text}")
    print(f"Float32 BSON: {bson_float32_embeddings[idx]}")
    print(f"Int8 BSON: {bson_int8_embeddings[idx]}")
    print(f"Int1 BSON: {bson_int1_embeddings[idx]}")
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built Float32 BSON: b'\'\x00\xbam2\xbc`\xc1r=7\xec>\xbb\xe6\xf3\x...' Int8 BSON: b'\x03\x00\xf1\x7f8\xdf\xfeC\x1e>\xef\xd6\xf5\x9...' Int1 BSON: b'\x10\x00M\x1e\xd1\xd2\x05\xaeq\xdf\x9a\x1d\xbc...' Text: The Lion King: Lion cub and future king Simba searches for his identity Float32 BSON: b'\'\x001\xaae\xbdr\xec^\xbc"\xa8\xab;\x91\xd...' Int8 BSON: b'\x03\x00\x80\x93n\x06\x80\xca\xd3.\xa2\xe3\xd1...' Int1 BSON: b'\x10\x00%\x12\x97\xa6\x8f\xdf\x89\x9d2\xcb\x99...' Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission Float32 BSON: b'\'\x00\xcc}\xe1\xbc-};<\x8eg\xc1\xbc\xcb\xd...' Int8 BSON: b'\x03\x00\xc7\xe4\xb1\xdf/\xe2\xd2\x90\xf7\x02|...' Int1 BSON: b'\x10\x00L\x10\x90\xb6\x0f\x8a\x91\xaf\x92|\xf9...' Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious. Float32 BSON: b'\'\x00o(\x90\xbc\xb3R\x05=8G\xab\xbc\xfb\xc...' Int8 BSON: b'\x03\x00\xe0(\xc3\x10*\xda\xfe\x19\xbf&<\xd1\x...' Int1 BSON: b'\x10\x00M\x1b\xb0\x86\rn\x93\xaf:w\x9f}\x92\xd...' Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son. Float32 BSON: b'\'\x00\x1d\xe1\xa4;0\x9a0=C[D\xba\xb5\xf2\x...' Int8 BSON: b'\x03\x00\x17JF2\xb9\xddZ8\xa1\x0c\xc6\x80\xd8$...' Int1 BSON: b'\x10\x00\xd7\x1a\x91\x87\x0e\xc9\x91\x8b\xba\x...' Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life. Float32 BSON: b'\'\x00#x\xc0<27h=\xb5T\x96;:\xc4\x9c\xbd\x1...' Int8 BSON: b'\x03\x00Evi\x80\x13\xd6\x1cCW\x80\x01\x9e\xe58...' Int1 BSON: b'\x10\x00\xed\x9a\x9f\x97\x1f.\x12\xf9\xba];\x7...' Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry. Float32 BSON: b'\'\x00\xd9\xf2\x06\xbd\xd2\xd3\xda<\x7f\xc7...' Int8 BSON: b'\x03\x00\xba\x15\xd1-\x0c\x03\xe6\xea\rQ\x1f\x...' 
Int1 BSON: b'\x10\x00MZ\x92\xb7#\xaa\x99=\x9a\x99\x9c|<\xf8...' Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers. Float32 BSON: b'\'\x00/\xf4\xcb\xbc\t>\xe9<\xc9x>\xbb\xcc\x...' Int8 BSON: b'\x03\x00\xce\x1b815\xcf1\xc6s\xe5\n\xe4\x192G\...' Int1 BSON: b'\x10\x00_\x9a\x81\xa6\x0f\x0f\x93o2\xd8\xfe|\x...' Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire. Float32 BSON: b'\'\x00sjJ\xbc\xd6\xc1y<I;.\xbc\xb1\x80\t\xb...' Int8 BSON: b'\x03\x00\xed\xf1\x05\xe2\xc7\xfa\xd4\xab5\xeb\...' Int1 BSON: b'\x10\x00D\x1a\xd2\x86\x0ey\x92\x8f\xaa\x89\x1c...' Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover. Float32 BSON: b'\'\x004\x16?=9\x1dd<g\xc9\x8e\xbc\xdf\x81\x...' Int8 BSON: b'\x03\x00\x7f\xec\xd9\xdc)\xd6)\x05\x18\x7f\xa6...' Int1 BSON: b"\x10\x00\xcfX\xd0\xb7\x0e\xcf\xd9\r\xf0U\xb4]6..." Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past. Float32 BSON: b'\'\x00HN\x18<ZK,=\xf9\xc8\xe5;\x9e\xed\xa0\...' Int8 BSON: b'\x03\x00"Gy\x01\xeb\xec\xfc\x80\xe4a\x7f\x88\x...' Int1 BSON: b'\x10\x00\xe4Z\x82\xb6\xad\xec\x10-\x9a\x99;?j\...' Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy. Float32 BSON: b'\'\x00\xef\xc8m\xbcJ\x06\x13\xbcv\xe9\xfe;...' Int8 BSON: b'\x03\x00\xe8\xa2\x7fIE\xba\x9f\xfaT2\xf1\xc1\...' Int1 BSON: b'\x10\x009\x96 \xb7\x8e\xc9\x81\xaf\xaa\x9f\xa...' Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love. Float32 BSON: b'\'\x00\xee\xfbn\xbc\xa0\x15\x15=<\xf3\xd3x...' Int8 BSON: b'\x03\x00\xe74\xa2\xe5\x15\x165\xb9dM8C\xd7E\x...' Int1 BSON: b'\x10\x00N\xda\xd8\xb6\x03N\x98\xbd\xdaY\x1b| ...' 
Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer. Float32 BSON: b'\'\x002\x99\x06\xbb\x82\x00\xcf\xb9X\x98\x...' Int8 BSON: b'\x03\x00\x06\xbe\xe1.\x7f\x80\x04C\xd7e\x80\x...' Int1 BSON: b'\x10\x00\tR\x9a\xd6\x0c\xb1\x9a\xbc\x90\xf5\x...' Text: E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home. Float32 BSON: b'\'\x00\x14\x195<\xd4~\xfe;\xb3_\xfd\xbc \xe...' Int8 BSON: b'\x03\x00&\xd8\x80\x92\x01\x7f\xbfF\xd4\x10\xf0...' Int1 BSON: b'\x10\x00\xcd\x18\x92\x92\x8dJ\x92\xbd\x9a\xd3\...' Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action. Float32 BSON: b'\'\x00\x8a\xc3.;_\xd3\x9d\xbc\xf2\x89\xee\x...' Int8 BSON: b'\x03\x00\x11\x80\x1b5\xe9\x19\x80\x8f\xb1N\xda...' Int1 BSON: b"\x10\x00\x94R\xc2\xd2\x0f\xfa\x90\xbc\xd8\xd6\...' Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family. Float32 BSON: b'\'\x00\xe6\x1a\xa7\xb94\x91]\xbcs&\xec\xbc\...' Int8 BSON: b'\x03\x00\n\x93\x8e,n\xce\xe8\x9b@\x00\xf9\x7f\...' Int1 BSON: b'\x10\x00\x0c\x1a\x90\x97\x0f\x19\x80/\xba\x98\...' Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion. Float32 BSON: b'\'\x00\x7f\xdd\x1c<\xd9\x01\x98<1\xd2\xc0\xb...' Int8 BSON: b'\x03\x00"\xfb\xb1\x7f\xd3\xd6\x04\xbe\x80\xf9L\...' Int1 BSON: b'\x10\x00\xd4\x12\x90\xa6\x8by\x99\x8d\xa2\xbd\x...' Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude. Float32 BSON: b'\'\x00\x98l\x92\xbcDp\x16=\xf0@\x1f\xbc\xd0\...' Int8 BSON: b'\x03\x00\xdf5\x0b\xe3\xbf\xe5\xa5\xad\x7f\x02\x...' Int1 BSON: b'\x10\x00D\x9a\x91\x96\x07\xfa\x93\x8d\xb2D\x92]...' 
Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory. Float32 BSON: b'\'\x00\xb0\xfb\x0f\xbd\x9b\x02\xa5\xbbZ\x19\...' Int8 BSON: b'\x03\x00\xb4\xafW\xd9\xd7\xc3\x7f~QM\x86\x83\xf...' Int1 BSON: b'\x10\x00#\xde\x98\x96\x0e\xcc\x12\xf6\xbb\xdd2}...'
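The ingestion code in the next step passes a variable named docs to insert_many, but no step shown here defines it. A minimal sketch of how such a list could be assembled follows. The function name build_docs and the data field are hypothetical. The BSON-* field names mirror the ones used later on this page for the existing-collection example, and the sequential integer _id values are an assumption chosen to match the InsertManyResult output in the next step.

```python
def build_docs(texts, float32_vectors, int8_vectors, int1_vectors):
    """Pair each text with its three BSON vectors in one document."""
    docs = []
    rows = zip(texts, float32_vectors, int8_vectors, int1_vectors)
    for index, (text, f32, i8, i1) in enumerate(rows):
        docs.append({
            "_id": index,  # assumption: sequential integer ids
            "data": text,  # hypothetical field name for the source text
            "BSON-Float32-Embedding": f32,
            "BSON-Int8-Embedding": i8,
            "BSON-Int1-Embedding": i1,
        })
    return docs


# Stand-in byte strings keep the sketch runnable without a database.
demo = build_docs(
    ["first text", "second text"],
    [b"\x27", b"\x27"],
    [b"\x03", b"\x03"],
    [b"\x10", b"\x10"],
)
print([d["_id"] for d in demo])  # [0, 1]
```

In the notebook, a call such as docs = build_docs(sentences, bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings) would produce a list in the shape that insert_many expects.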
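Connect to the Atlas cluster to upload the data.
Paste the following code in your notebook, replace <connection-string> with your Atlas cluster's SRV connection string, and then run the code.
Note
The connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net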
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]

# Ingest data into Atlas
collection.insert_many(docs)
InsertManyResult([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], acknowledged=True)
To verify the vector embeddings, view them in the Atlas UI inside the sample_db.embeddings namespace in your cluster.
Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.
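Paste the following code into your notebook.
Use the following code to generate embeddings from new data.
Specifically, this code uses the get_embedding function that you defined and the MongoDB PyMongo driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection.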
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]

# Sample data
data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

# Ingest data into Atlas
inserted_doc_count = 0
for text in data:
    embedding = get_embedding(text)
    collection.insert_one({ "text": text, "embedding": embedding })
    inserted_doc_count += 1

print(f"Inserted {inserted_doc_count} documents.")
Inserted 3 documents.
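For larger datasets, one insert per document can mean many round trips. The following is a minimal sketch, not part of the original tutorial, that builds all of the documents first so that you could pass them to a single insert_many call. The build_documents helper and the fake_embedding stand-in are hypothetical names used only for illustration; in your notebook you would pass the get_embedding function that you defined.

```python
from typing import Callable, Dict, List, Sequence


def build_documents(
    texts: Sequence[str],
    embed: Callable[[str], List[float]],
) -> List[Dict[str, object]]:
    """Pair each text with its embedding, using the same document
    shape as the example above: {"text": ..., "embedding": ...}."""
    return [{"text": text, "embedding": embed(text)} for text in texts]


def fake_embedding(text: str) -> List[float]:
    """Hypothetical stand-in for get_embedding so the sketch runs offline."""
    return [float(len(text)), 0.0, 1.0]


sample_texts = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
]

batch = build_documents(sample_texts, fake_embedding)
print(f"Prepared {len(batch)} documents.")

# With a live connection you could then run:
# collection.insert_many(batch)
```

Building the batch separately from the insert also makes it easy to inspect the documents before anything is written to the cluster.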
Specify the connection string.
Replace <connection-string> with your Atlas cluster's SRV connection string.
Note
The connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
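A forgotten placeholder is an easy mistake to make at this step. The following sketch, offered as an assumption-laden convenience rather than part of the tutorial, checks that a string at least looks like the SRV format shown above before you pass it to MongoClient. The check_srv_connection_string function and the example host name are hypothetical.

```python
from urllib.parse import urlsplit


def check_srv_connection_string(uri: str) -> None:
    """Raise ValueError if uri does not look like an SRV connection string.

    This only checks the shape of the string. It does not contact the cluster.
    """
    if "<" in uri or ">" in uri:
        raise ValueError("Replace the <...> placeholders with real values.")
    parts = urlsplit(uri)
    if parts.scheme != "mongodb+srv":
        raise ValueError("Expected the mongodb+srv:// scheme.")
    if not parts.hostname:
        raise ValueError("The connection string has no host name.")


# Hypothetical values, for illustration only
check_srv_connection_string("mongodb+srv://user:pass@cluster0.example.mongodb.net")
print("Connection string format looks valid.")
```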
Run the code.
You can verify your vector embeddings by navigating to the sample_db.embeddings collection in your cluster and viewing them in the Atlas UI.
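Load the existing data.
Fetch data from your Atlas cluster and load it into your notebook. The following code gets the summary field from 50 documents in the sample_airbnb.listingsAndReviews collection.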
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]

# Define a filter to exclude documents with null or empty 'summary' fields
summary_filter = {"summary": {"$nin": [None, ""]}}

# Get a subset of documents in the collection
documents = collection.find(summary_filter, {'_id': 1, 'summary': 1}).limit(50)
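Generate, convert, and ingest the embeddings into your Atlas cluster.
The following code first generates float32, int8, and int1 embeddings for the data, then converts them to the BSON float32, int8, and int1 subtypes, and finally uploads them to your Atlas cluster.
Note
This operation might take a few minutes to complete.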
import pymongo
from bson.binary import Binary, BinaryVectorDtype

updated_doc_count = 0

for doc in documents:
    summary = doc["summary"]

    # Generate float32 and other embeddings
    float32_embeddings = get_embedding(summary, precision="float32")
    int8_embeddings = get_embedding(summary, precision="int8")
    int1_embeddings = get_embedding(summary, precision="ubinary")

    # Convert each embedding to a BSON vector
    bson_float32_embeddings = generate_bson_vector(float32_embeddings, BinaryVectorDtype.FLOAT32)
    bson_int8_embeddings = generate_bson_vector(int8_embeddings, BinaryVectorDtype.INT8)
    bson_int1_embeddings = generate_bson_vector(int1_embeddings, BinaryVectorDtype.PACKED_BIT)

    # Update each document with new embedding fields
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {
            "BSON-Float32-Embedding": bson_float32_embeddings,
            "BSON-Int8-Embedding": bson_int8_embeddings,
            "BSON-Int1-Embedding": bson_int1_embeddings
        }},
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents.")
Updated 50 documents.
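To see why you might store lower-precision vectors, it can help to compare payload sizes. The following sketch uses only the Python standard library and is an illustration, not the conversion that the tutorial's generate_bson_vector function performs. It assumes a 1024-dimension vector, assumes that the 1-bit form keeps only the sign of each value, and ignores any BSON header bytes. All function names are hypothetical.

```python
import struct
from typing import List


def float32_payload(vector: List[float]) -> bytes:
    """Pack each value as a little-endian 4-byte float."""
    return struct.pack(f"<{len(vector)}f", *vector)


def int8_payload(vector: List[int]) -> bytes:
    """Pack each value as a signed 1-byte integer (-128 to 127)."""
    return struct.pack(f"<{len(vector)}b", *vector)


def packed_bit_payload(vector: List[float]) -> bytes:
    """Keep one bit per dimension and pack 8 dimensions into each byte.

    Assumption for illustration: a value greater than 0 becomes bit 1,
    anything else becomes bit 0.
    """
    out = bytearray()
    for start in range(0, len(vector), 8):
        byte = 0
        for offset, value in enumerate(vector[start:start + 8]):
            if value > 0:
                byte |= 1 << (7 - offset)  # most significant bit first
        out.append(byte)
    return bytes(out)


dims = 1024
floats = [0.5 if i % 2 == 0 else -0.5 for i in range(dims)]
ints = [127 if value > 0 else -128 for value in floats]

print(len(float32_payload(floats)))     # 4096 bytes
print(len(int8_payload(ints)))          # 1024 bytes
print(len(packed_bit_payload(floats)))  # 128 bytes
```

Under these assumptions, the int8 payload is one quarter and the 1-bit payload one thirty-second of the float32 size, at the cost of precision.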
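Note
This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.
Paste the following code into your notebook.
Use the following code to generate embeddings from a field in an existing collection. Specifically, this code does the following:
Connects to your Atlas cluster.
Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
Generates an embedding from each document's summary field by using the get_embedding function that you defined.
Updates each document with a new embedding field that contains the embedding value by using the MongoDB PyMongo driver.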
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]

# Filter to exclude null or empty summary fields
filter = { "summary": {"$nin": [ None, "" ]} }

# Get a subset of documents in the collection
documents = collection.find(filter).limit(50)

# Update each document with a new embedding field
updated_doc_count = 0
for doc in documents:
    embedding = get_embedding(doc["summary"])
    collection.update_one(
        { "_id": doc["_id"] },
        { "$set": { "embedding": embedding } }
    )
    updated_doc_count += 1

print(f"Updated {updated_doc_count} documents.")
Updated 50 documents.
Specify the connection string.
Replace <connection-string> with your Atlas cluster's SRV connection string.
Note
The connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
Run the code.
You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI and expanding the fields in a document.
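Besides inspecting documents in the UI, you could check programmatically that every stored embedding has the length your index will expect (this tutorial uses 1024 dimensions for the open-source model and 1536 for the OpenAI model). The following is a minimal sketch that works on plain dictionaries; the find_bad_embeddings helper is a hypothetical name, and in practice you would pass it the documents returned by a find call.

```python
from typing import Dict, Iterable, List


def find_bad_embeddings(docs: Iterable[Dict], field: str, expected_dims: int) -> List:
    """Return the _id of every document whose embedding is missing
    or does not have exactly expected_dims values."""
    bad_ids = []
    for doc in docs:
        embedding = doc.get(field)
        if not isinstance(embedding, list) or len(embedding) != expected_dims:
            bad_ids.append(doc.get("_id"))
    return bad_ids


# Toy documents, for illustration only
docs_to_check = [
    {"_id": 1, "embedding": [0.1] * 1024},
    {"_id": 2, "embedding": [0.1] * 1536},
    {"_id": 3},
]

print(find_bad_embeddings(docs_to_check, "embedding", 1024))  # [2, 3]
```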
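Create Embeddings for Queries
In this section, you index the vector embeddings in your collection and create an embedding that you use to run a sample vector search query.
When you run the query, Atlas Vector Search returns the documents whose embeddings are closest in distance to the embedding from your vector search query. This indicates that they are similar in meaning.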
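To make the idea of "closest" concrete, the following sketch ranks a few toy three-dimensional vectors against a query vector by raw dot product. It is an illustration only: the vectors and names are made up, real embeddings have far more dimensions, and Atlas Vector Search applies its own score normalization, which this sketch does not reproduce.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors with the same number of dimensions."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions.")
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_product(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, dot(query, vector)) for label, vector in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, for illustration only
toy_embeddings = {
    "ship": [0.9, 0.1, 0.0],
    "lion": [0.1, 0.9, 0.0],
    "space": [0.2, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.1]

for label, score in rank_by_dot_product(query_vector, toy_embeddings):
    print(label, round(score, 2))
```

The vector that points in nearly the same direction as the query ("ship" here) gets the highest score, which is the behavior the index relies on at much higher dimensionality.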
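Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.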
Paste the following code to add the CreateVectorIndex function to the DataService class in DataService.cs.
DataService.cs
namespace MyCompany.Embeddings;

using MongoDB.Driver;
using MongoDB.Bson;

public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");

    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }

    public void CreateVectorIndex()
    {
        try
        {
            var searchIndexView = Collection.SearchIndexes;
            var name = "vector_index";
            var type = SearchIndexType.VectorSearch;

            var definition = new BsonDocument
            {
                { "fields", new BsonArray
                    {
                        new BsonDocument
                        {
                            { "type", "vector" },
                            { "path", "embedding" },
                            { "numDimensions", <dimensions> },
                            { "similarity", "dotProduct" }
                        }
                    }
                }
            };

            var model = new CreateSearchIndexModel(name, type, definition);
            searchIndexView.CreateOne(model);
            Console.WriteLine($"New search index named {name} is building.");

            // Polling for index status
            Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
            bool queryable = false;
            while (!queryable)
            {
                var indexes = searchIndexView.List();
                foreach (var index in indexes.ToEnumerable())
                {
                    if (index["name"] == name)
                    {
                        queryable = index["queryable"].AsBoolean;
                    }
                }
                if (!queryable)
                {
                    Thread.Sleep(5000);
                }
            }
            Console.WriteLine($"{name} is ready for querying.");
        }
        catch (Exception e)
        {
            Console.WriteLine($"Exception: {e.Message}");
        }
    }
}
If you use the open-source model, replace the <dimensions> placeholder value with 1024. If you use the OpenAI model, replace the placeholder value with 1536.
Update the code in Program.cs. Remove the code that populated the initial documents and replace it with the following code to create the index:
Program.cs
using MyCompany.Embeddings;

var dataService = new DataService();
dataService.CreateVectorIndex();
Save the file, and then compile and run your project to create the index:
dotnet run MyCompany.Embeddings New search index named vector_index is building. Polling to check if the index is ready. This may take up to a minute. vector_index is ready for querying.
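Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
For more information, see Create an Atlas Vector Search Index.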
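The index definition used in this step is an ordinary document, so you can build and sanity-check it before sending it to the server. The following Python sketch mirrors the fields shown in the code above (type, path, numDimensions, and similarity). The vector_index_definition helper is a hypothetical name, and the validation rules are a convenience added here, not server behavior.

```python
from typing import Dict

# Similarity functions accepted in a vector index field definition
VALID_SIMILARITIES = {"euclidean", "cosine", "dotProduct"}


def vector_index_definition(path: str, num_dimensions: int, similarity: str) -> Dict:
    """Build a vector index definition with a single vector field."""
    if similarity not in VALID_SIMILARITIES:
        raise ValueError(f"similarity must be one of {sorted(VALID_SIMILARITIES)}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# 1024 matches the open-source model used in this tutorial
definition = vector_index_definition("embedding", 1024, "dotProduct")
print(definition)
```

Catching a misspelled similarity value or a wrong dimension count locally is usually quicker than waiting for an index build to fail.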
Create embeddings for vector search queries and run a query.
Paste the following code to add the PerformVectorQuery function to the DataService class in DataService.cs. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN (exact nearest neighbor) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
DataService.cs
namespace MyCompany.Embeddings;

using MongoDB.Driver;
using MongoDB.Bson;

public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");

    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }

    public void CreateVectorIndex()
    {
        // Method details...
    }

    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
        var vectorSearchStage = new BsonDocument
        {
            {
                "$vectorSearch",
                new BsonDocument
                {
                    { "index", "vector_index" },
                    { "path", "embedding" },
                    { "queryVector", new BsonArray(vector) },
                    { "exact", true },
                    { "limit", 5 }
                }
            }
        };

        var projectStage = new BsonDocument
        {
            {
                "$project",
                new BsonDocument
                {
                    { "_id", 0 },
                    { "text", 1 },
                    { "score", new BsonDocument { { "$meta", "vectorSearchScore"} } }
                }
            }
        };

        var pipeline = new[] { vectorSearchStage, projectStage };
        return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
}
Update the code in Program.cs. Remove the code that created the vector index and add code to perform the query:
Program.cs
using MongoDB.Bson;
using MyCompany.Embeddings;

var aiService = new AIService();
var queryString = "ocean tragedy";
var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);

if (!queryEmbedding.Any())
{
    Console.WriteLine("No embeddings found.");
}
else
{
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
        Console.WriteLine("No documents matched the query.");
    }
    else
    {
        foreach (var document in matchingDocuments)
        {
            Console.WriteLine(document.ToJson());
        }
    }
}
Save the file, and then compile and run your project to perform the query:
dotnet run MyCompany.Embeddings.csproj { "text" : "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "score" : 100.17414855957031 } { "text" : "Avatar: A marine is dispatched to the moon Pandora on a unique mission", "score" : 65.705635070800781 } { "text" : "The Lion King: Lion cub and future king Simba searches for his identity", "score" : 52.486415863037109 }
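The same two-stage pipeline can be written as plain Python data, which may make its structure easier to see. The following sketch builds the $vectorSearch and $project stages with the same fields as the code above. The build_vector_search_pipeline function and its parameter names are hypothetical, and the short query vector is a placeholder for a real embedding.

```python
from typing import Dict, List, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    index: str = "vector_index",
    path: str = "embedding",
    limit: int = 5,
    projected_field: str = "text",
) -> List[Dict]:
    """Return an aggregation pipeline that runs an exact (ENN) vector
    search and projects one field plus the search score."""
    vector_search_stage = {
        "$vectorSearch": {
            "index": index,
            "path": path,
            "queryVector": list(query_vector),
            "exact": True,
            "limit": limit,
        }
    }
    project_stage = {
        "$project": {
            "_id": 0,
            projected_field: 1,
            "score": {"$meta": "vectorSearchScore"},
        }
    }
    return [vector_search_stage, project_stage]


# Placeholder vector; a real query vector comes from your embedding function
pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3])
print(pipeline)

# With a live connection you could then run:
# results = list(collection.aggregate(pipeline))
```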
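Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and the similarity measure as dotProduct.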
Paste the following code to add the CreateVectorIndex function to the DataService class in DataService.cs.
DataService.cs
namespace MyCompany.Embeddings;

using MongoDB.Driver;
using MongoDB.Bson;

public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");

    public List<BsonDocument>? GetDocuments()
    {
        // Method details...
    }

    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }

    public void CreateVectorIndex()
    {
        try
        {
            var searchIndexView = Collection.SearchIndexes;
            var name = "vector_index";
            var type = SearchIndexType.VectorSearch;

            var definition = new BsonDocument
            {
                { "fields", new BsonArray
                    {
                        new BsonDocument
                        {
                            { "type", "vector" },
                            { "path", "embeddings" },
                            { "numDimensions", <dimensions> },
                            { "similarity", "dotProduct" }
                        }
                    }
                }
            };

            var model = new CreateSearchIndexModel(name, type, definition);
            searchIndexView.CreateOne(model);
            Console.WriteLine($"New search index named {name} is building.");

            // Polling for index status
            Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
            bool queryable = false;
            while (!queryable)
            {
                var indexes = searchIndexView.List();
                foreach (var index in indexes.ToEnumerable())
                {
                    if (index["name"] == name)
                    {
                        queryable = index["queryable"].AsBoolean;
                    }
                }
                if (!queryable)
                {
                    Thread.Sleep(5000);
                }
            }
            Console.WriteLine($"{name} is ready for querying.");
        }
        catch (Exception e)
        {
            Console.WriteLine($"Exception: {e.Message}");
        }
    }
}
If you use the open-source model, replace the <dimensions> placeholder value with 1024. If you use the OpenAI model, replace the placeholder value with 1536.
Update the code in Program.cs. Remove the code that added embeddings to existing documents and replace it with the following code to create the index:
Program.cs
using MyCompany.Embeddings;

var dataService = new DataService();
dataService.CreateVectorIndex();
Save the file, and then compile and run your project to create the index:
dotnet run MyCompany.Embeddings New search index named vector_index is building. Polling to check if the index is ready. This may take up to a minute. vector_index is ready for querying.
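Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
For more information, see Create an Atlas Vector Search Index.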
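The polling loops in this tutorial wait indefinitely for the index to become queryable. As one possible variation, the following sketch adds a timeout. It is written against a hypothetical list_indexes callable that returns index descriptions as dictionaries with name and queryable keys (the same two fields that the code above reads), so it runs here against canned responses rather than a live cluster.

```python
import time
from typing import Callable, Dict, Iterable


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[Dict]],
    name: str,
    interval_s: float = 5.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the named index reports queryable=True.

    Returns True when the index is ready and False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in list_indexes():
            if index.get("name") == name and index.get("queryable") is True:
                return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Canned responses standing in for a real listing call
responses = iter([
    [{"name": "vector_index", "queryable": False}],
    [{"name": "vector_index", "queryable": True}],
])

ready = wait_until_queryable(lambda: next(responses), "vector_index", interval_s=0.0)
print(ready)  # True
```

Returning False on timeout lets the caller decide whether to retry, log a warning, or stop, instead of blocking the program forever.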
Create embeddings for vector search queries and run a query.
Paste the following code to add the PerformVectorQuery function to the DataService class in DataService.cs. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN (exact nearest neighbor) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
DataService.cs
namespace MyCompany.Embeddings;

using MongoDB.Driver;
using MongoDB.Bson;

public class DataService
{
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");

    public List<BsonDocument>? GetDocuments()
    {
        // Method details...
    }

    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
        // Method details...
    }

    public void CreateVectorIndex()
    {
        // Method details...
    }

    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
        var vectorSearchStage = new BsonDocument
        {
            {
                "$vectorSearch",
                new BsonDocument
                {
                    { "index", "vector_index" },
                    { "path", "embeddings" },
                    { "queryVector", new BsonArray(vector) },
                    { "exact", true },
                    { "limit", 5 }
                }
            }
        };

        var projectStage = new BsonDocument
        {
            {
                "$project",
                new BsonDocument
                {
                    { "_id", 0 },
                    { "summary", 1 },
                    { "score", new BsonDocument { { "$meta", "vectorSearchScore"} } }
                }
            }
        };

        var pipeline = new[] { vectorSearchStage, projectStage };
        return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
}
Update the code in Program.cs. Remove the code that created the vector index and add code to perform the query:
Program.cs
using MongoDB.Bson;
using MyCompany.Embeddings;

var aiService = new AIService();
var queryString = "beach house";
var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);

if (!queryEmbedding.Any())
{
    Console.WriteLine("No embeddings found.");
}
else
{
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
        Console.WriteLine("No documents matched the query.");
    }
    else
    {
        foreach (var document in matchingDocuments)
        {
            Console.WriteLine(document.ToJson());
        }
    }
}
Save the file, and then compile and run your project to perform the query:
dotnet run MyCompany.Embeddings.csproj { "summary" : "Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.", "score" : 88.884147644042969 } { "summary" : "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score" : 86.136398315429688 } { "summary" : "Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.", "score" : 86.087783813476562 } { "summary" : "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score" : 85.689559936523438 } { "summary" : "Fully furnished 3+1 flat decorated with vintage style. Located at the heart of Moda/Kadıköy, close to seaside and also to the public transportation (tram, metro, ferry, bus stations) 10 minutes walk.", "score" : 85.614166259765625 }
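Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.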
Create a file named create-index.go and paste the following code.
create-index.go
package main

import (
    "context"
    "log"
    "os"
    "time"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    // Load environment variables from a .env file, if present
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")

    type vectorDefinitionField struct {
        Type          string `bson:"type"`
        Path          string `bson:"path"`
        NumDimensions int    `bson:"numDimensions"`
        Similarity    string `bson:"similarity"`
    }

    type vectorDefinition struct {
        Fields []vectorDefinitionField `bson:"fields"`
    }

    indexModel := mongo.SearchIndexModel{
        Definition: vectorDefinition{
            Fields: []vectorDefinitionField{{
                Type:          "vector",
                Path:          "embedding",
                NumDimensions: <dimensions>,
                Similarity:    "dotProduct"}},
        },
        Options: opts,
    }

    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
        log.Fatalf("failed to create the search index: %v", err)
    }

    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
        cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
        if err != nil {
            // Stop here: the cursor is not usable when List returns an error
            log.Fatalf("failed to list search indexes: %v", err)
        }

        if !cursor.Next(ctx) {
            break
        }

        name := cursor.Current.Lookup("name").StringValue()
        queryable := cursor.Current.Lookup("queryable").Boolean()
        if name == searchIndexName && queryable {
            doc = cursor.Current
        } else {
            time.Sleep(5 * time.Second)
        }
    }

    log.Println("Name of Index Created: " + searchIndexName)
}
If you use the open-source model, replace the <dimensions> placeholder value with 1024. If you use the OpenAI model, replace the placeholder value with 1536.
Save the file, and then run the following command:
go run create-index.go 2024/10/09 17:38:51 Creating the index. 2024/10/09 17:38:52 Polling to confirm successful index creation. 2024/10/09 17:39:22 Name of Index Created: vector_index
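Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
For more information, see Create an Atlas Vector Search Index.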
Create embeddings for vector search queries and run a query.
Create a file named vector-query.go and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN (exact nearest neighbor) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
vector-query.go
package main

import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

type TextAndScore struct {
    Text  string  `bson:"text"`
    Score float32 `bson:"score"`
}

func main() {
    ctx := context.Background()

    // Load environment variables from a .env file, if present
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")

    query := "ocean tragedy"
    queryEmbedding := common.GetEmbeddings([]string{query})

    pipeline := mongo.Pipeline{
        bson.D{
            {"$vectorSearch", bson.D{
                {"queryVector", queryEmbedding[0]},
                {"index", "vector_index"},
                {"path", "embedding"},
                {"exact", true},
                {"limit", 5},
            }},
        },
        bson.D{
            {"$project", bson.D{
                {"_id", 0},
                {"text", 1},
                {"score", bson.D{
                    {"$meta", "vectorSearchScore"},
                }},
            }},
        },
    }

    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
        log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()

    var matchingDocs []TextAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
        log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
        fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
    }
}
A second version of vector-query.go is identical except that the Score field is declared as float64 instead of float32:
type TextAndScore struct {
    Text  string  `bson:"text"`
    Score float64 `bson:"score"`
}
Save the file, and then run the following command:
go run vector-query.go
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.0042472864
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.0031167597
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.0024476869
go run vector-query.go
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.4552372694015503
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.4050072133541107
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.35942140221595764
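Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and the similarity measure as dotProduct.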
Create a file named create-index.go and paste the following code.
create-index.go
package main

import (
    "context"
    "log"
    "os"
    "time"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    ctx := context.Background()

    // Load environment variables from a .env file, if present
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")

    type vectorDefinitionField struct {
        Type          string `bson:"type"`
        Path          string `bson:"path"`
        NumDimensions int    `bson:"numDimensions"`
        Similarity    string `bson:"similarity"`
        // Declared so that the Quantization value set below compiles
        Quantization  string `bson:"quantization,omitempty"`
    }

    type vectorDefinition struct {
        Fields []vectorDefinitionField `bson:"fields"`
    }

    indexModel := mongo.SearchIndexModel{
        Definition: vectorDefinition{
            Fields: []vectorDefinitionField{{
                Type:          "vector",
                Path:          "embeddings",
                NumDimensions: <dimensions>,
                Similarity:    "dotProduct",
                Quantization:  "scalar"}},
        },
        Options: opts,
    }

    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
        log.Fatalf("failed to create the search index: %v", err)
    }

    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
        cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
        if err != nil {
            // Stop here: the cursor is not usable when List returns an error
            log.Fatalf("failed to list search indexes: %v", err)
        }

        if !cursor.Next(ctx) {
            break
        }

        name := cursor.Current.Lookup("name").StringValue()
        queryable := cursor.Current.Lookup("queryable").Boolean()
        if name == searchIndexName && queryable {
            doc = cursor.Current
        } else {
            time.Sleep(5 * time.Second)
        }
    }

    log.Println("Name of Index Created: " + searchIndexName)
}
If you use the open-source model, replace the <dimensions> placeholder value with 1024. If you use the OpenAI model, replace the placeholder value with 1536.
Save the file, and then run the following command:
go run create-index.go 2024/10/10 10:03:12 Creating the index. 2024/10/10 10:03:13 Polling to confirm successful index creation. 2024/10/10 10:03:44 Name of Index Created: vector_index
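Note
The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
For more information, see Create an Atlas Vector Search Index.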
Create embeddings for vector search queries and run a query.
Create a file named vector-query.go and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN (exact nearest neighbor) search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.
vector-query.go
package main

import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"

    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

type SummaryAndScore struct {
    Summary string  `bson:"summary"`
    Score   float32 `bson:"score"`
}

func main() {
    ctx := context.Background()

    // Load environment variables from a .env file, if present
    if err := godotenv.Load(); err != nil {
        log.Println("no .env file found")
    }

    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
        log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
        log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()

    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")

    query := "beach house"
    queryEmbedding := common.GetEmbeddings([]string{query})

    pipeline := mongo.Pipeline{
        bson.D{
            {"$vectorSearch", bson.D{
                {"queryVector", queryEmbedding[0]},
                {"index", "vector_index"},
                {"path", "embeddings"},
                {"exact", true},
                {"limit", 5},
            }},
        },
        bson.D{
            {"$project", bson.D{
                {"_id", 0},
                {"summary", 1},
                {"score", bson.D{
                    {"$meta", "vectorSearchScore"},
                }},
            }},
        },
    }

    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
        log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()

    var matchingDocs []SummaryAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
        log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
        fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
    }
}
A second version of vector-query.go is identical except that the Score field is declared as float64 instead of float32:
type SummaryAndScore struct {
    Summary string  `bson:"summary"`
    Score   float64 `bson:"score"`
}
Save the file, and then run the following command:
go run vector-query.go Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living. Score: 0.0045180833 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.004480799 Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway. Score: 0.0042421296 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.004227752 Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. Score: 0.0042201905 go run vector-query.go Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. 
Score: 0.4832950830459595 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.48093676567077637 Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!! Score: 0.4629695415496826 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.45800843834877014 Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2. Score: 0.45398443937301636
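Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and dotProduct as the similarity measure.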
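The index definition described above is a small document with one entry per vector field. The following Python sketch builds such a document and validates its inputs before you send it to Atlas. The helper name `build_vector_index_definition` is hypothetical and exists only for illustration; the field names (`type`, `path`, `numDimensions`, `similarity`, `quantization`) are the ones used in the code samples on this page.

```python
# Similarity measures named on this page.
VALID_SIMILARITIES = {"euclidean", "cosine", "dotProduct"}


def build_vector_index_definition(path, num_dimensions,
                                  similarity="dotProduct", quantization=None):
    """Return an index definition for a single vector field (illustrative helper)."""
    if similarity not in VALID_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")

    field = {
        "type": "vector",
        "path": path,
        "numDimensions": num_dimensions,
        "similarity": similarity,
    }
    # Only include the optional setting when the caller asks for it.
    if quantization is not None:
        field["quantization"] = quantization
    return {"fields": [field]}


# 1024 dimensions corresponds to the mxbai-embed-large-v1 model used on this page.
definition = build_vector_index_definition("embedding", 1024)
print(definition)
```

Validating the definition locally is optional, but it can catch a misspelled similarity name before the index build starts.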
Create a file named CreateIndex.java and paste the following code:

CreateIndex.java

import com.mongodb.MongoException;
import com.mongodb.client.ListSearchIndexesIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Collections;
import java.util.List;

public class CreateIndex {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");

            // define the index details
            String indexName = "vector_index";
            int dimensionsHuggingFaceModel = 1024;
            int dimensionsOpenAiModel = 1536;
            Bson definition = new Document(
                "fields",
                Collections.singletonList(
                    new Document("type", "vector")
                        .append("path", "embedding")
                        .append("numDimensions", <dimensions>) // replace with var for the model used
                        .append("similarity", "dotProduct")));

            // define the index model using the specified details
            SearchIndexModel indexModel = new SearchIndexModel(
                indexName, definition, SearchIndexType.vectorSearch());

            // create the index using the model
            try {
                List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
                System.out.println("Successfully created a vector index named: " + result);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }

            // wait for Atlas to build the index and make it queryable
            System.out.println("Polling to confirm the index has completed building.");
            System.out.println("It may take up to a minute for the index to build before you can query using it.");
            ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
            Document doc = null;
            while (doc == null) {
                try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
                    if (!cursor.hasNext()) {
                        break;
                    }
                    Document current = cursor.next();
                    String name = current.getString("name");
                    boolean queryable = current.getBoolean("queryable");
                    if (name.equals(indexName) && queryable) {
                        doc = current;
                    } else {
                        // index is not ready yet; pause before polling again
                        Thread.sleep(500);
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            System.out.println(indexName + " index is ready to query");
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Replace the <dimensions> placeholder value with the appropriate variable for the model you use:
dimensionsHuggingFaceModel: 1024 dimensions ("mixedbread-ai/mxbai-embed-large-v1" model)
dimensionsOpenAiModel: 1536 dimensions ("text-embedding-3-small" model)
Save and run the file. The output resembles the following:
Successfully created a vector index named: [vector_index]
Polling to confirm the index has completed building.
It may take up to a minute for the index to build before you can query using it.
vector_index index is ready to query
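Note
Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.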
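Because the index is not queryable until the build finishes, a script usually polls the index status before it runs a query. The following Python sketch shows one way to write that wait with a timeout. The function `wait_until_queryable` and its `list_indexes` callable are hypothetical names introduced for illustration; with PyMongo you could pass something like `lambda: collection.list_search_indexes()`, which is an assumption about your setup rather than part of this tutorial's code.

```python
import time


def wait_until_queryable(list_indexes, index_name,
                         timeout_s=120.0, interval_s=0.5, sleep=time.sleep):
    """Poll until the named index reports queryable, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Each call returns the current list of index description documents.
        for index in list_indexes():
            if index.get("name") == index_name and index.get("queryable"):
                return index
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{index_name} was not queryable after {timeout_s}s")
        sleep(interval_s)


# Stand-in for a real listing call: not ready on the first poll, ready on the second.
_states = iter([
    [{"name": "vector_index", "queryable": False}],
    [{"name": "vector_index", "queryable": True}],
])
ready = wait_until_queryable(lambda: next(_states), "vector_index",
                             sleep=lambda seconds: None)
print(ready["name"], "index is ready to query")
```

Unlike a loop that inspects only the first index returned, this sketch scans every index in the listing and gives up after a deadline instead of waiting indefinitely.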
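Create embeddings for vector search queries and run a query.
Create a file named VectorQuery.java and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.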
VectorQuery.java

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.search.FieldSearchPath;
import org.bson.BsonArray;
import org.bson.BsonValue;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.ArrayList;
import java.util.List;

import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
import static java.util.Arrays.asList;

public class VectorQuery {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_db");
            MongoCollection<Document> collection = database.getCollection("embeddings");

            // define the query and get the embedding
            String query = "ocean tragedy";
            EmbeddingProvider embeddingProvider = new EmbeddingProvider();
            BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
            List<Double> embedding = new ArrayList<>();
            // BsonArray is a List<BsonValue>, so it can be iterated directly
            for (BsonValue value : embeddingBsonArray) {
                embedding.add(value.asDouble().getValue());
            }

            // define $vectorSearch pipeline
            String indexName = "vector_index";
            FieldSearchPath fieldSearchPath = fieldPath("embedding");
            int limit = 5;
            List<Bson> pipeline = asList(
                vectorSearch(
                    fieldSearchPath,
                    embedding,
                    indexName,
                    limit,
                    exactVectorSearchOptions()),
                project(
                    fields(exclude("_id"), include("text"), metaVectorSearchScore("score"))));

            // run query and print results
            List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
            if (results.isEmpty()) {
                System.out.println("No results found.");
            } else {
                results.forEach(doc -> {
                    System.out.println("Text: " + doc.getString("text"));
                    System.out.println("Score: " + doc.getDouble("score"));
                });
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Save and run the file. The output resembles one of the following, depending on the model you use:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.004247286356985569
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.003116759704425931
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.002447686856612563

Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Score: 0.45522359013557434
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Score: 0.4049977660179138
Text: The Lion King: Lion cub and future king Simba searches for his identity
Score: 0.35942474007606506
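The query steps above map onto a two-stage aggregation pipeline. The following Python sketch assembles that pipeline as plain dictionaries so that you can inspect it before you run it. The helper name `build_enn_pipeline` and its default arguments are hypothetical; the stage and field names match the pipelines used on this page.

```python
def build_enn_pipeline(query_vector, path="embedding", index="vector_index",
                       limit=5, projected_field="text"):
    """Build an exact (ENN) $vectorSearch pipeline followed by a $project stage."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "queryVector": list(query_vector),
                "path": path,
                "exact": True,  # exact search over all indexed vectors
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                projected_field: 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# A three-element vector keeps the example short; a real query vector must
# have as many dimensions as the index definition specifies.
pipeline = build_enn_pipeline([0.1, 0.2, 0.3])
print(pipeline[0]["$vectorSearch"]["path"], pipeline[0]["$vectorSearch"]["limit"])
```

With PyMongo, a pipeline like this would be passed to `collection.aggregate(pipeline)`, as the Python samples on this page do.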
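Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and dotProduct as the similarity measure.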
Create a file named CreateIndex.java and paste the following code:

CreateIndex.java

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.ListSearchIndexesIterable;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.SearchIndexModel;
import com.mongodb.client.model.SearchIndexType;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.Collections;
import java.util.List;

public class CreateIndex {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
            MongoCollection<Document> collection = database.getCollection("listingsAndReviews");

            // define the index details
            String indexName = "vector_index";
            int dimensionsHuggingFaceModel = 1024;
            int dimensionsOpenAiModel = 1536;
            Bson definition = new Document(
                "fields",
                Collections.singletonList(
                    new Document("type", "vector")
                        .append("path", "embeddings")
                        .append("numDimensions", <dimensions>) // replace with var for the model used
                        .append("similarity", "dotProduct")
                        .append("quantization", "scalar")));

            // define the index model using the specified details
            SearchIndexModel indexModel = new SearchIndexModel(
                indexName, definition, SearchIndexType.vectorSearch());

            // create the index using the model
            try {
                List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
                System.out.println("Successfully created a vector index named: " + result);
                System.out.println("It may take up to a minute for the index to build before you can query using it.");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }

            // wait for Atlas to build the index and make it queryable
            System.out.println("Polling to confirm the index has completed building.");
            ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
            Document doc = null;
            while (doc == null) {
                try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
                    if (!cursor.hasNext()) {
                        break;
                    }
                    Document current = cursor.next();
                    String name = current.getString("name");
                    boolean queryable = current.getBoolean("queryable");
                    if (name.equals(indexName) && queryable) {
                        doc = current;
                    } else {
                        // index is not ready yet; pause before polling again
                        Thread.sleep(500);
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            System.out.println(indexName + " index is ready to query");
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Replace the <dimensions> placeholder value with the appropriate variable for the model you use:
dimensionsHuggingFaceModel: 1024 dimensions (open-source)
dimensionsOpenAiModel: 1536 dimensions
Save and run the file. The output resembles the following:
Successfully created a vector index named: [vector_index]
Polling to confirm the index has completed building.
It may take up to a minute for the index to build before you can query using it.
vector_index index is ready to query
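Note
Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.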
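Create embeddings for vector search queries and run a query.
Create a file named VectorQuery.java and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.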
VectorQuery.java

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.search.FieldSearchPath;
import org.bson.BsonArray;
import org.bson.BsonValue;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.ArrayList;
import java.util.List;

import static com.mongodb.client.model.Aggregates.project;
import static com.mongodb.client.model.Aggregates.vectorSearch;
import static com.mongodb.client.model.Projections.exclude;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;
import static com.mongodb.client.model.Projections.metaVectorSearchScore;
import static com.mongodb.client.model.search.SearchPath.fieldPath;
import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
import static java.util.Arrays.asList;

public class VectorQuery {
    public static void main(String[] args) {
        String uri = System.getenv("ATLAS_CONNECTION_STRING");
        if (uri == null || uri.isEmpty()) {
            throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
        }

        // establish connection and set namespace
        try (MongoClient mongoClient = MongoClients.create(uri)) {
            MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
            MongoCollection<Document> collection = database.getCollection("listingsAndReviews");

            // define the query and get the embedding
            String query = "beach house";
            EmbeddingProvider embeddingProvider = new EmbeddingProvider();
            BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
            List<Double> embedding = new ArrayList<>();
            // BsonArray is a List<BsonValue>, so it can be iterated directly
            for (BsonValue value : embeddingBsonArray) {
                embedding.add(value.asDouble().getValue());
            }

            // define $vectorSearch pipeline
            String indexName = "vector_index";
            FieldSearchPath fieldSearchPath = fieldPath("embeddings");
            int limit = 5;
            List<Bson> pipeline = asList(
                vectorSearch(
                    fieldSearchPath,
                    embedding,
                    indexName,
                    limit,
                    exactVectorSearchOptions()),
                project(
                    fields(exclude("_id"), include("summary"), metaVectorSearchScore("score"))));

            // run query and print results
            List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
            if (results.isEmpty()) {
                System.out.println("No results found.");
            } else {
                results.forEach(doc -> {
                    System.out.println("Summary: " + doc.getString("summary"));
                    System.out.println("Score: " + doc.getDouble("score"));
                });
            }
        } catch (MongoException me) {
            throw new RuntimeException("Failed to connect to MongoDB ", me);
        } catch (Exception e) {
            throw new RuntimeException("Operation failed: ", e);
        }
    }
}

Save and run the file. The output resembles one of the following, depending on the model you use:
Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living. Score: 0.004518083296716213 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.0044807991944253445 Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway. Score: 0.004242129623889923 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.004227751865983009 Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. Score: 0.004220190457999706 Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses. 
Score: 0.4832950830459595 Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily. Score: 0.48092085123062134 Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!! Score: 0.4629460275173187 Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door. Score: 0.4581468403339386 Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2. Score: 0.45398443937301636
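An ENN (exact nearest neighbor) search compares the query vector against every indexed vector and keeps the best matches. The following Python sketch illustrates that idea on a tiny in-memory dataset using the dot product. It is a conceptual illustration only: the function name, the listing names, and the two-dimensional vectors are invented for the example, and this is not how Atlas executes the search internally.

```python
def exact_nearest_neighbors(query, documents, limit=5):
    """Score every document against the query and return the top matches."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Brute force: compute the similarity for every stored vector.
    scored = [(dot(query, vector), name) for name, vector in documents.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:limit]]


# Invented two-dimensional "embeddings" for illustration.
listings = {
    "beach cottage": [0.9, 0.1],
    "city loft": [0.1, 0.9],
    "seaside condo": [0.8, 0.3],
}
top = exact_nearest_neighbors([1.0, 0.0], listings, limit=2)
for name, score in top:
    print(name, score)
```

Because every vector is scored, the result is exact, at the cost of work that grows with the size of the collection.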
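Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and dotProduct as the similarity measure.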
Create a file named create-index.js and paste the following code.

create-index.js

import { MongoClient } from 'mongodb';

// connect to your Atlas deployment
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

async function run() {
  try {
    const database = client.db("sample_db");
    const collection = database.collection("embeddings");

    // define your Atlas Vector Search index
    const index = {
      name: "vector_index",
      type: "vectorSearch",
      definition: {
        "fields": [
          {
            "type": "vector",
            "path": "embedding",
            "similarity": "dotProduct",
            "numDimensions": <dimensions>
          }
        ]
      }
    }

    // run the helper method
    const result = await collection.createSearchIndex(index);
    console.log(result);
  } finally {
    await client.close();
  }
}
run().catch(console.dir);

Replace the <dimensions> placeholder value with 768 if you use the open-source model, or with 1536 if you use the OpenAI model.
Save the file, then run the following command:
node --env-file=.env create-index.js
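Note
Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.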
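The numDimensions value in the index must match the length of the vectors that your embedding model produces (768 or 1536 in the steps above). The following Python sketch shows a quick sanity check that you could run on a sample embedding before you query. The function name `check_dimensions` is hypothetical, and the zero-filled vector stands in for the output of a real embedding function.

```python
def check_dimensions(embedding, index_definition):
    """Raise if the embedding length differs from the index's numDimensions."""
    expected = index_definition["fields"][0]["numDimensions"]
    actual = len(embedding)
    if actual != expected:
        raise ValueError(
            f"embedding has {actual} dimensions but the index expects {expected}"
        )
    return actual


definition = {
    "fields": [
        {"type": "vector", "path": "embedding",
         "similarity": "dotProduct", "numDimensions": 768}
    ]
}
sample_embedding = [0.0] * 768  # placeholder for a real model output
print(check_dimensions(sample_embedding, definition))
```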
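Create embeddings for vector search queries and run a query.
Create a file named vector-query.js and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.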
vector-query.js

import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';

// MongoDB connection URI and options
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

async function run() {
  try {
    // Connect to the MongoDB client
    await client.connect();

    // Specify the database and collection
    const database = client.db("sample_db");
    const collection = database.collection("embeddings");

    // Generate embedding for the search query
    const queryEmbedding = await getEmbedding("ocean tragedy");

    // Define the sample vector search pipeline
    const pipeline = [
      {
        $vectorSearch: {
          index: "vector_index",
          queryVector: queryEmbedding,
          path: "embedding",
          exact: true,
          limit: 5
        }
      },
      {
        $project: {
          _id: 0,
          text: 1,
          score: { $meta: "vectorSearchScore" }
        }
      }
    ];

    // run pipeline
    const result = collection.aggregate(pipeline);

    // print results
    for await (const doc of result) {
      console.dir(JSON.stringify(doc));
    }
  } finally {
    await client.close();
  }
}
run().catch(console.dir);

Save the file, then run the following command:
node --env-file=.env vector-query.js
'{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.5103757977485657}'
'{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4616812467575073}'
'{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.4115804433822632}'

node --env-file=.env vector-query.js
{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
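The score values in the output above are normalized similarity values rather than raw dot products. The following Python sketch shows the normalization as we understand it from the Atlas Vector Search score documentation: (1 + similarity) / 2 for cosine and dotProduct, and 1 / (1 + distance) for euclidean. Treat these formulas as an assumption to verify against the current documentation; the function name `normalized_score` is hypothetical.

```python
import math


def normalized_score(v1, v2, similarity):
    """Map a raw similarity or distance to a score (assumed normalization)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    if similarity == "dotProduct":
        return (1 + dot) / 2
    if similarity == "cosine":
        norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
        return (1 + dot / norm) / 2
    if similarity == "euclidean":
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
        return 1 / (1 + distance)
    raise ValueError(f"unsupported similarity: {similarity!r}")


# Orthogonal unit vectors have a dot product of 0, which maps to 0.5.
orthogonal = normalized_score([1.0, 0.0], [0.0, 1.0], "dotProduct")
identical = normalized_score([1.0, 0.0], [1.0, 0.0], "dotProduct")
far_apart = normalized_score([0.0, 0.0], [3.0, 4.0], "euclidean")
print(orthogonal, identical, far_apart)
```

Under this assumption, a dotProduct score just below 0.5 indicates a raw dot product that is slightly negative.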
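Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embedding field as the vector type and dotProduct as the similarity measure.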
Create a file named create-index.js and paste the following code.

create-index.js

import { MongoClient } from 'mongodb';

// connect to your Atlas deployment
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

async function run() {
  try {
    const database = client.db("sample_airbnb");
    const collection = database.collection("listingsAndReviews");

    // Define your Atlas Vector Search index
    const index = {
      name: "vector_index",
      type: "vectorSearch",
      definition: {
        "fields": [
          {
            "type": "vector",
            "path": "embedding",
            "similarity": "dotProduct",
            "numDimensions": <dimensions>,
            "quantization": "scalar"
          }
        ]
      }
    }

    // Call the method to create the index
    const result = await collection.createSearchIndex(index);
    console.log(result);
  } finally {
    await client.close();
  }
}
run().catch(console.dir);

Replace the <dimensions> placeholder value with 768 if you use the open-source model, or with 1536 if you use the OpenAI model.
Save the file, then run the following command:
node --env-file=.env create-index.js
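Note
Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.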
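Create embeddings for vector search queries and run a query.
Create a file named vector-query.js and paste the following code. To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.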
vector-query.js

import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';

// MongoDB connection URI and options
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);

async function run() {
  try {
    // Connect to the MongoDB client
    await client.connect();

    // Specify the database and collection
    const database = client.db("sample_airbnb");
    const collection = database.collection("listingsAndReviews");

    // Generate embedding for the search query
    const queryEmbedding = await getEmbedding("beach house");

    // Define the sample vector search pipeline
    const pipeline = [
      {
        $vectorSearch: {
          index: "vector_index",
          queryVector: queryEmbedding,
          path: "embedding",
          exact: true,
          limit: 5
        }
      },
      {
        $project: {
          _id: 0,
          summary: 1,
          score: { $meta: "vectorSearchScore" }
        }
      }
    ];

    // run pipeline
    const result = collection.aggregate(pipeline);

    // print results
    for await (const doc of result) {
      console.dir(JSON.stringify(doc));
    }
  } finally {
    await client.close();
  }
}
run().catch(console.dir);

Save the file, then run the following command:
node --env-file=.env vector-query.js '{"summary":"Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.","score":0.5334879159927368}' '{"summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":0.5240535736083984}' '{"summary":"The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.","score":0.5232879519462585}' '{"summary":"Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... 
All common areas are cleaned daily.","score":0.5186381340026855}' '{"summary":"A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.","score":0.5078228116035461}' node --env-file=.env vector-query.js {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359} {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646} {"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605} {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792} {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. 
Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
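Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.
This code creates an index on your collection that specifies the following:
The BSON-Float32-Embedding, BSON-Int8-Embedding, and BSON-Int1-Embedding fields as the vector type fields.
euclidean as the similarity function for int1 embeddings, and dotProduct as the similarity function for float32 and int8 embeddings.
768 as the number of dimensions in the embeddings.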
from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition = {
        "fields": [
            {
                "type": "vector",
                "path": "BSON-Float32-Embedding",
                "similarity": "dotProduct",
                "numDimensions": 768
            },
            {
                "type": "vector",
                "path": "BSON-Int8-Embedding",
                "similarity": "dotProduct",
                "numDimensions": 768
            },
            {
                "type": "vector",
                "path": "BSON-Int1-Embedding",
                "similarity": "euclidean",
                "numDimensions": 768
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

Run the code.
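Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.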
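To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.
This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 1536 as the number of dimensions. It also enables automatic scalar quantization for the vectors in the embedding field so that your vector data is processed efficiently.

from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition = {
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "similarity": "dotProduct",
                "numDimensions": 1536,
                "quantization": "scalar"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

Run the code.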
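Building the index takes about one minute. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.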
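Scalar quantization, which the "quantization": "scalar" setting enables, stores each vector component in fewer bits than a 32-bit float. The following Python sketch illustrates the general idea by mapping a float vector onto the int8 range using its minimum and maximum values. This is a generic illustration under our own assumptions (min/max scaling, the hypothetical function name `scalar_quantize`); this page does not describe how Atlas performs quantization internally.

```python
def scalar_quantize(vector):
    """Map floats linearly onto the int8 range [-128, 127] (illustrative only)."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        # A constant vector carries no range information; map it to zeros.
        return [0] * len(vector)
    scale = 255 / (hi - lo)
    quantized = []
    for x in vector:
        # Shift into [0, 255], round to an integer, then re-center on zero.
        value = round((x - lo) * scale) - 128
        quantized.append(max(-128, min(127, value)))
    return quantized


quantized = scalar_quantize([-1.0, 0.0, 1.0])
print(quantized)
```

Each quantized component fits in one byte instead of four, which is where the storage saving comes from, at the cost of some precision.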
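Create embeddings for vector search queries, then run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Converts the query embedding to BSON float32, int8, and int1 vector subtypes.
Passes the embedding into the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN search on your embeddings.
Runs a sample vector search query against the embeddings stored in your collection.
Note
The query might take some time to complete.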
import pymongo
from bson.binary import Binary, BinaryVectorDtype

# Prepare your query
query_text = "ocean tragedy"

# Generate embedding for the search query
query_float32_embeddings = get_embedding(query_text, precision="float32")
query_int8_embeddings = get_embedding(query_text, precision="int8")
query_int1_embeddings = get_embedding(query_text, precision="ubinary")

# Convert each embedding to BSON format
query_bson_float32_embeddings = generate_bson_vector(query_float32_embeddings, BinaryVectorDtype.FLOAT32)
query_bson_int8_embeddings = generate_bson_vector(query_int8_embeddings, BinaryVectorDtype.INT8)
query_bson_int1_embeddings = generate_bson_vector(query_int1_embeddings, BinaryVectorDtype.PACKED_BIT)

# Define vector search pipeline for each precision
pipelines = []
for query_embedding, path in zip(
    [query_bson_float32_embeddings, query_bson_int8_embeddings, query_bson_int1_embeddings],
    ["BSON-Float32-Embedding", "BSON-Int8-Embedding", "BSON-Int1-Embedding"]
):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",  # Adjust if necessary
                "queryVector": query_embedding,
                "path": path,
                "exact": True,
                "limit": 5
            }
        },
        {
            "$project": {
                "_id": 0,
                "data": 1,
                "score": {
                    "$meta": "vectorSearchScore"
                }
            }
        }
    ]
    pipelines.append(pipeline)

# Execute the search for each precision
for pipeline in pipelines:
    print(f"Results for {pipeline[0]['$vectorSearch']['path']}:")
    results = collection.aggregate(pipeline)

    # Print results
    for i in results:
        print(i)
Results for BSON-Float32-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.7661113739013672}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.7050272822380066}
{'data': 'The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.', 'score': 0.7024770379066467}
{'data': 'Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.', 'score': 0.7011005282402039}
{'data': 'E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.', 'score': 0.6877288818359375}
Results for BSON-Int8-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.5}
{'data': 'The Lion King: Lion cub and future king Simba searches for his identity', 'score': 0.5}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.5}
{'data': "Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.", 'score': 0.5}
{'data': 'The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.', 'score': 0.5}
Results for BSON-Int1-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.671875}
{'data': 'The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.', 'score': 0.6484375}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.640625}
{'data': 'Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.', 'score': 0.6145833134651184}
{'data': 'Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.', 'score': 0.61328125}
# Generate embedding for the search query
query_embedding = get_embedding("ocean tragedy")

# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "exact": True,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "text": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
    print(i)

{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
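Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.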
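This code creates an index on your collection and specifies the following:
The BSON-Float32-Embedding, BSON-Int8-Embedding, and BSON-Int1-Embedding fields as the vector type fields.
euclidean as the similarity function for int1 embeddings, and dotProduct as the similarity function for float32 and int8 embeddings.
768 as the number of dimensions in the embeddings.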
from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "BSON-Float32-Embedding",
                "similarity": "dotProduct",
                "numDimensions": 768
            },
            {
                "type": "vector",
                "path": "BSON-Int8-Embedding",
                "similarity": "dotProduct",
                "numDimensions": 768
            },
            {
                "type": "vector",
                "path": "BSON-Int1-Embedding",
                "similarity": "euclidean",
                "numDimensions": 768
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

Run the code.
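The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.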
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Paste the following code into your notebook.
This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 1536 as the number of dimensions. It also enables automatic scalar quantization for the vectors in the embedding field so that vector data is processed efficiently.

from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "similarity": "dotProduct",
                "numDimensions": 1536,
                "quantization": "scalar"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=search_index_model)

Run the code.
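The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
To learn more, see Create an Atlas Vector Search Index.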
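Because the index is not queryable until the initial sync finishes, it can help to wait for it in code before you run a query. The following is a minimal sketch, assuming PyMongo 4.5 or later, where Collection.list_search_indexes() returns index documents that include a boolean queryable field. The helper name wait_until_queryable and its poll_seconds and timeout_seconds parameters are illustrative and are not part of the tutorial.

```python
import time


def wait_until_queryable(collection, index_name, poll_seconds=5, timeout_seconds=300):
    """Poll a search index until Atlas reports it as queryable.

    Assumption: `collection` behaves like a PyMongo Collection whose
    list_search_indexes(name) yields documents with a "queryable" field.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return indexes[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Index {index_name!r} was not queryable after {timeout_seconds} seconds."
            )
        # Wait before asking Atlas for the index status again.
        time.sleep(poll_seconds)
```

In a notebook you might call wait_until_queryable(collection, "vector_index") right after create_search_index() and only then run the query cell.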
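Create embeddings for vector search queries and run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline.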
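For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Converts the query embedding to the BSON float32, int8, and int1 vector subtypes.
Passes the embeddings in the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.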
import pymongo
from bson.binary import Binary, BinaryVectorDtype

# Prepare your query
query_text = "beach house"

# Generate embedding for the search query
query_float32_embeddings = get_embedding(query_text, precision="float32")
query_int8_embeddings = get_embedding(query_text, precision="int8")
query_int1_embeddings = get_embedding(query_text, precision="ubinary")

# Convert each embedding to BSON format
query_bson_float32_embeddings = generate_bson_vector(query_float32_embeddings, BinaryVectorDtype.FLOAT32)
query_bson_int8_embeddings = generate_bson_vector(query_int8_embeddings, BinaryVectorDtype.INT8)
query_bson_int1_embeddings = generate_bson_vector(query_int1_embeddings, BinaryVectorDtype.PACKED_BIT)

# Define vector search pipeline for each precision
pipelines = []
for query_embedding, path in zip(
    [query_bson_float32_embeddings, query_bson_int8_embeddings, query_bson_int1_embeddings],
    ["BSON-Float32-Embedding", "BSON-Int8-Embedding", "BSON-Int1-Embedding"]
):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",  # Adjust if necessary
                "queryVector": query_embedding,
                "path": path,
                "exact": True,
                "limit": 5
            }
        },
        {
            "$project": {
                "_id": 0,
                "summary": 1,
                "score": {
                    "$meta": "vectorSearchScore"
                }
            }
        }
    ]
    pipelines.append(pipeline)

# Execute the search for each precision
for pipeline in pipelines:
    print(f"Results for {pipeline[0]['$vectorSearch']['path']}:")
    results = collection.aggregate(pipeline)

    # Print results
    for i in results:
        print(i)
Results for BSON-Float32-Embedding:
{'summary': 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.', 'score': 0.7847104072570801}
{'summary': 'The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.', 'score': 0.7780507802963257}
{'summary': "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", 'score': 0.7723637223243713}
{'summary': 'Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.', 'score': 0.7665778398513794}
{'summary': 'A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.', 'score': 0.7593404650688171}
Results for BSON-Int8-Embedding:
{'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.', 'score': 0.5}
{'summary': 'One bedroom + sofa-bed in quiet and bucolic neighbourhood right next to the Botanical Garden. Small garden, outside shower, well equipped kitchen and bathroom with shower and tub. Easy for transport with many restaurants and basic facilities in the area.', 'score': 0.5}
{'summary': "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", 'score': 0.5}
{'summary': 'Here exists a very cozy room for rent in a shared 4-bedroom apartment. It is located one block off of the JMZ at Myrtle Broadway. The neighborhood is diverse and appeals to a variety of people.', 'score': 0.5}
{'summary': 'Quarto com vista para a Lagoa Rodrigo de Freitas, cartão postal do Rio de Janeiro. Linda Vista. 1 Quarto e 1 banheiro Amplo, arejado, vaga na garagem. Prédio com piscina, sauna e playground. Fácil acesso, próximo da praia e shoppings.', 'score': 0.5}
Results for BSON-Int1-Embedding:
{'summary': 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.', 'score': 0.6901041865348816}
{'summary': 'Cozy and comfortable apartment. Ideal for families and vacations. 3 bedrooms, 2 of them suites. Located 20-min walk to the beach and close to the Rio 2016 Olympics Venues. Situated in a modern and secure condominium, with many entertainment available options around.', 'score': 0.6731770634651184}
{'summary': "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", 'score': 0.6731770634651184}
{'summary': 'The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.', 'score': 0.671875}
{'summary': 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!', 'score': 0.6705729365348816}
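To make the int1 (packed bit) representation used above more concrete, here is a rough pure-Python illustration of binary quantization. It is a sketch under stated assumptions, not the code that the embedding library or Atlas runs: it assumes that a dimension becomes bit 1 when its value is greater than 0 and that the first dimension of each group of eight lands in the most significant bit. The function name pack_bits is hypothetical.

```python
def pack_bits(vector):
    """Binary-quantize a float vector and pack 8 dimensions into each byte.

    Assumptions (for illustration only): a value greater than 0 maps to bit 1,
    and the first dimension of each group goes into the most significant bit.
    """
    if len(vector) % 8 != 0:
        raise ValueError("The number of dimensions must be a multiple of 8.")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift the bits collected so far and append the new bit.
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


sample = [0.1, -0.2, 0.3, 0.4, -0.5, -0.6, 0.7, -0.8]
print(pack_bits(sample))            # [178], which is 0b10110010
print(len(pack_bits([0.5] * 768)))  # 96 bytes for a 768-dimension vector
```

Under these assumptions a 768-dimension embedding shrinks to 96 bytes, which is why int1 vectors are compact to store while keeping only the sign of each dimension.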
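To run a vector search query, generate a query vector to pass into your aggregation pipeline.
For example, this code does the following:
Creates a sample query embedding by calling the embedding function that you defined.
Passes the embedding in the queryVector field and specifies the path to search in the aggregation pipeline.
Specifies the $vectorSearch stage in the pipeline to perform an ENN search on your embeddings.
Runs a sample vector search query on the embeddings stored in your collection.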
# Generate embedding for the search query
query_embedding = get_embedding("beach house")

# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "exact": True,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "summary": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
    print(i)

{"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
{"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
{"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
{"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
{"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
Considerations
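Consider the following factors when you create vector embeddings:
Choose a Method to Create Embeddings
To create vector embeddings, you must use an embedding model. Embedding models are algorithms that convert data into embeddings. You can choose one of the following methods to connect to an embedding model and create vector embeddings: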
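Method | Description |
---|---|
Load an open-source model | If you don't have an API key for a proprietary embedding model, load an open-source embedding model locally from your application. |
Use a proprietary model | Most AI providers offer APIs for their proprietary embedding models, which you can use to create vector embeddings. |
Leverage an integration | You can integrate Atlas Vector Search with open-source frameworks and AI services to quickly connect to both open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search. To learn more, see Integrate Vector Search with AI Technologies. |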
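Choose an Embedding Model
The embedding model that you choose affects your query results and determines the number of dimensions that you specify in your Atlas Vector Search index. Each model offers different advantages depending on your data and use case.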
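Because the model fixes the vector length, one way to keep the index in step with the model is to derive numDimensions from a sample embedding instead of hard-coding it. The following is a small sketch of that idea. The helper name build_index_definition is hypothetical, and the placeholder list stands in for a real embedding, such as the output of the get_embedding function used in this tutorial.

```python
def build_index_definition(sample_embedding, path="embedding", similarity="dotProduct"):
    """Build a vector index definition whose numDimensions matches the model output."""
    if not sample_embedding:
        raise ValueError("The sample embedding is empty.")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "similarity": similarity,
                # The dimension count comes from the embedding itself.
                "numDimensions": len(sample_embedding),
            }
        ]
    }


# Placeholder for a real embedding, for example get_embedding("sample text").
sample_embedding = [0.0] * 768
definition = build_index_definition(sample_embedding)
print(definition["fields"][0]["numDimensions"])  # 768
```

The resulting dictionary has the same shape as the definition argument passed to SearchIndexModel in the examples on this page.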
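For a list of popular embedding models, see Massive Text Embedding Benchmark (MTEB). This list provides insights into various open-source and proprietary text embedding models and allows you to filter models by use case, model type, and specific model metrics.
When you choose an embedding model for Atlas Vector Search, consider the following metrics: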
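Validate Your Embeddings
Consider the following strategies to ensure that your embeddings are correct and optimal:
Test your functions and scripts.
Generating embeddings takes time and computational resources. Before you create embeddings from large datasets or collections, test that your embedding functions or scripts work as expected on a small subset of your data.
Create embeddings in batches.
If you generate embeddings from a large dataset or from a collection with many documents, create them in batches to avoid memory issues and optimize performance.
Evaluate performance.
Run test queries to check whether your search results are relevant and accurately ranked.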
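The batching strategy above can be sketched as follows. This is a minimal illustration rather than a production script: embed_batch is a hypothetical callable that takes a list of strings and returns one embedding per string, and the default batch size of 32 is an arbitrary assumption. A toy embedding function is used so that the sketch runs without a model, which also mirrors the advice to test on a small subset first.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1.")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=32):
    """Embed `texts` batch by batch and return one embedding per text, in order.

    `embed_batch` is a hypothetical callable: list of strings in,
    list of embeddings of the same length out.
    """
    embeddings = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("The embedding function returned an unexpected number of vectors.")
        embeddings.extend(result)
    return embeddings


# Toy embedding function so that the sketch runs without a model.
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
print(embed_in_batches(texts, fake_embed_batch, batch_size=2))
# [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

Only one batch of inputs and outputs is in flight at a time, so peak memory depends on the batch size rather than on the size of the whole dataset.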
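Troubleshooting
Consider the following strategies if you experience issues with your embeddings:
Verify your environment.
Check that the necessary dependencies are installed and up to date. Conflicting library versions can cause unexpected behavior. To rule out conflicts, create a new environment and install only the required packages.
Note
If you use Colab, ensure that the IP address of your notebook session is included in the access list of your Atlas project.
Monitor memory usage.
If you experience performance issues, check your RAM, CPU, and disk usage to identify potential bottlenecks. For hosted environments such as Colab or Jupyter Notebooks, ensure that your instance is provisioned with sufficient resources, and upgrade the instance if necessary.
Ensure consistent dimensions.
Verify that the Atlas Vector Search index definition matches the dimensions of the embeddings stored in Atlas, and that your query embeddings match the dimensions of the indexed embeddings. Otherwise, you might encounter errors when you run vector search queries.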
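A quick way to catch a dimension mismatch before you run a query is to compare the embedding length with the numDimensions value in your index definition. The sketch below assumes that the embedding is a plain list of numbers (not a BSON binary vector, whose length in bytes differs from its dimension count) and that the definition has the same shape as the ones on this page. The function name check_dimensions is hypothetical.

```python
def check_dimensions(index_definition, path, embedding):
    """Return the indexed dimension count, or raise if `embedding` does not match it.

    Assumption: `embedding` is a plain list of numbers, so len() equals
    the number of dimensions.
    """
    for field in index_definition.get("fields", []):
        if field.get("type") == "vector" and field.get("path") == path:
            expected = field["numDimensions"]
            if len(embedding) != expected:
                raise ValueError(
                    f"Field {path!r} is indexed with {expected} dimensions, "
                    f"but the embedding has {len(embedding)}."
                )
            return expected
    raise KeyError(f"No vector field with path {path!r} in the index definition.")


definition = {
    "fields": [
        {"type": "vector", "path": "embedding", "similarity": "dotProduct", "numDimensions": 1536}
    ]
}
print(check_dimensions(definition, "embedding", [0.0] * 1536))  # 1536
```

Running the same check on a stored document's embedding and on a query embedding covers both halves of the comparison described above.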
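Next Steps
Now that you have learned how to create embeddings and query them with Atlas Vector Search, start building generative AI applications by implementing retrieval-augmented generation (RAG):
You can also convert your embeddings to BSON vectors for efficient storage and ingestion of vectors in Atlas. To learn more, see How to Ingest Pre-Quantized Vectors.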