How to Create Vector Embeddings


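You can store vector embeddings alongside your other data in Atlas. These embeddings capture meaningful relationships in your data, which lets you perform semantic search and implement RAG with Atlas Vector Search.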
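
To make "meaningful relationships" concrete: a vector search ranks documents by how close their embeddings are to the query embedding, and cosine similarity is one common closeness measure. The following Python sketch uses made-up three-dimensional vectors (the names `ship`, `boat`, and `lion` and their values are illustrative assumptions, not model output) to show related items scoring higher than unrelated ones.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models return far more dimensions.
ship = [0.9, 0.1, 0.0]
boat = [0.8, 0.2, 0.1]
lion = [0.0, 0.2, 0.9]

# The two related items point in nearly the same direction.
print(cosine_similarity(ship, boat) > cosine_similarity(ship, lion))  # True
```

In practice the embedding model produces the vectors and Atlas Vector Search performs the comparison; this sketch only illustrates the idea.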

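Use the following tutorial to learn how to create vector embeddings and query them by using Atlas Vector Search. Specifically, you perform the following actions:

  1. Define a function that uses an embedding model to generate vector embeddings.

  2. Create embeddings from your data and store them in Atlas.

  3. Create embeddings from your search terms and run a vector search query.

For production applications, you typically write a script to generate vector embeddings. You can start with the sample code on this page and customize it for your use case.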


➤ Use the Select your language drop-down menu to set the language of the examples on this page.

To complete this tutorial, you must have the following:

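  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • A terminal and code editor to run your C# project.

  • .NET 8.0 or later installed.

  • A Hugging Face access token or an OpenAI API key.

  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • One of the following:

    • A Hugging Face access token with read access.

    • An OpenAI API key. You must have a paid OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.

  • An Atlas account with a cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list. To learn more, see Create a Cluster.

  • A terminal and code editor to run your Node.js project.

  • npm and Node.js installed.

  • If you use OpenAI models, an OpenAI API key.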

1

In a terminal window, run the following commands to initialize your project:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings
2

In a terminal window, run the following command:

dotnet add package MongoDB.Driver
3

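Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and Hugging Face access token available to your project.

export HUGGINGFACE_ACCESS_TOKEN="<access-token>"
export ATLAS_CONNECTION_STRING="<connection-string>"

Replace the <access-token> placeholder value with your Hugging Face access token.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format: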

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
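
If a database username or password contains reserved characters such as `@`, `:`, or `/`, it has to be percent-encoded before it goes into this format. The following Python sketch shows one way to assemble the string. The helper name `build_srv_uri` and the sample credentials and host are hypothetical and used only for illustration.

```python
from urllib.parse import quote_plus

def build_srv_uri(username, password, host):
    """Build a mongodb+srv:// URI, percent-encoding the credentials."""
    if not host.endswith(".mongodb.net"):
        raise ValueError("expected an Atlas host ending in .mongodb.net")
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"

# Hypothetical values; substitute your own database user and cluster host.
uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.abcde.mongodb.net")
print(uri)  # mongodb+srv://app_user:p%40ss%3Aword%2F1@cluster0.abcde.mongodb.net
```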
4

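Create a new class in a same-named file named AIService.cs and paste the following code. This code defines an async Task named GetEmbeddingsAsync that generates an array of embeddings for an array of given string inputs. This function uses the mxbai-embed-large-v1 embedding model.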

AIService.cs
namespace MyCompany.Embeddings;
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
using System.Net.Http.Headers;
public class AIService
{
private static readonly string? HuggingFaceAccessToken = Environment.GetEnvironmentVariable("HUGGINGFACE_ACCESS_TOKEN");
private static readonly HttpClient Client = new HttpClient();
public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
{
const string modelName = "mixedbread-ai/mxbai-embed-large-v1";
const string url = $"https://api-inference.huggingface.co/models/{modelName}";
Client.DefaultRequestHeaders.Authorization
= new AuthenticationHeaderValue("Bearer", HuggingFaceAccessToken);
var data = new { inputs = texts };
var dataJson = JsonSerializer.Serialize(data);
var content = new StringContent(dataJson,null, "application/json");
var response = await Client.PostAsync(url, content);
response.EnsureSuccessStatusCode();
var responseString = await response.Content.ReadAsStringAsync();
var embeddings = JsonSerializer.Deserialize<float[][]>(responseString);
if (embeddings is null)
{
throw new ApplicationException("Failed to deserialize embeddings response to an array of floats.");
}
Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
var embeddingCount = embeddings.Length;
foreach (var value in Enumerable.Range(0, embeddingCount))
{
// Pair each embedding with the text used to generate it.
documentData[texts[value]] = embeddings[value];
}
return documentData;
}
}

Note

503 when calling Hugging Face models

You might occasionally get 503 errors when calling Hugging Face model hub models. To resolve this issue, retry after a short wait.
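
The "retry after a short wait" advice can be automated. The following Python sketch wraps any request function in a retry loop with exponential backoff. It is a minimal illustration, not part of the tutorial code: the `ServiceUnavailable` exception and the `call_with_retry` helper are hypothetical names, and the calling code is assumed to raise `ServiceUnavailable` when it receives an HTTP 503 response.

```python
import time

class ServiceUnavailable(Exception):
    """Hypothetical error that the caller raises on an HTTP 503 response."""

def call_with_retry(request_fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on ServiceUnavailable with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...

# Simulated request that fails twice before succeeding.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ServiceUnavailable("model is loading")
    return [[0.1, 0.2, 0.3]]

# Pass a no-op sleep so the demonstration runs instantly.
result = call_with_retry(flaky_request, sleep=lambda seconds: None)
print(result, attempts["count"])  # [[0.1, 0.2, 0.3]] 3
```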

1

In a terminal window, run the following commands to initialize your project:

dotnet new console -o MyCompany.Embeddings
cd MyCompany.Embeddings
2

In a terminal window, run the following commands:

dotnet add package MongoDB.Driver
dotnet add package OpenAI
3

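Export your environment variables, set them in PowerShell, or use your IDE's environment variable manager to make the connection string and OpenAI API key available to your project.

export OPENAI_API_KEY="<api-key>"
export ATLAS_CONNECTION_STRING="<connection-string>"

Replace the <api-key> placeholder value with your OpenAI API key.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format: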

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4

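Create a new class in a same-named file named AIService.cs and paste the following code. This code defines an async Task named GetEmbeddingsAsync that generates an array of embeddings for an array of given string inputs. This function uses OpenAI's text-embedding-3-small model to generate the embeddings.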

AIService.cs
namespace MyCompany.Embeddings;
using OpenAI.Embeddings;
using System;
using System.Threading.Tasks;
public class AIService
{
private static readonly string? OpenAIApiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
private static readonly string EmbeddingModelName = "text-embedding-3-small";
public async Task<Dictionary<string, float[]>> GetEmbeddingsAsync(string[] texts)
{
EmbeddingClient embeddingClient = new(model: EmbeddingModelName, apiKey: OpenAIApiKey);
Dictionary<string, float[]> documentData = new Dictionary<string, float[]>();
try
{
var result = await embeddingClient.GenerateEmbeddingsAsync(texts);
var embeddingCount = result.Value.Count;
foreach (var index in Enumerable.Range(0, embeddingCount))
{
// Pair each embedding with the text used to generate it.
documentData[texts[index]] = result.Value[index].ToFloats().ToArray();
}
}
catch (Exception e)
{
throw new ApplicationException(e.Message);
}
return documentData;
}
}

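In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model or a proprietary model such as one from OpenAI.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the models.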

1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project
2

In a terminal window, run the following commands:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/mongo
go get github.com/tmc/langchaingo/llms
3

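In your project, create a .env file to store your Atlas connection string and Hugging Face access token.

HUGGINGFACEHUB_API_TOKEN = "<access-token>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <access-token> placeholder value with your Hugging Face access token.

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

Your connection string should use the following format: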

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
4
  1. In your project, create a directory named common to store common code that you'll use in later steps:

    mkdir common && cd common
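  2. Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that generates embeddings for a given input. This function specifies the feature-extraction task and the mixedbread-ai/mxbai-embed-large-v1 embedding model.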

    get-embeddings.go
    package common
    import (
    "context"
    "log"
    "github.com/tmc/langchaingo/embeddings/huggingface"
    )
    func GetEmbeddings(documents []string) [][]float32 {
    hf, err := huggingface.NewHuggingface(
    huggingface.WithModel("mixedbread-ai/mxbai-embed-large-v1"),
    huggingface.WithTask("feature-extraction"))
    if err != nil {
    log.Fatalf("failed to connect to Hugging Face: %v", err)
    }
    embs, err := hf.EmbedDocuments(context.Background(), documents)
    if err != nil {
    log.Fatalf("failed to generate embeddings: %v", err)
    }
    return embs
    }

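    Note

    503 when calling Hugging Face models

    You might occasionally get 503 errors when calling Hugging Face model hub models. To resolve this issue, retry after a short wait.

  3. Return to the main project root directory.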

    cd ../
1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
go mod init my-embeddings-project
2

In a terminal window, run the following commands:

go get github.com/joho/godotenv
go get go.mongodb.org/mongo-driver/mongo
go get github.com/milosgajdos/go-embeddings/openai
3

In your project, create a .env file to store your connection string and OpenAI API key.

OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

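Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format: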

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

4
  1. In your project, create a directory named common to store code that you'll use in multiple steps:

    mkdir common && cd common
  2. Create a file named get-embeddings.go and paste the following code. This code defines a function named GetEmbeddings that uses OpenAI's text-embedding-3-small model to generate embeddings for a given input.

    get-embeddings.go
    package common
    import (
    "context"
    "log"
    "github.com/milosgajdos/go-embeddings/openai"
    )
    func GetEmbeddings(docs []string) [][]float64 {
    c := openai.NewClient()
    embReq := &openai.EmbeddingRequest{
    Input: docs,
    Model: openai.TextSmallV3,
    EncodingFormat: openai.EncodingFloat,
    }
    embs, err := c.Embed(context.Background(), embReq)
    if err != nil {
    log.Fatalf("failed to connect to OpenAI: %v", err)
    }
    var vectors [][]float64
    for _, emb := range embs {
    vectors = append(vectors, emb.Vector)
    }
    return vectors
    }
  3. Return to the main project root directory.

    cd ../

1
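  1. In your IDE, create a Java project by using Maven or Gradle.

  2. Add the following dependencies, depending on your package manager:

    If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file: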

    pom.xml
    <dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with Hugging Face models -->
    <dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-hugging-face</artifactId>
    <version>0.35.0</version>
    </dependency>
    </dependencies>

    If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:

    build.gradle
    dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with Hugging Face models
    implementation 'dev.langchain4j:langchain4j-hugging-face:0.35.0'
    }
  3. Run your package manager to install the dependencies to your project.

2

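Note

This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.

In your IDE, create a new configuration template and add the following variables to your project:

  • If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.

    To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.

  • If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.

    To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.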

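Environment variables
HUGGING_FACE_ACCESS_TOKEN=<access-token>
ATLAS_CONNECTION_STRING=<connection-string>

Update the placeholders with the following values:

  • Replace the <access-token> placeholder value with your Hugging Face access token.

  • Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

    Your connection string should use the following format: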

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
3

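Create a file named EmbeddingProvider.java and paste the following code.

This code defines two methods to generate embeddings for a given input by using the mxbai-embed-large-v1 open-source embedding model:

  • Multiple inputs: The getEmbeddings method accepts an array of text inputs (List<String>), which lets you create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.

  • Single input: The getEmbedding method accepts a single String, which represents a query that you want to make against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.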

EmbeddingProvider.java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
private static HuggingFaceEmbeddingModel embeddingModel;
private static HuggingFaceEmbeddingModel getEmbeddingModel() {
if (embeddingModel == null) {
String accessToken = System.getenv("HUGGING_FACE_ACCESS_TOKEN");
if (accessToken == null || accessToken.isEmpty()) {
throw new RuntimeException("HUGGING_FACE_ACCESS_TOKEN env variable is not set or is empty.");
}
embeddingModel = HuggingFaceEmbeddingModel.builder()
.accessToken(accessToken)
.modelId("mixedbread-ai/mxbai-embed-large-v1")
.waitForModel(true)
.timeout(ofSeconds(60))
.build();
}
return embeddingModel;
}
/**
* Takes an array of strings and returns a BSON array of embeddings to
* store in the database.
*/
public List<BsonArray> getEmbeddings(List<String> texts) {
List<TextSegment> textSegments = texts.stream()
.map(TextSegment::from)
.toList();
Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
return response.content().stream()
.map(e -> new BsonArray(
e.vectorAsList().stream()
.map(BsonDouble::new)
.toList()))
.toList();
}
/**
* Takes a single string and returns a BSON array embedding to
* use in a vector query.
*/
public BsonArray getEmbedding(String text) {
Response<Embedding> response = getEmbeddingModel().embed(text);
return new BsonArray(
response.content().vectorAsList().stream()
.map(BsonDouble::new)
.toList());
}
}
1
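  1. In your IDE, create a Java project by using Maven or Gradle.

  2. Add the following dependencies, depending on your package manager:

    If you are using Maven, add the following dependencies to the dependencies array in your project's pom.xml file: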

    pom.xml
    <dependencies>
    <!-- MongoDB Java Sync Driver v5.2.0 or later -->
    <dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>[5.2.0,)</version>
    </dependency>
    <!-- Java library for working with OpenAI models -->
    <dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-open-ai</artifactId>
    <version>0.35.0</version>
    </dependency>
    </dependencies>

    If you are using Gradle, add the following to the dependencies array in your project's build.gradle file:

    build.gradle
    dependencies {
    // MongoDB Java Sync Driver v5.2.0 or later
    implementation 'org.mongodb:mongodb-driver-sync:[5.2.0,)'
    // Java library for working with OpenAI models
    implementation 'dev.langchain4j:langchain4j-open-ai:0.35.0'
    }
  3. Run your package manager to install the dependencies to your project.

2

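Note

This example sets the variables for the project in the IDE. Production applications might manage environment variables through a deployment configuration, CI/CD pipeline, or secrets manager, but you can adapt the provided code to fit your use case.

In your IDE, create a new configuration template and add the following variables to your project:

  • If you are using IntelliJ IDEA, create a new Application run configuration template, then add your variables as semicolon-separated values in the Environment variables field (for example, FOO=123;BAR=456). Apply the changes and click OK.

    To learn more, see the Create a run/debug configuration from a template section of the IntelliJ IDEA documentation.

  • If you are using Eclipse, create a new Java Application launch configuration, then add each variable as a new key-value pair in the Environment tab. Apply the changes and click OK.

    To learn more, see the Creating a Java application launch configuration section of the Eclipse IDE documentation.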

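Environment variables
OPEN_AI_API_KEY=<api-key>
ATLAS_CONNECTION_STRING=<connection-string>

Update the placeholders with the following values:

  • Replace the <api-key> placeholder value with your OpenAI API key.

  • Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster.

    Your connection string should use the following format: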

    mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
3

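Create a file named EmbeddingProvider.java and paste the following code.

This code defines two methods to generate embeddings for a given input by using the text-embedding-3-small OpenAI embedding model:

  • Multiple inputs: The getEmbeddings method accepts an array of text inputs (List<String>), which lets you create multiple embeddings in a single API call. The method converts the API-provided arrays of floats to BSON arrays of doubles for storing in your Atlas cluster.

  • Single input: The getEmbedding method accepts a single String, which represents a query that you want to make against your vector data. The method converts the API-provided array of floats to a BSON array of doubles to use when querying your collection.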

EmbeddingProvider.java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.model.output.Response;
import org.bson.BsonArray;
import org.bson.BsonDouble;
import java.util.List;
import static java.time.Duration.ofSeconds;
public class EmbeddingProvider {
private static OpenAiEmbeddingModel embeddingModel;
private static OpenAiEmbeddingModel getEmbeddingModel() {
if (embeddingModel == null) {
String apiKey = System.getenv("OPEN_AI_API_KEY");
if (apiKey == null || apiKey.isEmpty()) {
throw new IllegalStateException("OPEN_AI_API_KEY env variable is not set or is empty.");
}
embeddingModel = OpenAiEmbeddingModel.builder()
.apiKey(apiKey)
.modelName("text-embedding-3-small")
.timeout(ofSeconds(60))
.build();
}
return embeddingModel;
}
/**
* Takes an array of strings and returns a BSON array of embeddings to
* store in the database.
*/
public List<BsonArray> getEmbeddings(List<String> texts) {
List<TextSegment> textSegments = texts.stream()
.map(TextSegment::from)
.toList();
Response<List<Embedding>> response = getEmbeddingModel().embedAll(textSegments);
return response.content().stream()
.map(e -> new BsonArray(
e.vectorAsList().stream()
.map(BsonDouble::new)
.toList()))
.toList();
}
/**
* Takes a single string and returns a BSON array embedding to
* use in a vector query.
*/
public BsonArray getEmbedding(String text) {
Response<Embedding> response = getEmbeddingModel().embed(text);
return new BsonArray(
response.content().vectorAsList().stream()
.map(BsonDouble::new)
.toList());
}
}

1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y
2

Configure your project to use ES modules by adding "type": "module" to your package.json file, and then save it.

{
"type": "module",
// other fields...
}
3

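In a terminal window, run the following command:

npm install mongodb @xenova/transformers
4

In your project, create a .env file to store your Atlas connection string.

ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <connection-string> placeholder value with the SRV connection string for your Atlas cluster. Your connection string should use the following format: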

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

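Note

Minimum Node.js Version Requirements

Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.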

5

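Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding that generates an embedding for a given input. This function specifies the feature-extraction task and the Xenova/nomic-embed-text-v1 embedding model, with mean pooling and normalization.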

get-embeddings.js
import { pipeline } from '@xenova/transformers';
// Function to generate embeddings for a given data source
export async function getEmbedding(data) {
const embedder = await pipeline(
'feature-extraction',
'Xenova/nomic-embed-text-v1');
const results = await embedder(data, { pooling: 'mean', normalize: true });
return Array.from(results.data);
}
1

In a terminal window, run the following commands to create a new directory named my-embeddings-project and initialize your project:

mkdir my-embeddings-project
cd my-embeddings-project
npm init -y
2

Configure your project to use ES modules by adding "type": "module" to your package.json file, and then save it.

{
"type": "module",
// other fields...
}
3

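In a terminal window, run the following command:

npm install mongodb openai
4

In your project, create a .env file to store your Atlas connection string and OpenAI API key.

OPENAI_API_KEY = "<api-key>"
ATLAS_CONNECTION_STRING = "<connection-string>"

Replace the <api-key> and <connection-string> placeholder values with your OpenAI API key and the SRV connection string for your Atlas cluster. Your connection string should use the following format: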

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

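Note

Minimum Node.js Version Requirements

Node.js v20.x introduced the --env-file option. If you are using an older version of Node.js, add the dotenv package to your project, or use a different method to manage your environment variables.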

5

Create a file named get-embeddings.js and paste the following code. This code defines a function named getEmbedding that uses OpenAI's text-embedding-3-small model to generate an embedding for a given input.

get-embeddings.js
import OpenAI from 'openai';
// Setup OpenAI configuration
const openai = new OpenAI({apiKey: process.env.OPENAI_API_KEY});
// Function to get the embeddings using the OpenAI API
export async function getEmbedding(text) {
const results = await openai.embeddings.create({
model: "text-embedding-3-small",
input: text,
encoding_format: "float",
});
return results.data[0].embedding;
}

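In this section, you define a function to generate vector embeddings by using an embedding model. Select a tab based on whether you want to use an open-source embedding model from Nomic or a proprietary model from OpenAI.

The open-source example also includes a function to convert your embeddings to BSON binData vectors for efficient processing. Only certain embedding models support byte vector outputs. For embedding models that don't, such as models from OpenAI, enable automatic quantization when you create the Atlas Vector Search index.

Note

Open-source embedding models are free to use and can be loaded locally from your application. Proprietary models require an API key to access the models.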

1

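Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install the dependencies: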

!pip install --quiet --upgrade sentence-transformers pymongo einops
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.10.1
2

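Paste and run the following code in your notebook to create a function that generates vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:

  • Loads the nomic-embed-text-v1 embedding model.

  • Creates a function named get_embedding that uses the model to generate float32 (the default precision), int8, or int1 embeddings for a given text input.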

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings
# Load the embedding model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
# Define a function to generate embeddings in multiple precisions
def get_embedding(data, precision="float32"):
return model.encode(data, precision=precision)
3

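Paste and run the following code in your notebook to create a function that converts vector embeddings by using the PyMongo driver. The function named generate_bson_vector converts the full-fidelity embeddings to BSON float32, int8, and int1 vector subtypes for efficient processing of your vector data.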

from bson.binary import Binary
# Generate BSON vector using `BinaryVectorDtype`
def generate_bson_vector(vector, vector_dtype):
return Binary.from_vector(vector, vector_dtype)
4

Paste and run the following code in your notebook to create a function named create_docs_with_bson_vector_embeddings that creates documents with the embeddings.

from bson.binary import Binary
# Function to create documents with BSON vector embeddings
def create_docs_with_bson_vector_embeddings(bson_float32, bson_int8, bson_int1, data):
docs = []
for i, (bson_f32_emb, bson_int8_emb, bson_int1_emb, text) in enumerate(zip(bson_float32, bson_int8, bson_int1, data)):
doc = {
"_id": i,
"data": text,
"BSON-Float32-Embedding": bson_f32_emb,
"BSON-Int8-Embedding": bson_int8_emb,
"BSON-Int1-Embedding": bson_int1_emb,
}
docs.append(doc)
return docs
5

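Paste and run the following code in your notebook to test the function that generates embeddings by using Nomic AI.

This code generates float32, int8, and int1 embeddings for the strings foo and bar.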

# Example generating embeddings for the strings "foo" and "bar"
data = ["foo", "bar"]
float32_embeddings = get_embedding(data, "float32")
int8_embeddings = get_embedding(data, "int8")
int1_embeddings = get_embedding(data, "ubinary")
print("Float32 Embedding:", float32_embeddings)
print("Int8 Embedding:", int8_embeddings)
print("Int1 Embedding (binary representation):", int1_embeddings)
Float32 Embedding: [
[-0.02980827 0.03841474 -0.02561123 ... -0.0532876
-0.0335409 -0.02591543]
[-0.02748881 0.03717749 -0.03104552 ... 0.02413219 -0.02402252 0.02810651]
]
Int8 Embedding: [
[-128 127 127 ... -128 -128 -128]
[ 126 -128 -128 ... 127 126 127]
]
Int1 Embedding (binary representation): [
[ 77 30 4 131 15 123 146 ... 159 142 205 23 119 120]
[ 79 82 208 180 45 79 209 ... 158 100 141 189 166 173]
]
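
The int1 output above contains values between 0 and 255 because each value is one byte that holds eight one-bit dimensions. The following pure-Python sketch illustrates that packing under the assumption that a dimension maps to 1 when its float value is greater than 0 and that bits fill each byte starting from the most significant bit. The function name `quantize_ubinary` and the sample vector are hypothetical, and the library's own implementation may differ in detail.

```python
def quantize_ubinary(vector):
    """Pack one bit per dimension (1 if value > 0) into unsigned bytes, MSB first."""
    if len(vector) % 8 != 0:
        raise ValueError("number of dimensions must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift earlier bits left and append the new bit on the right.
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed

# Hypothetical 16-dimensional float vector.
floats = [0.3, -0.1, 0.0, 0.7, -0.5, 0.2, 0.9, -0.4,
          -0.2, -0.3, 0.1, 0.4, -0.6, -0.7, -0.8, 0.5]
print(quantize_ubinary(floats))  # [150, 49]
```

Under this scheme the packed vector needs one byte for every eight dimensions, so a 768-dimension embedding would occupy 96 bytes.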
6

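Paste and run the following code in your notebook to test the function that converts your embeddings to BSON vectors by using the PyMongo driver.

This code converts the float32, int8, and int1 embeddings for the strings foo and bar to BSON vectors.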

from bson.binary import Binary, BinaryVectorDtype
bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))
# Print the embeddings
print(f"The converted bson_float32_new_embedding is: {bson_float32_embeddings}")
print(f"The converted bson_int8_new_embedding is: {bson_int8_embeddings}")
print(f"The converted bson_int1_new_embedding is: {bson_int1_embeddings}")
The converted bson_float32_new_embedding is: [Binary(b'\'\x00x0\xf4\ ... x9bL\xd4\xbc', 9), Binary(b'\'\x007 ... \x9e?\xe6<', 9)]
The converted bson_int8_new_embedding is: [Binary(b'\x03\x00\x80\x7f\ ... x80\x80', 9), Binary(b'\x03\x00~\x80 ... \x7f', 9)]
The converted bson_int1_new_embedding is: [Binary(b'\x10\x00M\x1e\ ... 7wx', 9), Binary(b'\x10\x00OR\ ... \xa6\xad', 9)]
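
The leading bytes in the output above follow a fixed layout: the first byte identifies the element type (0x27 for float32, 0x03 for int8, 0x10 for packed bits), the second byte is padding, and the vector elements follow. The following sketch rebuilds that layout with the standard library to show where those bytes come from. The `encode_vector` helper is hypothetical and for illustration only. In your own code, continue to use Binary.from_vector as shown earlier.

```python
import struct

# Element-type codes, as seen in the first byte of the Binary values printed above.
FLOAT32, INT8, PACKED_BIT = 0x27, 0x03, 0x10

def encode_vector(values, dtype, padding=0):
    """Illustrative encoder: dtype byte, padding byte, then the packed elements."""
    header = struct.pack("<BB", dtype, padding)
    if dtype == FLOAT32:
        body = struct.pack(f"<{len(values)}f", *values)  # little-endian 32-bit floats
    elif dtype == INT8:
        body = struct.pack(f"<{len(values)}b", *values)  # signed bytes
    elif dtype == PACKED_BIT:
        body = bytes(values)                             # already-packed unsigned bytes
    else:
        raise ValueError(f"unknown dtype: {dtype:#x}")
    return header + body

print(encode_vector([-128, 127], INT8))           # b'\x03\x00\x80\x7f'
print(encode_vector([77, 30], PACKED_BIT))        # b'\x10\x00M\x1e'
print(len(encode_vector([0.5, -0.25], FLOAT32)))  # 10
```

The first two results match the beginning of the int8 and int1 values printed by the tutorial code, because they use the first elements from the earlier embedding output.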
1

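Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install the dependencies: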

pip install --quiet --upgrade openai pymongo
2

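Paste and run the following code in your notebook to create a function that generates vector embeddings by using a proprietary embedding model from OpenAI. Replace <api-key> with your OpenAI API key. This code does the following:

  • Specifies the text-embedding-3-small embedding model.

  • Creates a function named get_embedding that calls the model's API to generate an embedding for a given text input.

  • Tests the function by generating a single embedding for the string foo.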

import os
from openai import OpenAI
# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()
# Define a function to generate embeddings
def get_embedding(text):
"""Generates vector embeddings for the given text."""
embedding = openai_client.embeddings.create(input = [text], model=model).data[0].embedding
return embedding
# Generate an embedding
get_embedding("foo")
[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]

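Tip

See also:

For API details and a list of available models, refer to the OpenAI documentation.

In this section, you create vector embeddings from your data by using the function that you defined, and then you store these embeddings in a collection in Atlas.

Select a tab based on whether you want to create embeddings from new data or from existing data that you already have in Atlas.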

1

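Create a new class in a same-named file named DataService.cs and paste the following code. This code defines an async Task named AddDocumentsAsync that adds documents to Atlas. This function uses the Collection.InsertManyAsync() C# Driver method to insert a list of the BsonDocument type. Each document contains:

  • A text field that contains the movie summary.

  • An embedding field that contains the array of floats from generating the vector embeddings.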

DataService.cs
namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
private static readonly MongoClient Client = new MongoClient(ConnectionString);
private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
{
var documents = new List<BsonDocument>();
foreach( KeyValuePair<string, float[]> var in embeddings )
{
var document = new BsonDocument
{
{
"text", var.Key
},
{
"embedding", new BsonArray(var.Value)
}
};
documents.Add(document);
}
await Collection.InsertManyAsync(documents);
Console.WriteLine($"Successfully inserted {embeddings.Count} documents into Atlas");
documents.Clear();
}
}
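
The document shape described above (a text field plus an embedding field) can be sketched in Python as follows. The `build_documents` helper and the truncated two-element embeddings are hypothetical and shown only to illustrate the structure.

```python
def build_documents(embeddings_by_text):
    """Turn a {text: embedding} mapping into insert-ready documents."""
    return [
        {"text": text, "embedding": [float(x) for x in vector]}
        for text, vector in embeddings_by_text.items()
    ]

# Hypothetical, truncated embeddings for illustration only.
sample = {
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built": [0.1, 0.2],
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission": [0.3, 0.4],
}
docs = build_documents(sample)
print(len(docs), sorted(docs[0].keys()))  # 2 ['embedding', 'text']
```

With PyMongo, a list like `docs` could then be passed to the collection's insert_many method. Because the mapping is keyed by text, two identical input strings collapse into one document. Keep that in mind if your data can contain duplicates.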
2

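Use the following code to generate embeddings from new data.

Specifically, this code uses the GetEmbeddingsAsync function that you defined to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.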

Program.cs
using MyCompany.Embeddings;
var aiService = new AIService();
var texts = new string[]
{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
};
var embeddings = await aiService.GetEmbeddingsAsync(texts);
var dataService = new DataService();
await dataService.AddDocumentsAsync(embeddings);
3
dotnet run MyCompany.Embeddings.csproj
Successfully inserted 3 documents into Atlas

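You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.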

1

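Create a new class in a same-named file named DataService.cs and paste the following code. This code connects to your Atlas cluster and defines two functions:

  • The GetDocuments method gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field and no embeddings field yet.

  • The AddEmbeddings async Task creates a new embeddings field on documents in the sample_airbnb.listingsAndReviews collection whose summary matches the summary of one of the documents retrieved by the GetDocuments method.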

DataService.cs
namespace MyCompany.Embeddings;
using MongoDB.Driver;
using MongoDB.Bson;
public class DataService
{
private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
private static readonly MongoClient Client = new MongoClient(ConnectionString);
private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
public List<BsonDocument>? GetDocuments()
{
var filter = Builders<BsonDocument>.Filter.And(
Builders<BsonDocument>.Filter.And(
Builders<BsonDocument>.Filter.Exists("summary", true),
Builders<BsonDocument>.Filter.Ne("summary", "")
),
Builders<BsonDocument>.Filter.Exists("embeddings", false)
);
return Collection.Find(filter).Limit(50).ToList();
}
public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
{
var listWrites = new List<WriteModel<BsonDocument>>();
foreach( var kvp in embeddings )
{
var filterForUpdate = Builders<BsonDocument>.Filter.Eq("summary", kvp.Key);
var updateDefinition = Builders<BsonDocument>.Update.Set("embeddings", kvp.Value);
listWrites.Add(new UpdateOneModel<BsonDocument>(filterForUpdate, updateDefinition));
}
var result = await Collection.BulkWriteAsync(listWrites);
listWrites.Clear();
return result.ModifiedCount;
}
}
2

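Use the following code to generate embeddings from an existing collection in Atlas.

Specifically, this code uses the GetDocuments method to retrieve documents, the GetEmbeddingsAsync function that you defined to generate embeddings from each document's summary field, and the AddEmbeddings method to write the embeddings back to the sample_airbnb.listingsAndReviews collection in Atlas.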

Program.cs
using MyCompany.Embeddings;
var dataService = new DataService();
var documents = dataService.GetDocuments();
if (documents != null)
{
Console.WriteLine("Generating embeddings.");
var aiService = new AIService();
var summaries = new List<string>();
foreach (var document in documents)
{
var summary = document.GetValue("summary").ToString();
if (summary != null)
{
summaries.Add(summary);
}
}
try
{
if (summaries.Count > 0)
{
var embeddings = await aiService.GetEmbeddingsAsync(summaries.ToArray());
try
{
var updatedCount = await dataService.AddEmbeddings(embeddings);
Console.WriteLine($"{updatedCount} documents updated successfully.");
} catch (Exception e)
{
Console.WriteLine($"Error adding embeddings to MongoDB: {e.Message}");
}
}
}
catch (Exception e)
{
Console.WriteLine($"Error creating embeddings for summaries: {e.Message}");
}
}
else
{
Console.WriteLine("No documents found");
}
3
dotnet run MyCompany.Embeddings.csproj
Generating embeddings.
50 documents updated successfully.
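
The existing-data flow above (select documents that still need embeddings, embed their summaries, then write one update per document) can be sketched in Python without a database connection. The `build_updates` helper, the sample documents, and the `fake_embed` stand-in for an embedding model are all hypothetical. This sketch matches documents by _id rather than by summary text, which avoids ambiguity when two documents share the same summary.

```python
# Filter: documents with a non-empty "summary" and no "embeddings" field yet.
# Combining $exists and $ne on one field avoids a nested $and.
FILTER = {
    "summary": {"$exists": True, "$ne": ""},
    "embeddings": {"$exists": False},
}

def build_updates(documents, embed_fn):
    """Return (filter, update) pairs that add an "embeddings" field to each document."""
    summaries = [doc["summary"] for doc in documents]
    vectors = embed_fn(summaries)
    if len(vectors) != len(documents):
        raise ValueError("embedding count does not match document count")
    return [
        ({"_id": doc["_id"]}, {"$set": {"embeddings": vector}})
        for doc, vector in zip(documents, vectors)
    ]

# Hypothetical stand-ins for documents from Atlas and for an embedding model.
docs = [{"_id": 1, "summary": "Cozy loft"}, {"_id": 2, "summary": "Beach house"}]
fake_embed = lambda texts: [[float(len(t))] for t in texts]

updates = build_updates(docs, fake_embed)
print(updates[0])  # ({'_id': 1}, {'$set': {'embeddings': [9.0]}})
```

With PyMongo, each pair could be wrapped in an UpdateOne operation and sent in a single bulk_write call, and `FILTER` could be passed to find together with a limit.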
1

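Use the following code to generate embeddings from new data.

Specifically, this code uses the GetEmbeddings function that you defined and the MongoDB Go Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.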

create-embeddings.go
package main
import (
"context"
"fmt"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
var data = []string{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
Text string
Embedding []float32
}
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(ctx, clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_db").Collection("embeddings")
embeddings := common.GetEmbeddings(data)
docsToInsert := make([]interface{}, len(data))
for i, movie := range data {
docsToInsert[i] = TextWithEmbedding{
Text: movie,
Embedding: embeddings[i],
}
}
result, err := coll.InsertMany(ctx, docsToInsert)
if err != nil {
log.Fatalf("failed to insert documents: %v", err)
}
fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
create-embeddings.go
package main
import (
"context"
"fmt"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
var data = []string{
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
}
type TextWithEmbedding struct {
Text string
Embedding []float64
}
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(ctx, clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_db").Collection("embeddings")
embeddings := common.GetEmbeddings(data)
docsToInsert := make([]interface{}, len(data))
for i, movie := range data {
docsToInsert[i] = TextWithEmbedding{
Text: movie,
Embedding: embeddings[i],
}
}
result, err := coll.InsertMany(ctx, docsToInsert)
if err != nil {
log.Fatalf("failed to insert documents: %v", err)
}
fmt.Printf("Successfully inserted %v documents into Atlas\n", len(result.InsertedIDs))
}
2
go run create-embeddings.go
Successfully inserted 3 documents into Atlas

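You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.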

1

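Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:

  • Connects to your Atlas cluster.

  • Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.

  • Generates embeddings from each document's summary field by using the GetEmbeddings function that you defined.

  • Updates each document with a new embeddings field that contains the embedding value by using the MongoDB Go Driver.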

create-embeddings.go
package main
import (
"context"
"log"
"my-embeddings-project/common"
"os"
"github.com/joho/godotenv"
"go.mongodb.org/mongo-driver/bson"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)
func main() {
ctx := context.Background()
if err := godotenv.Load(); err != nil {
log.Println("no .env file found")
}
// Connect to your Atlas cluster
uri := os.Getenv("ATLAS_CONNECTION_STRING")
if uri == "" {
log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
}
clientOptions := options.Client().ApplyURI(uri)
client, err := mongo.Connect(ctx, clientOptions)
if err != nil {
log.Fatalf("failed to connect to the server: %v", err)
}
defer func() { _ = client.Disconnect(ctx) }()
// Set the namespace
coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
filter := bson.D{
{"$and",
bson.A{
bson.D{
{"$and",
bson.A{
bson.D{{"summary", bson.D{{"$exists", true}}}},
bson.D{{"summary", bson.D{{"$ne", ""}}}},
},
}},
bson.D{{"embeddings", bson.D{{"$exists", false}}}},
}},
}
opts := options.Find().SetLimit(50)
cursor, err := coll.Find(ctx, filter, opts)
if err != nil {
log.Fatalf("failed to retrieve documents: %v", err)
}
var listings []common.Listing
if err = cursor.All(ctx, &listings); err != nil {
log.Fatalf("failed to unmarshal retrieved documents to Listing object: %v", err)
}
var summaries []string
for _, listing := range listings {
summaries = append(summaries, listing.Summary)
}
log.Println("Generating embeddings.")
embeddings := common.GetEmbeddings(summaries)
docsToUpdate := make([]mongo.WriteModel, len(listings))
for i := range listings {
docsToUpdate[i] = mongo.NewUpdateOneModel().
SetFilter(bson.D{{"_id", listings[i].ID}}).
SetUpdate(bson.D{{"$set", bson.D{{"embeddings", embeddings[i]}}}})
}
bulkWriteOptions := options.BulkWrite().SetOrdered(false)
result, err := coll.BulkWrite(ctx, docsToUpdate, bulkWriteOptions)
if err != nil {
log.Fatalf("failed to write embeddings to existing documents: %v", err)
}
log.Printf("Successfully added embeddings to %v documents", result.ModifiedCount)
}
2

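To simplify marshalling and unmarshalling Go objects to and from BSON, create a file that contains models for the documents in this collection.

  1. Move into the common directory.

    cd common
  2. Create a file named models.go, and paste the following code into it: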

    models.go
    package common
    import (
    "time"
    "go.mongodb.org/mongo-driver/bson/primitive"
    )
    type Image struct {
    ThumbnailURL string `bson:"thumbnail_url"`
    MediumURL string `bson:"medium_url"`
    PictureURL string `bson:"picture_url"`
    XLPictureURL string `bson:"xl_picture_url"`
    }
    type Host struct {
    ID string `bson:"host_id"`
    URL string `bson:"host_url"`
    Name string `bson:"host_name"`
    Location string `bson:"host_location"`
    About string `bson:"host_about"`
    ThumbnailURL string `bson:"host_thumbnail_url"`
    PictureURL string `bson:"host_picture_url"`
    Neighborhood string `bson:"host_neighborhood"`
    IsSuperhost bool `bson:"host_is_superhost"`
    HasProfilePic bool `bson:"host_has_profile_pic"`
    IdentityVerified bool `bson:"host_identity_verified"`
    ListingsCount int32 `bson:"host_listings_count"`
    TotalListingsCount int32 `bson:"host_total_listings_count"`
    Verifications []string `bson:"host_verifications"`
    }
    type Location struct {
    Type string `bson:"type"`
    Coordinates []float64 `bson:"coordinates"`
    IsLocationExact bool `bson:"is_location_exact"`
    }
    type Address struct {
    Street string `bson:"street"`
    Suburb string `bson:"suburb"`
    GovernmentArea string `bson:"government_area"`
    Market string `bson:"market"`
    Country string `bson:"country"`
    CountryCode string `bson:"country_code"`
    Location Location `bson:"location"`
    }
    type Availability struct {
    Thirty int32 `bson:"availability_30"`
    Sixty int32 `bson:"availability_60"`
    Ninety int32 `bson:"availability_90"`
    ThreeSixtyFive int32 `bson:"availability_365"`
    }
    type ReviewScores struct {
    Accuracy int32 `bson:"review_scores_accuracy"`
    Cleanliness int32 `bson:"review_scores_cleanliness"`
    CheckIn int32 `bson:"review_scores_checkin"`
    Communication int32 `bson:"review_scores_communication"`
    Location int32 `bson:"review_scores_location"`
    Value int32 `bson:"review_scores_value"`
    Rating int32 `bson:"review_scores_rating"`
    }
    type Review struct {
    ID string `bson:"_id"`
    Date time.Time `bson:"date,omitempty"`
    ListingId string `bson:"listing_id"`
    ReviewerId string `bson:"reviewer_id"`
    ReviewerName string `bson:"reviewer_name"`
    Comments string `bson:"comments"`
    }
    type Listing struct {
    ID string `bson:"_id"`
    ListingURL string `bson:"listing_url"`
    Name string `bson:"name"`
    Summary string `bson:"summary"`
    Space string `bson:"space"`
    Description string `bson:"description"`
    NeighborhoodOverview string `bson:"neighborhood_overview"`
    Notes string `bson:"notes"`
    Transit string `bson:"transit"`
    Access string `bson:"access"`
    Interaction string `bson:"interaction"`
    HouseRules string `bson:"house_rules"`
    PropertyType string `bson:"property_type"`
    RoomType string `bson:"room_type"`
    BedType string `bson:"bed_type"`
    MinimumNights string `bson:"minimum_nights"`
    MaximumNights string `bson:"maximum_nights"`
    CancellationPolicy string `bson:"cancellation_policy"`
    LastScraped time.Time `bson:"last_scraped,omitempty"`
    CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"`
    FirstReview time.Time `bson:"first_review,omitempty"`
    LastReview time.Time `bson:"last_review,omitempty"`
    Accommodates int32 `bson:"accommodates"`
    Bedrooms int32 `bson:"bedrooms"`
    Beds int32 `bson:"beds"`
    NumberOfReviews int32 `bson:"number_of_reviews"`
    Bathrooms primitive.Decimal128 `bson:"bathrooms"`
    Amenities []string `bson:"amenities"`
    Price primitive.Decimal128 `bson:"price"`
    WeeklyPrice primitive.Decimal128 `bson:"weekly_price"`
    MonthlyPrice primitive.Decimal128 `bson:"monthly_price"`
    CleaningFee primitive.Decimal128 `bson:"cleaning_fee"`
    ExtraPeople primitive.Decimal128 `bson:"extra_people"`
    GuestsIncluded primitive.Decimal128 `bson:"guests_included"`
    Image Image `bson:"images"`
    Host Host `bson:"host"`
    Address Address `bson:"address"`
    Availability Availability `bson:"availability"`
    ReviewScores ReviewScores `bson:"review_scores"`
    Reviews []Review `bson:"reviews"`
    Embeddings []float32 `bson:"embeddings,omitempty"`
    }
    models.go
    package common
    import (
    "time"
    "go.mongodb.org/mongo-driver/bson/primitive"
    )
    type Image struct {
    ThumbnailURL string `bson:"thumbnail_url"`
    MediumURL string `bson:"medium_url"`
    PictureURL string `bson:"picture_url"`
    XLPictureURL string `bson:"xl_picture_url"`
    }
    type Host struct {
    ID string `bson:"host_id"`
    URL string `bson:"host_url"`
    Name string `bson:"host_name"`
    Location string `bson:"host_location"`
    About string `bson:"host_about"`
    ThumbnailURL string `bson:"host_thumbnail_url"`
    PictureURL string `bson:"host_picture_url"`
    Neighborhood string `bson:"host_neighborhood"`
    IsSuperhost bool `bson:"host_is_superhost"`
    HasProfilePic bool `bson:"host_has_profile_pic"`
    IdentityVerified bool `bson:"host_identity_verified"`
    ListingsCount int32 `bson:"host_listings_count"`
    TotalListingsCount int32 `bson:"host_total_listings_count"`
    Verifications []string `bson:"host_verifications"`
    }
    type Location struct {
    Type string `bson:"type"`
    Coordinates []float64 `bson:"coordinates"`
    IsLocationExact bool `bson:"is_location_exact"`
    }
    type Address struct {
    Street string `bson:"street"`
    Suburb string `bson:"suburb"`
    GovernmentArea string `bson:"government_area"`
    Market string `bson:"market"`
    Country string `bson:"country"`
    CountryCode string `bson:"country_code"`
    Location Location `bson:"location"`
    }
    type Availability struct {
    Thirty int32 `bson:"availability_30"`
    Sixty int32 `bson:"availability_60"`
    Ninety int32 `bson:"availability_90"`
    ThreeSixtyFive int32 `bson:"availability_365"`
    }
    type ReviewScores struct {
    Accuracy int32 `bson:"review_scores_accuracy"`
    Cleanliness int32 `bson:"review_scores_cleanliness"`
    CheckIn int32 `bson:"review_scores_checkin"`
    Communication int32 `bson:"review_scores_communication"`
    Location int32 `bson:"review_scores_location"`
    Value int32 `bson:"review_scores_value"`
    Rating int32 `bson:"review_scores_rating"`
    }
    type Review struct {
    ID string `bson:"_id"`
    Date time.Time `bson:"date,omitempty"`
    ListingId string `bson:"listing_id"`
    ReviewerId string `bson:"reviewer_id"`
    ReviewerName string `bson:"reviewer_name"`
    Comments string `bson:"comments"`
    }
    type Listing struct {
    ID string `bson:"_id"`
    ListingURL string `bson:"listing_url"`
    Name string `bson:"name"`
    Summary string `bson:"summary"`
    Space string `bson:"space"`
    Description string `bson:"description"`
    NeighborhoodOverview string `bson:"neighborhood_overview"`
    Notes string `bson:"notes"`
    Transit string `bson:"transit"`
    Access string `bson:"access"`
    Interaction string `bson:"interaction"`
    HouseRules string `bson:"house_rules"`
    PropertyType string `bson:"property_type"`
    RoomType string `bson:"room_type"`
    BedType string `bson:"bed_type"`
    MinimumNights string `bson:"minimum_nights"`
    MaximumNights string `bson:"maximum_nights"`
    CancellationPolicy string `bson:"cancellation_policy"`
    LastScraped time.Time `bson:"last_scraped,omitempty"`
    CalendarLastScraped time.Time `bson:"calendar_last_scraped,omitempty"`
    FirstReview time.Time `bson:"first_review,omitempty"`
    LastReview time.Time `bson:"last_review,omitempty"`
    Accommodates int32 `bson:"accommodates"`
    Bedrooms int32 `bson:"bedrooms"`
    Beds int32 `bson:"beds"`
    NumberOfReviews int32 `bson:"number_of_reviews"`
    Bathrooms primitive.Decimal128 `bson:"bathrooms"`
    Amenities []string `bson:"amenities"`
    Price primitive.Decimal128 `bson:"price"`
    WeeklyPrice primitive.Decimal128 `bson:"weekly_price"`
    MonthlyPrice primitive.Decimal128 `bson:"monthly_price"`
    CleaningFee primitive.Decimal128 `bson:"cleaning_fee"`
    ExtraPeople primitive.Decimal128 `bson:"extra_people"`
    GuestsIncluded primitive.Decimal128 `bson:"guests_included"`
    Image Image `bson:"images"`
    Host Host `bson:"host"`
    Address Address `bson:"address"`
    Availability Availability `bson:"availability"`
    ReviewScores ReviewScores `bson:"review_scores"`
    Reviews []Review `bson:"reviews"`
    Embeddings []float64 `bson:"embeddings,omitempty"`
    }
  3. Move back into the project root directory.

    cd ../
3
go run create-embeddings.go
2024/10/10 09:58:03 Generating embeddings.
2024/10/10 09:58:12 Successfully added embeddings to 50 documents

You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI.
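The query filter in the Go program selects documents that have a summary field, whose summary is not an empty string, and that do not yet have an embeddings field. As a rough illustration in Python, the same three conditions can be written as a flatter filter document and mirrored by a small local predicate. The matches helper is hypothetical and exists only to make the semantics concrete; in practice the server evaluates the filter.

```python
# Equivalent, flatter form of the nested $and filter: several operators
# on one field are combined with an implicit AND.
query_filter = {
    "summary": {"$exists": True, "$ne": ""},
    "embeddings": {"$exists": False},
}


def matches(doc: dict) -> bool:
    """Local mirror of query_filter for the three operators it uses."""
    has_summary = "summary" in doc          # summary: {$exists: true}
    not_empty = doc.get("summary") != ""    # summary: {$ne: ""}
    not_embedded = "embeddings" not in doc  # embeddings: {$exists: false}
    return has_summary and not_empty and not_embedded


print(matches({"summary": "Cozy flat near the park"}))           # True
print(matches({"summary": ""}))                                  # False
print(matches({"summary": "Cozy flat", "embeddings": [0.1]}))    # False
```

Because already-embedded documents are excluded, running the program again picks up the next batch of up to 50 documents rather than rewriting the same ones.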

1

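Create a file named CreateEmbeddings.java and paste the following code.

This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:

  1. Connect to your Atlas cluster.

  2. Get the array of sample texts.

  3. Generate embeddings from each text by using the getEmbeddings method that you defined previously.

  4. Ingest the embeddings into the sample_db.embeddings collection in Atlas.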

CreateEmbeddings.java
import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.result.InsertManyResult;
import org.bson.BsonArray;
import org.bson.Document;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class CreateEmbeddings {
static List<String> data = Arrays.asList(
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
);
public static void main(String[] args){
String uri = System.getenv("ATLAS_CONNECTION_STRING");
if (uri == null || uri.isEmpty()) {
throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
}
// establish connection and set namespace
try (MongoClient mongoClient = MongoClients.create(uri)) {
MongoDatabase database = mongoClient.getDatabase("sample_db");
MongoCollection<Document> collection = database.getCollection("embeddings");
System.out.println("Creating embeddings for " + data.size() + " documents");
EmbeddingProvider embeddingProvider = new EmbeddingProvider();
// generate embeddings for new inputted data
List<BsonArray> embeddings = embeddingProvider.getEmbeddings(data);
List<Document> documents = new ArrayList<>();
int i = 0;
for (String text : data) {
Document doc = new Document("text", text).append("embedding", embeddings.get(i));
documents.add(doc);
i++;
}
// insert the embeddings into the Atlas collection
List<String> insertedIds = new ArrayList<>();
try {
InsertManyResult result = collection.insertMany(documents);
result.getInsertedIds().values()
.forEach(doc -> insertedIds.add(doc.toString()));
System.out.println("Inserted " + insertedIds.size() + " documents with the following ids to " + collection.getNamespace() + " collection: \n " + insertedIds);
} catch (MongoException me) {
throw new RuntimeException("Failed to insert documents", me);
}
} catch (MongoException me) {
throw new RuntimeException("Failed to connect to MongoDB ", me);
} catch (Exception e) {
throw new RuntimeException("Operation failed: ", e);
}
}
}
2

Save and run the file. The output resembles the following:

Creating embeddings for 3 documents
Inserted 3 documents with the following ids to sample_db.embeddings collection:
[BsonObjectId{value=6735ff620d88451041f6dd40}, BsonObjectId{value=6735ff620d88451041f6dd41}, BsonObjectId{value=6735ff620d88451041f6dd42}]

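You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.

Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

1

Create a file named CreateEmbeddings.java and paste the following code.

This code uses the getEmbeddings method and the MongoDB Java Sync Driver to do the following:

  1. Connect to your Atlas cluster.

  2. Get a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field and no embeddings field yet.

  3. Generate embeddings from each document's summary field by using the getEmbeddings method that you defined previously.

  4. Update each document with a new embeddings field that contains the embedding value.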

CreateEmbeddings.java
import com.mongodb.MongoException;
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.BsonArray;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.ArrayList;
import java.util.List;
public class CreateEmbeddings {
public static void main(String[] args){
String uri = System.getenv("ATLAS_CONNECTION_STRING");
if (uri == null || uri.isEmpty()) {
throw new RuntimeException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
}
// establish connection and set namespace
try (MongoClient mongoClient = MongoClients.create(uri)) {
MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
Bson filterCriteria = Filters.and(
Filters.and(Filters.exists("summary"),
Filters.ne("summary", null),
Filters.ne("summary", "")),
Filters.exists("embeddings", false));
try (MongoCursor<Document> cursor = collection.find(filterCriteria).limit(50).iterator()) {
List<String> summaries = new ArrayList<>();
List<String> documentIds = new ArrayList<>();
while (cursor.hasNext()) {
Document document = cursor.next();
String summary = document.getString("summary");
String id = document.get("_id").toString();
summaries.add(summary);
documentIds.add(id);
}
System.out.println("Generating embeddings for " + summaries.size() + " documents.");
System.out.println("This operation may take up to several minutes.");
EmbeddingProvider embeddingProvider = new EmbeddingProvider();
List<BsonArray> embeddings = embeddingProvider.getEmbeddings(summaries);
List<WriteModel<Document>> updateDocuments = new ArrayList<>();
for (int j = 0; j < summaries.size(); j++) {
UpdateOneModel<Document> updateDoc = new UpdateOneModel<>(
Filters.eq("_id", documentIds.get(j)),
Updates.set("embeddings", embeddings.get(j)));
updateDocuments.add(updateDoc);
}
int updatedDocsCount = 0;
try {
BulkWriteOptions options = new BulkWriteOptions().ordered(false);
BulkWriteResult result = collection.bulkWrite(updateDocuments, options);
updatedDocsCount = result.getModifiedCount();
} catch (MongoException me) {
throw new RuntimeException("Failed to update documents", me);
}
System.out.println("Added embeddings successfully to " + updatedDocsCount + " documents.");
}
} catch (MongoException me) {
throw new RuntimeException("Failed to connect to MongoDB", me);
} catch (Exception e) {
throw new RuntimeException("Operation failed: ", e);
}
}
}
2

Save and run the file. The output resembles the following:

Generating embeddings for 50 documents.
This operation may take up to several minutes.
Added embeddings successfully to 50 documents.

You can also view your vector embeddings in the Atlas UI by navigating to the sample_airbnb.listingsAndReviews collection in your cluster.
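The Java program converts each _id to a string before building its update filters. That matches this sample collection, where the Go model on this page also declares the listing ID as a string, but it would presumably stop matching in a collection whose _id values have another type. A minimal Python sketch of building the update operations while keeping each _id exactly as it was read (build_updates and the sample ids are made-up illustrations, not tutorial code):

```python
from typing import Any, Dict, List, Sequence, Tuple


def build_updates(
    ids: Sequence[Any],
    embeddings: Sequence[Sequence[float]],
    field: str = "embeddings",
) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Return one (filter, update) pair per document."""
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")
    return [
        # The filter reuses the original _id value, so its type is preserved.
        ({"_id": doc_id}, {"$set": {field: list(vector)}})
        for doc_id, vector in zip(ids, embeddings)
    ]


# One string id and one integer id (both hypothetical).
updates = build_updates(["listing-a", 42], [[0.1, 0.2], [0.3, 0.4]])
print(updates[1][0])  # {'_id': 42}
```

Each pair corresponds to one UpdateOneModel in the Java code: the first element is the filter and the second is the $set update.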

1

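Use the following code to generate embeddings from new data.

Specifically, this code uses the getEmbedding function that you defined and the MongoDB Node.js Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection in Atlas.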

create-embeddings.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// Data to embed
const data = [
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]
async function run() {
// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
try {
await client.connect();
const db = client.db("sample_db");
const collection = db.collection("embeddings");
await Promise.all(data.map(async text => {
// Check if the document already exists
const existingDoc = await collection.findOne({ text: text });
if (!existingDoc) {
// Generate an embedding only for new text, by using the function that you defined
const embedding = await getEmbedding(text);
// Ingest data and embedding into Atlas
await collection.insertOne({
text: text,
embedding: embedding
});
console.log(embedding);
}
}));
} catch (err) {
console.log(err.stack);
}
finally {
await client.close();
}
}
run().catch(console.dir);
2
node --env-file=.env create-embeddings.js
[ -0.04323853924870491, -0.008460805751383305, 0.012494648806750774, -0.013014335185289383, ... ]
[ -0.017400473356246948, 0.04922063276171684, -0.002836339408531785, -0.030395228415727615, ... ]
[ -0.016950927674770355, 0.013881809078156948, -0.022074559703469276, -0.02838018536567688, ... ]
node --env-file=.env create-embeddings.js
[ 0.031927742, -0.014192767, -0.021851597, 0.045498233, -0.0077904654, ... ]
[ -0.01664538, 0.013198251, 0.048684783, 0.014485021, -0.018121032, ... ]
[ 0.030449908, 0.046782598, 0.02126599, 0.025799986, -0.015830345, ... ]

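Note

The number of dimensions in the output has been truncated for readability.

You can also view your vector embeddings in the Atlas UI by navigating to the sample_db.embeddings collection in your cluster.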
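The insert step above only writes texts that are not already in the collection. A minimal sketch of that idea in Python, with the duplicate check placed before the embedding call so that no embedding is computed for a text that will be skipped. The embed_missing helper, the in-memory set that stands in for the collection lookup, and the fake embedder are all assumptions for illustration:

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set


def embed_missing(
    texts: Iterable[str],
    already_stored: Set[str],
    embed: Callable[[str], Sequence[float]],
) -> List[Dict]:
    """Build documents only for texts that are not stored yet."""
    new_docs = []
    for text in texts:
        if text in already_stored:
            continue  # skip before paying for an embedding call
        new_docs.append({"text": text, "embedding": list(embed(text))})
    return new_docs


calls = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real model call; records how often it is used."""
    calls.append(text)
    return [float(len(text))]


docs = embed_missing(["alpha", "beta", "gamma"], {"beta"}, fake_embed)
print([d["text"] for d in docs])  # ['alpha', 'gamma']
print(len(calls))                 # 2
```

With a hosted model, skipping the call for existing texts saves both time and API usage on repeated runs.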

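Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

1

Use the following code to generate embeddings from an existing collection in Atlas. Specifically, this code does the following:

  • Connects to your Atlas cluster.

  • Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.

  • Generates embeddings from each document's summary field by using the getEmbedding function that you defined.

  • Updates each document with a new embedding field that contains the embedding value by using the MongoDB Node.js Driver.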

create-embeddings.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './get-embeddings.js';
// Connect to your Atlas cluster
const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
async function run() {
try {
await client.connect();
const db = client.db("sample_airbnb");
const collection = db.collection("listingsAndReviews");
// Filter to exclude null or empty summary fields
const filter = { "summary": { "$nin": [ null, "" ] } };
// Get a subset of documents from the collection
const documents = await collection.find(filter).limit(50).toArray();
// Create embeddings from a field in the collection
let updatedDocCount = 0;
console.log("Generating embeddings for documents...");
await Promise.all(documents.map(async doc => {
// Generate an embedding by using the function that you defined
const embedding = await getEmbedding(doc.summary);
// Update the document with a new embedding field
await collection.updateOne({ "_id": doc._id },
{
"$set": {
"embedding": embedding
}
}
);
updatedDocCount += 1;
}));
console.log("Count of documents updated: " + updatedDocCount);
} catch (err) {
console.log(err.stack);
}
finally {
await client.close();
}
}
run().catch(console.dir);
2
node --env-file=.env create-embeddings.js
Generating embeddings for documents...
Count of documents updated: 50

You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI and expanding the fields in a document.
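The Node.js program starts all 50 embedding calls at once through Promise.all. With a rate-limited embedding API, it may be preferable to cap how many calls run at the same time. The following is a sketch of that idea with Python's asyncio, not the tutorial's method; embed_all, the limit of 5, and the fake embedder are assumptions chosen for illustration:

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def embed_all(
    texts: Sequence[str],
    embed: Callable[[str], Awaitable[List[float]]],
    limit: int = 5,
) -> List[List[float]]:
    """Embed every text, running at most `limit` calls concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(text: str) -> List[float]:
        async with semaphore:
            return await embed(text)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(text) for text in texts))


stats = {"in_flight": 0, "peak": 0}


async def fake_embed(text: str) -> List[float]:
    """Stand-in for a network call; tracks peak concurrency."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return [float(len(text))]


summaries = [f"summary {i}" for i in range(20)]
vectors = asyncio.run(embed_all(summaries, fake_embed, limit=5))
print(len(vectors), stats["peak"] <= 5)
```

Because the results keep the input order, each vector can still be matched to its document by position.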

1

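If you haven't already loaded the get_embedding, generate_bson_vector, and create_docs_with_bson_vector_embeddings functions in your notebook, see Define an Embedding Function to load them.

2

Paste and run the following code in your notebook: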

# Sample data
sentences = [
"Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
"The Lion King: Lion cub and future king Simba searches for his identity",
"Avatar: A marine is dispatched to the moon Pandora on a unique mission",
"Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.",
"The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.",
"Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.",
"Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.",
"The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.",
"Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.",
"The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.",
"Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.",
"The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.",
"Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.",
"The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.",
"E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.",
"Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.",
"Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.",
"Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.",
"Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.",
"Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory."
]
3

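Use the following code to generate embeddings from new data.

Specifically, this code uses the get_embedding function that you defined to generate embeddings from an array of sample texts at float32, int8, and ubinary precision.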

import pymongo
from sentence_transformers.quantization import quantize_embeddings
float32_embeddings = get_embedding(sentences, precision="float32")
int8_embeddings = get_embedding(sentences, precision="int8")
int1_embeddings = get_embedding(sentences, precision="ubinary")
# Print stored embeddings
print("Generated embeddings stored in different variables:")
for i, text in enumerate(sentences):
print(f"\nText: {text}")
print(f"Float32 Embedding: {float32_embeddings[i][:3]}... (truncated)")
print(f"Int8 Embedding: {int8_embeddings[i][:3]}... (truncated)")
print(f"Ubinary Embedding: {int1_embeddings[i][:3]}... (truncated)")
Generated embeddings stored in different variables:
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Float32 Embedding: [-0.01089042 0.05926645 -0.00291325]... (truncated)
Int8 Embedding: [-15 127 56]... (truncated)
Ubinary Embedding: [ 77 30 209]... (truncated)
Text: The Lion King: Lion cub and future king Simba searches for his identity
Float32 Embedding: [-0.05607051 -0.01360618 0.00523855]... (truncated)
Int8 Embedding: [-128 -109 110]... (truncated)
Ubinary Embedding: [ 37 18 151]... (truncated)
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Float32 Embedding: [-0.0275258 0.01144342 -0.02360895]... (truncated)
Int8 Embedding: [-57 -28 -79]... (truncated)
Ubinary Embedding: [ 76 16 144]... (truncated)
Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.
Float32 Embedding: [-0.01759741 0.03254957 -0.02090798]... (truncated)
Int8 Embedding: [-32 40 -61]... (truncated)
Ubinary Embedding: [ 77 27 176]... (truncated)
Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.
Float32 Embedding: [ 0.00503172 0.04311579 -0.00074904]... (truncated)
Int8 Embedding: [23 74 70]... (truncated)
Ubinary Embedding: [215 26 145]... (truncated)
Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.
Float32 Embedding: [0.02349479 0.05669326 0.00458773]... (truncated)
Int8 Embedding: [ 69 118 105]... (truncated)
Ubinary Embedding: [237 154 159]... (truncated)
Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.
Float32 Embedding: [-0.03294644 0.02671233 -0.01864981]... (truncated)
Int8 Embedding: [-70 21 -47]... (truncated)
Ubinary Embedding: [ 77 90 146]... (truncated)
Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.
Float32 Embedding: [-0.02489671 0.02847196 -0.00290637]... (truncated)
Int8 Embedding: [-50 27 56]... (truncated)
Ubinary Embedding: [ 95 154 129]... (truncated)
Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.
Float32 Embedding: [-0.01235448 0.01524397 -0.01063425]... (truncated)
Int8 Embedding: [-19 -15 5]... (truncated)
Ubinary Embedding: [ 68 26 210]... (truncated)
Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.
Float32 Embedding: [ 0.04665203 0.01392298 -0.01743002]... (truncated)
Int8 Embedding: [127 -20 -39]... (truncated)
Ubinary Embedding: [207 88 208]... (truncated)
Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.
Float32 Embedding: [0.00929601 0.04206405 0.00701248]... (truncated)
Int8 Embedding: [ 34 71 121]... (truncated)
Ubinary Embedding: [228 90 130]... (truncated)
Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.
Float32 Embedding: [-0.01451324 -0.00897367 0.0077793 ]... (truncated)
Int8 Embedding: [-24 -94 127]... (truncated)
Ubinary Embedding: [ 57 150 32]... (truncated)
Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.
Float32 Embedding: [-0.01458643 0.03639758 -0.02587282]... (truncated)
Int8 Embedding: [-25 52 -94]... (truncated)
Ubinary Embedding: [ 78 218 216]... (truncated)
Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.
Float32 Embedding: [-0.00205381 -0.00039482 -0.01630799]... (truncated)
Int8 Embedding: [ 6 -66 -31]... (truncated)
Ubinary Embedding: [ 9 82 154]... (truncated)
Text: E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.
Float32 Embedding: [ 0.01105334 0.00776658 -0.03092942]... (truncated)
Int8 Embedding: [ 38 -40 -128]... (truncated)
Ubinary Embedding: [205 24 146]... (truncated)
Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.
Float32 Embedding: [ 0.00266668 -0.01926583 -0.00727963]... (truncated)
Int8 Embedding: [ 17 -128 27]... (truncated)
Ubinary Embedding: [148 82 194]... (truncated)
Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.
Float32 Embedding: [-0.00031873 -0.01352339 -0.02882693]... (truncated)
Int8 Embedding: [ 10 -109 -114]... (truncated)
Ubinary Embedding: [ 12 26 144]... (truncated)
Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.
Float32 Embedding: [ 0.00957429 0.01855557 -0.02353773]... (truncated)
Int8 Embedding: [ 34 -5 -79]... (truncated)
Ubinary Embedding: [212 18 144]... (truncated)
Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.
Float32 Embedding: [-0.01787405 0.03672816 -0.00972007]... (truncated)
Int8 Embedding: [-33 53 11]... (truncated)
Ubinary Embedding: [ 68 154 145]... (truncated)
Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory.
Float32 Embedding: [-0.03515214 -0.00503571 0.00183181]... (truncated)
Int8 Embedding: [-76 -81 87]... (truncated)
Ubinary Embedding: [ 35 222 152]... (truncated)
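The ubinary values in the output above are consistent with sign-based binary quantization: each dimension contributes one bit (1 when the float value is positive, 0 otherwise), and every eight bits are packed into one unsigned byte with the first dimension in the most significant bit. For example, the first Titanic byte 77 is 01001101 in binary, and its leading bits 0, 1, 0 match the signs of the first three float32 values. A minimal pure-Python sketch of that packing (pack_bits is a hypothetical helper, not the library's implementation):

```python
from typing import List, Sequence


def pack_bits(vector: Sequence[float]) -> List[int]:
    """Binary-quantize a vector: one bit per dimension, eight per byte."""
    if len(vector) % 8 != 0:
        raise ValueError("this sketch assumes a dimension that is a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            # Shift left, then append 1 for positive values and 0 otherwise.
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return packed


# Signs: + - + + - - + +  ->  bits 10110011  ->  179
print(pack_bits([0.1, -0.2, 0.3, 0.4, -0.5, -0.6, 0.7, 0.8]))  # [179]
```

Packing reduces storage to one bit per dimension, so a 1024-dimension vector fits in 128 bytes.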
4

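Use the following code to convert the generated vector embeddings to BSON vectors. The code uses the PyMongo driver.

Specifically, this code converts the float32, int8, and bit-packed int1 embeddings to BSON vectors of the matching type (FLOAT32, INT8, and PACKED_BIT), and then prints the converted vectors.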
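As background for reading the printed bytes, a BSON binary vector starts with two header bytes followed by the data: a dtype byte (0x27 for float32, 0x03 for int8, 0x10 for packed bits) and a padding byte that counts the bits to ignore in the last byte of a packed-bit vector. The sketch below reproduces that layout with the standard struct module, assuming little-endian data; encode_vector and decode_vector are illustrative helpers, and real code should rely on the driver (for example, the generate_bson_vector function defined earlier) rather than hand-built bytes.

```python
import struct
from typing import List, Sequence, Tuple

# dtype bytes of the BSON binary vector format
FLOAT32 = 0x27
INT8 = 0x03
PACKED_BIT = 0x10


def encode_vector(dtype: int, values: Sequence, padding: int = 0) -> bytes:
    """Lay out a vector payload: dtype byte, padding byte, then the data."""
    header = bytes([dtype, padding])
    if dtype == FLOAT32:
        return header + struct.pack(f"<{len(values)}f", *values)
    if dtype == INT8:
        return header + struct.pack(f"<{len(values)}b", *values)
    if dtype == PACKED_BIT:
        return header + bytes(values)
    raise ValueError(f"unsupported dtype byte: {dtype:#x}")


def decode_vector(payload: bytes) -> Tuple[int, int, List]:
    """Split a payload back into (dtype, padding, values)."""
    dtype, padding, body = payload[0], payload[1], payload[2:]
    if dtype == FLOAT32:
        values = list(struct.unpack(f"<{len(body) // 4}f", body))
    elif dtype == INT8:
        values = list(struct.unpack(f"<{len(body)}b", body))
    elif dtype == PACKED_BIT:
        values = list(body)
    else:
        raise ValueError(f"unsupported dtype byte: {dtype:#x}")
    return dtype, padding, values


print(encode_vector(INT8, [-15, 127, 56]))   # b'\x03\x00\xf1\x7f8'
print(decode_vector(b"\x10\x00M\x1e\xd1"))   # (16, 0, [77, 30, 209])
```

The two printed values correspond to the start of the Int8 and Int1 BSON output for the Titanic text and to its int8 and ubinary embeddings from the previous step. The tutorial's conversion code follows.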

from bson.binary import Binary, BinaryVectorDtype
bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []
# Convert each embedding to BSON
for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))
# Print the embeddings
for idx, text in enumerate(sentences):
print(f"\nText: {text}")
print(f"Float32 BSON: {bson_float32_embeddings[idx]}")
print(f"Int8 BSON: {bson_int8_embeddings[idx]}")
print(f"Int1 BSON: {bson_int1_embeddings[idx]}")
Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
Float32 BSON: b'\'\x00\xbam2\xbc`\xc1r=7\xec>\xbb\xe6\xf3\x...'
Int8 BSON: b'\x03\x00\xf1\x7f8\xdf\xfeC\x1e>\xef\xd6\xf5\x9...'
Int1 BSON: b'\x10\x00M\x1e\xd1\xd2\x05\xaeq\xdf\x9a\x1d\xbc...'
Text: The Lion King: Lion cub and future king Simba searches for his identity
Float32 BSON: b'\'\x001\xaae\xbdr\xec^\xbc"\xa8\xab;\x91\xd...'
Int8 BSON: b'\x03\x00\x80\x93n\x06\x80\xca\xd3.\xa2\xe3\xd1...'
Int1 BSON: b'\x10\x00%\x12\x97\xa6\x8f\xdf\x89\x9d2\xcb\x99...'
Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
Float32 BSON: b'\'\x00\xcc}\xe1\xbc-};<\x8eg\xc1\xbc\xcb\xd...'
Int8 BSON: b'\x03\x00\xc7\xe4\xb1\xdf/\xe2\xd2\x90\xf7\x02|...'
Int1 BSON: b'\x10\x00L\x10\x90\xb6\x0f\x8a\x91\xaf\x92|\xf9...'
Text: Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.
Float32 BSON: b'\'\x00o(\x90\xbc\xb3R\x05=8G\xab\xbc\xfb\xc...'
Int8 BSON: b'\x03\x00\xe0(\xc3\x10*\xda\xfe\x19\xbf&<\xd1\x...'
Int1 BSON: b'\x10\x00M\x1b\xb0\x86\rn\x93\xaf:w\x9f}\x92\xd...'
Text: The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.
Float32 BSON: b'\'\x00\x1d\xe1\xa4;0\x9a0=C[D\xba\xb5\xf2\x...'
Int8 BSON: b'\x03\x00\x17JF2\xb9\xddZ8\xa1\x0c\xc6\x80\xd8$...'
Int1 BSON: b'\x10\x00\xd7\x1a\x91\x87\x0e\xc9\x91\x8b\xba\x...'
Text: Forrest Gump: A man with a low IQ recounts several decades of extraordinary events in his life.
Float32 BSON: b'\'\x00#x\xc0<27h=\xb5T\x96;:\xc4\x9c\xbd\x1...'
Int8 BSON: b'\x03\x00Evi\x80\x13\xd6\x1cCW\x80\x01\x9e\xe58...'
Int1 BSON: b'\x10\x00\xed\x9a\x9f\x97\x1f.\x12\xf9\xba];\x7...'
Text: Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.
Float32 BSON: b'\'\x00\xd9\xf2\x06\xbd\xd2\xd3\xda<\x7f\xc7...'
Int8 BSON: b'\x03\x00\xba\x15\xd1-\x0c\x03\xe6\xea\rQ\x1f\x...'
Int1 BSON: b'\x10\x00MZ\x92\xb7#\xaa\x99=\x9a\x99\x9c|<\xf8...'
Text: The Matrix: A hacker discovers the true nature of reality and his role in the war against its controllers.
Float32 BSON: b'\'\x00/\xf4\xcb\xbc\t>\xe9<\xc9x>\xbb\xcc\x...'
Int8 BSON: b'\x03\x00\xce\x1b815\xcf1\xc6s\xe5\n\xe4\x192G\...'
Int1 BSON: b'\x10\x00_\x9a\x81\xa6\x0f\x0f\x93o2\xd8\xfe|\x...'
Text: Star Wars: A young farm boy is swept into the struggle between the Rebel Alliance and the Galactic Empire.
Float32 BSON: b'\'\x00sjJ\xbc\xd6\xc1y<I;.\xbc\xb1\x80\t\xb...'
Int8 BSON: b'\x03\x00\xed\xf1\x05\xe2\xc7\xfa\xd4\xab5\xeb\...'
Int1 BSON: b'\x10\x00D\x1a\xd2\x86\x0ey\x92\x8f\xaa\x89\x1c...'
Text: The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.
Float32 BSON: b'\'\x004\x16?=9\x1dd<g\xc9\x8e\xbc\xdf\x81\x...'
Int8 BSON: b'\x03\x00\x7f\xec\xd9\xdc)\xd6)\x05\x18\x7f\xa6...'
Int1 BSON: b"\x10\x00\xcfX\xd0\xb7\x0e\xcf\xd9\r\xf0U\xb4]6..."
Text: Indiana Jones and the Last Crusade: An archaeologist pursues the Holy Grail while confronting adversaries from the past.
Float32 BSON: b'\'\x00HN\x18<ZK,=\xf9\xc8\xe5;\x9e\xed\xa0\...'
Int8 BSON: b'\x03\x00"Gy\x01\xeb\xec\xfc\x80\xe4a\x7f\x88\x...'
Int1 BSON: b'\x10\x00\xe4Z\x82\xb6\xad\xec\x10-\x9a\x99;?j\...'
Text: The Dark Knight: Batman faces a new menace, the Joker, who plunges Gotham into anarchy.
Float32 BSON: b'\'\x00\xef\xc8m\xbcJ\x06\x13\xbcv\xe9\xfe;...'
Int8 BSON: b'\x03\x00\xe8\xa2\x7fIE\xba\x9f\xfaT2\xf1\xc1\...'
Int1 BSON: b'\x10\x009\x96 \xb7\x8e\xc9\x81\xaf\xaa\x9f\xa...'
Text: Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.
Float32 BSON: b'\'\x00\xee\xfbn\xbc\xa0\x15\x15=<\xf3\xd3x...'
Int8 BSON: b'\x03\x00\xe74\xa2\xe5\x15\x165\xb9dM8C\xd7E\x...'
Int1 BSON: b'\x10\x00N\xda\xd8\xb6\x03N\x98\xbd\xdaY\x1b| ...'
Text: The Silence of the Lambs: A young FBI agent seeks the help of an incarcerated cannibalistic killer to catch another serial killer.
Float32 BSON: b'\'\x002\x99\x06\xbb\x82\x00\xcf\xb9X\x98\x...'
Int8 BSON: b'\x03\x00\x06\xbe\xe1.\x7f\x80\x04C\xd7e\x80\x...'
Int1 BSON: b'\x10\x00\tR\x9a\xd6\x0c\xb1\x9a\xbc\x90\xf5\x...'
Text: E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.
Float32 BSON: b'\'\x00\x14\x195<\xd4~\xfe;\xb3_\xfd\xbc \xe...'
Int8 BSON: b'\x03\x00&\xd8\x80\x92\x01\x7f\xbfF\xd4\x10\xf0...'
Int1 BSON: b'\x10\x00\xcd\x18\x92\x92\x8dJ\x92\xbd\x9a\xd3\...'
Text: Saving Private Ryan: During WWII, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.
Float32 BSON: b'\'\x00\x8a\xc3.;_\xd3\x9d\xbc\xf2\x89\xee\x...'
Int8 BSON: b'\x03\x00\x11\x80\x1b5\xe9\x19\x80\x8f\xb1N\xda...'
Int1 BSON: b"\x10\x00\x94R\xc2\xd2\x0f\xfa\x90\xbc\xd8\xd6\...'
Text: Gladiator: A once-powerful Roman general seeks vengeance against the corrupt emperor who betrayed his family.
Float32 BSON: b'\'\x00\xe6\x1a\xa7\xb94\x91]\xbcs&\xec\xbc\...'
Int8 BSON: b'\x03\x00\n\x93\x8e,n\xce\xe8\x9b@\x00\xf9\x7f\...'
Int1 BSON: b'\x10\x00\x0c\x1a\x90\x97\x0f\x19\x80/\xba\x98\...'
Text: Rocky: A small-time boxer gets a once-in-a-lifetime chance to fight the world heavyweight champion.
Float32 BSON: b'\'\x00\x7f\xdd\x1c<\xd9\x01\x98<1\xd2\xc0\xb...'
Int8 BSON: b'\x03\x00"\xfb\xb1\x7f\xd3\xd6\x04\xbe\x80\xf9L\...'
Int1 BSON: b'\x10\x00\xd4\x12\x90\xa6\x8by\x99\x8d\xa2\xbd\x...'
Text: Pirates of the Caribbean: Jack Sparrow races to recover the heart of Davy Jones to escape eternal servitude.
Float32 BSON: b'\'\x00\x98l\x92\xbcDp\x16=\xf0@\x1f\xbc\xd0\...'
Int8 BSON: b'\x03\x00\xdf5\x0b\xe3\xbf\xe5\xa5\xad\x7f\x02\x...'
Int1 BSON: b'\x10\x00D\x9a\x91\x96\x07\xfa\x93\x8d\xb2D\x92]...'
Text: Schindler's List: The true story of a man who saved hundreds of Jews during the Holocaust by employing them in his factory.
Float32 BSON: b'\'\x00\xb0\xfb\x0f\xbd\x9b\x02\xa5\xbbZ\x19\...'
Int8 BSON: b'\x03\x00\xb4\xafW\xd9\xd7\xc3\x7f~QM\x86\x83\xf...'
Int1 BSON: b'\x10\x00#\xde\x98\x96\x0e\xcc\x12\xf6\xbb\xdd2}...'
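Each printed value above begins with two header bytes before the vector data. The following is a minimal sketch, not PyMongo's implementation, of how such a payload could be assembled by hand. It assumes the BSON binary vector layout of one dtype byte (0x27 for float32, 0x03 for int8, 0x10 for packed bits), one padding byte, and then little-endian data. The encode_vector helper and the constant names are hypothetical and exist only for illustration; in practice, the generate_bson_vector function does this work for you.

```python
import struct

# Dtype bytes assumed from the BSON binary vector layout. They match the
# first byte of each printed value above (b"'" is 0x27).
FLOAT32 = 0x27
INT8 = 0x03
PACKED_BIT = 0x10


def encode_vector(values, dtype, padding=0):
    """Hypothetical helper: return header bytes followed by the vector data."""
    header = struct.pack("<BB", dtype, padding)
    if dtype == FLOAT32:
        # Each element becomes a 4-byte little-endian float.
        body = struct.pack(f"<{len(values)}f", *values)
    elif dtype == INT8:
        # Each element becomes one signed byte.
        body = struct.pack(f"<{len(values)}b", *values)
    elif dtype == PACKED_BIT:
        # Each element is already one byte holding eight packed bits.
        body = bytes(values)
    else:
        raise ValueError(f"unsupported dtype: {dtype:#x}")
    return header + body


print(encode_vector([1.0, -2.0], FLOAT32))
print(encode_vector([-1, 127], INT8))
print(encode_vector([0b10100000], PACKED_BIT))
```

Under this layout, a 1024-dimension float32 vector occupies 2 + 4 * 1024 = 4098 bytes, while the same vector packed as bits occupies 2 + 128 = 130 bytes.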
5

Use the following code to create documents with the BSON vector embeddings. The code uses the create_docs_with_bson_vector_embeddings function to create the documents.

# Create BSON documents
docs = create_docs_with_bson_vector_embeddings(bson_float32_embeddings, bson_int8_embeddings, bson_int1_embeddings, sentences)
6

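Paste the following code in your notebook, replace <connection-string> with your Atlas cluster's SRV connection string, and run the code.

Note

Your connection string should use the following format: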

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]
# Ingest data into Atlas
collection.insert_many(docs)
InsertManyResult([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], acknowledged=True)

You can verify your vector embeddings by viewing them in the Atlas UI, in the sample_db.embeddings namespace of your cluster.

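Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

1

Use the following code to generate embeddings from new data.

Specifically, this code uses the get_embedding function that you defined and the MongoDB PyMongo Driver to generate embeddings from an array of sample texts and ingest them into the sample_db.embeddings collection.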

import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]
# Sample data
data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]
# Ingest data into Atlas
inserted_doc_count = 0
for text in data:
    embedding = get_embedding(text)
    collection.insert_one({ "text": text, "embedding": embedding })
    inserted_doc_count += 1
print(f"Inserted {inserted_doc_count} documents.")
Inserted 3 documents.
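
The loop above sends one insert_one call per text. As a hedged alternative sketch, you could build all of the documents first and insert them in batches with insert_many, the same PyMongo method used earlier on this page. The build_docs and chunked helpers and the fake_embedding stand-in are hypothetical; replace fake_embedding with the get_embedding function that you defined.

```python
from typing import Callable, Iterator, List, Sequence


def build_docs(texts: Sequence[str], embed: Callable[[str], List[float]]) -> List[dict]:
    """Pair each text with its embedding in the document shape used above."""
    return [{"text": text, "embedding": embed(text)} for text in texts]


def chunked(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_embedding(text: str) -> List[float]:
    """Stand-in for get_embedding so that this sketch runs without a model."""
    return [float(len(text)), 0.0, 1.0]


data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission",
]
docs = build_docs(data, fake_embedding)
for batch in chunked(docs, 2):
    # With a live connection, this is where you could call:
    # collection.insert_many(list(batch))
    print(f"Would insert {len(batch)} documents.")
```

Batching reduces the number of round trips to the cluster, which tends to matter more as the number of documents grows.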
2

Replace <connection-string> with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
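
If your database username or password contains characters such as @, :, or /, they need to be percent-encoded before you place them in the connection string. The sketch below shows one way to assemble a string in the format above by using urllib.parse.quote_plus from the Python standard library. The build_srv_uri helper, the credentials, and the host name are hypothetical placeholders.

```python
from urllib.parse import quote_plus


def build_srv_uri(username: str, password: str, host: str) -> str:
    """Hypothetical helper: percent-encode the credentials and build the URI."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"


# Placeholder values for illustration only.
uri = build_srv_uri("app_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(uri)
```

You could then pass the resulting string to pymongo.MongoClient in place of the <connection-string> placeholder.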
3

You can verify your vector embeddings by navigating to the sample_db.embeddings collection in your cluster and viewing them in the Atlas UI.

1

If you haven't already loaded the get_embedding and generate_bson_vector functions in your notebook, see Define an Embedding Function to load them.

2

Fetch the data from your Atlas cluster and load it into your notebook. The following code gets the summary field from 50 documents in the sample_airbnb.listingsAndReviews collection.

import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]
# Define a filter to exclude documents with null or empty 'summary' fields
summary_filter = {"summary": {"$nin": [None, ""]}}
# Get a subset of documents in the collection
documents = collection.find(summary_filter, {'_id': 1, 'summary': 1}).limit(50)
3

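The following code first generates float32, int8, and int1 embeddings for the data, then converts the embeddings to the BSON float32, int8, and int1 subtypes, and finally uploads the embeddings to your Atlas cluster.

Note

This operation might take several minutes to complete.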

import pymongo
from bson.binary import Binary, BinaryVectorDtype
updated_doc_count = 0
for doc in documents:
    summary = doc["summary"]
    # Generate float32, int8, and int1 embeddings
    float32_embeddings = get_embedding(summary, precision="float32")
    int8_embeddings = get_embedding(summary, precision="int8")
    int1_embeddings = get_embedding(summary, precision="ubinary")
    # Convert each embedding to a BSON vector
    bson_float32_embeddings = generate_bson_vector(float32_embeddings, BinaryVectorDtype.FLOAT32)
    bson_int8_embeddings = generate_bson_vector(int8_embeddings, BinaryVectorDtype.INT8)
    bson_int1_embeddings = generate_bson_vector(int1_embeddings, BinaryVectorDtype.PACKED_BIT)
    # Update each document with new embedding fields
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {
            "BSON-Float32-Embedding": bson_float32_embeddings,
            "BSON-Int8-Embedding": bson_int8_embeddings,
            "BSON-Int1-Embedding": bson_int1_embeddings
        }},
    )
    updated_doc_count += 1
print(f"Updated {updated_doc_count} documents.")
Updated 50 documents.

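Note

This example uses the sample_airbnb.listingsAndReviews collection from the sample data, but you can adapt the code to work with any collection in your cluster.

1

Use the following code to generate embeddings from a field in an existing collection. Specifically, this code does the following:

  • Connects to your Atlas cluster.

  • Gets a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.

  • Generates embeddings from each document's summary field by using the get_embedding function that you defined.

  • Updates each document with a new embedding field that contains the embedding value by using the MongoDB PyMongo Driver.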

import pymongo
# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]
# Filter to exclude null or empty summary fields
filter = { "summary": {"$nin": [ None, "" ]} }
# Get a subset of documents in the collection
documents = collection.find(filter).limit(50)
# Update each document with a new embedding field
updated_doc_count = 0
for doc in documents:
    embedding = get_embedding(doc["summary"])
    collection.update_one( { "_id": doc["_id"] }, { "$set": { "embedding": embedding } } )
    updated_doc_count += 1
print(f"Updated {updated_doc_count} documents.")
Updated 50 documents.
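
If the script is interrupted and rerun, the loop above regenerates embeddings for documents that it already updated. One possible refinement, sketched here under the assumption that the embedding is stored in a field named embedding as in the code above, is to extend the filter with the standard $exists query operator so that only documents without an embedding are selected. The pending_filter helper is hypothetical.

```python
def pending_filter(text_field: str = "summary", vector_field: str = "embedding") -> dict:
    """Match documents that have text to embed but no stored embedding yet."""
    return {
        # Same condition as the filter above: skip null or empty text.
        text_field: {"$nin": [None, ""]},
        # Additional condition: the embedding field has not been written.
        vector_field: {"$exists": False},
    }


query = pending_filter()
print(query)
# With a live connection, you could pass the filter to find():
# documents = collection.find(query).limit(50)
```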
2

Replace <connection-string> with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
3

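You can view the generated vector embeddings by navigating to the sample_airbnb.listingsAndReviews collection in the Atlas UI and expanding the fields in a document.

In this section, you index the vector embeddings in your collection and create an embedding that you use to run a sample vector search query.

When you run the query, Atlas Vector Search returns the documents whose embeddings are closest in distance to the embedding from your vector search query. This closeness indicates that the documents are similar in meaning to the query.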
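
To make the idea of "closest" concrete, the following sketch ranks a few toy three-dimensional vectors against a query vector by dot product and by Euclidean distance. The vectors, labels, and helper names are made up for illustration. Real embeddings from the models on this page have 1024 or 1536 dimensions, and Atlas Vector Search performs the comparison for you on the server.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy stand-ins for stored document embeddings.
documents = {
    "ship sinking": [0.9, 0.1, 0.0],
    "space mission": [0.2, 0.8, 0.1],
    "lion cub": [0.0, 0.2, 0.9],
}
# Toy stand-in for the embedding of the search term.
query_vector = [1.0, 0.0, 0.0]

by_dot = sorted(
    documents,
    key=lambda name: dot_product(query_vector, documents[name]),
    reverse=True,
)
by_distance = sorted(
    documents,
    key=lambda name: euclidean_distance(query_vector, documents[name]),
)
print(by_dot)
print(by_distance)
```

In this toy data set both measures produce the same ranking, but that is not guaranteed in general, which is why the similarity function is part of the index definition.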

1

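To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.

  1. Paste the following code to add a CreateVectorIndex function to the DataService class in DataService.cs.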

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    try
    {
    var searchIndexView = Collection.SearchIndexes;
    var name = "vector_index";
    var type = SearchIndexType.VectorSearch;
    var definition = new BsonDocument
    {
    { "fields", new BsonArray
    {
    new BsonDocument
    {
    { "type", "vector" },
    { "path", "embedding" },
    { "numDimensions", <dimensions> },
    { "similarity", "dotProduct" }
    }
    }
    }
    };
    var model = new CreateSearchIndexModel(name, type, definition);
    searchIndexView.CreateOne(model);
    Console.WriteLine($"New search index named {name} is building.");
    // Polling for index status
    Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
    bool queryable = false;
    while (!queryable)
    {
    var indexes = searchIndexView.List();
    foreach (var index in indexes.ToEnumerable())
    {
    if (index["name"] == name)
    {
    queryable = index["queryable"].AsBoolean;
    }
    }
    if (!queryable)
    {
    Thread.Sleep(5000);
    }
    }
    Console.WriteLine($"{name} is ready for querying.");
    }
    catch (Exception e)
    {
    Console.WriteLine($"Exception: {e.Message}");
    }
    }
    }
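  2. Replace the <dimensions> placeholder value with 1024 if you used the open-source model, or with 1536 if you used the OpenAI model.

  3. Update the code in your Program.cs.

    Remove the code that populated the initial documents and replace it with the following code to create the index: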

    Program.cs
    using MyCompany.Embeddings;
    var dataService = new DataService();
    dataService.CreateVectorIndex();
  4. Save the file, then compile and run your project to create the index:

    dotnet run MyCompany.Embeddings
    New search index named vector_index is building.
    Polling to check if the index is ready. This may take up to a minute.
    vector_index is ready for querying.

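Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.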
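
The numDimensions value in the index definition should match the length of the vectors stored in the indexed field. As a rough pre-flight check, the sketch below compares a sample embedding against a definition before you create the index. The index_definition dictionary mirrors the fields used in the C# code above, with 1024 filled in for the open-source model; the dimensions_match helper is hypothetical.

```python
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1024,  # 1536 if you used the OpenAI model
            "similarity": "dotProduct",
        }
    ]
}


def dimensions_match(embedding, definition, path="embedding"):
    """Return True if the embedding length equals numDimensions for `path`."""
    for field in definition["fields"]:
        if field.get("type") == "vector" and field.get("path") == path:
            return len(embedding) == field["numDimensions"]
    raise KeyError(f"no vector field with path {path!r} in the definition")


sample_embedding = [0.0] * 1024  # stand-in for a real embedding
print(dimensions_match(sample_embedding, index_definition))
```

Running a check like this against one stored document can catch a mismatch between the model you used and the placeholder value you entered.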

2
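  1. Paste the following code to add a PerformVectorQuery function to the DataService class in DataService.cs.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    The following code takes a query vector, passes it to the queryVector field of a $vectorSearch stage, and returns the text field and vector search score of each matching document: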

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_db");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("embeddings");
    public async Task AddDocumentsAsync(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    // Method details...
    }
    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
    var vectorSearchStage = new BsonDocument
    {
    {
    "$vectorSearch",
    new BsonDocument
    {
    { "index", "vector_index" },
    { "path", "embedding" },
    { "queryVector", new BsonArray(vector) },
    { "exact", true },
    { "limit", 5 }
    }
    }
    };
    var projectStage = new BsonDocument
    {
    {
    "$project",
    new BsonDocument
    {
    { "_id", 0 },
    { "text", 1 },
    { "score",
    new BsonDocument
    {
    { "$meta", "vectorSearchScore"}
    }
    }
    }
    }
    };
    var pipeline = new[] { vectorSearchStage, projectStage };
    return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
    }
  2. Update the code in your Program.cs.

    Remove the code that created the vector index and add code to perform a query:

    Program.cs
    using MongoDB.Bson;
    using MyCompany.Embeddings;
    var aiService = new AIService();
    var queryString = "ocean tragedy";
    var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);
    if (!queryEmbedding.Any())
    {
    Console.WriteLine("No embeddings found.");
    }
    else
    {
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
    Console.WriteLine("No documents matched the query.");
    }
    else
    {
    foreach (var document in matchingDocuments)
    {
    Console.WriteLine(document.ToJson());
    }
    }
    }
  3. Save the file, then compile and run your project to perform the query:

    dotnet run MyCompany.Embeddings.csproj
    { "text" : "Titanic: The story of the 1912 sinking of the largest luxury liner ever built", "score" : 100.17414855957031 }
    { "text" : "Avatar: A marine is dispatched to the moon Pandora on a unique mission", "score" : 65.705635070800781 }
    { "text" : "The Lion King: Lion cub and future king Simba searches for his identity", "score" : 52.486415863037109 }
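
For comparison, the same two-stage pipeline can be expressed as plain Python dictionaries. This is a sketch rather than part of the C# project: build_pipeline is a hypothetical helper, and the stage fields (index, path, queryVector, exact, limit) are the ones used in PerformVectorQuery above. With PyMongo, you could pass the resulting list to collection.aggregate().

```python
from typing import List


def build_pipeline(
    query_vector: List[float],
    index: str = "vector_index",
    path: str = "embedding",
    limit: int = 5,
) -> List[dict]:
    """Build a $vectorSearch stage followed by a $project stage."""
    vector_search_stage = {
        "$vectorSearch": {
            "index": index,
            "path": path,
            "queryVector": query_vector,
            "exact": True,  # exact search, as in the C# example
            "limit": limit,
        }
    }
    project_stage = {
        "$project": {
            "_id": 0,
            "text": 1,
            "score": {"$meta": "vectorSearchScore"},
        }
    }
    return [vector_search_stage, project_stage]


# A short toy vector stands in for a real query embedding.
pipeline = build_pipeline([0.1, 0.2, 0.3])
for stage in pipeline:
    print(stage)
```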
1

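To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and the similarity measure as dotProduct.

  1. Paste the following code to add a CreateVectorIndex function to the DataService class in DataService.cs.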

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
    public List<BsonDocument>? GetDocuments()
    {
    // Method details...
    }
    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    try
    {
    var searchIndexView = Collection.SearchIndexes;
    var name = "vector_index";
    var type = SearchIndexType.VectorSearch;
    var definition = new BsonDocument
    {
    { "fields", new BsonArray
    {
    new BsonDocument
    {
    { "type", "vector" },
    { "path", "embeddings" },
    { "numDimensions", <dimensions> },
    { "similarity", "dotProduct" }
    }
    }
    }
    };
    var model = new CreateSearchIndexModel(name, type, definition);
    searchIndexView.CreateOne(model);
    Console.WriteLine($"New search index named {name} is building.");
    // Polling for index status
    Console.WriteLine("Polling to check if the index is ready. This may take up to a minute.");
    bool queryable = false;
    while (!queryable)
    {
    var indexes = searchIndexView.List();
    foreach (var index in indexes.ToEnumerable())
    {
    if (index["name"] == name)
    {
    queryable = index["queryable"].AsBoolean;
    }
    }
    if (!queryable)
    {
    Thread.Sleep(5000);
    }
    }
    Console.WriteLine($"{name} is ready for querying.");
    }
    catch (Exception e)
    {
    Console.WriteLine($"Exception: {e.Message}");
    }
    }
    }
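  2. Replace the <dimensions> placeholder value with 1024 if you used the open-source model, or with 1536 if you used the OpenAI model.

  3. Update the code in your Program.cs.

    Remove the code that added embeddings to the existing documents and replace it with the following code to create the index: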

    Program.cs
    using MyCompany.Embeddings;
    var dataService = new DataService();
    dataService.CreateVectorIndex();
  4. Save the file, then compile and run your project to create the index:

    dotnet run MyCompany.Embeddings
    New search index named vector_index is building.
    Polling to check if the index is ready. This may take up to a minute.
    vector_index is ready for querying.

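Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.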

2
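  1. Paste the following code to add a PerformVectorQuery function to the DataService class in DataService.cs.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    The following code takes a query vector, passes it to the queryVector field of a $vectorSearch stage, and returns the summary field and vector search score of each matching document: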

    DataService.cs
    namespace MyCompany.Embeddings;
    using MongoDB.Driver;
    using MongoDB.Bson;
    public class DataService
    {
    private static readonly string? ConnectionString = Environment.GetEnvironmentVariable("ATLAS_CONNECTION_STRING");
    private static readonly MongoClient Client = new MongoClient(ConnectionString);
    private static readonly IMongoDatabase Database = Client.GetDatabase("sample_airbnb");
    private static readonly IMongoCollection<BsonDocument> Collection = Database.GetCollection<BsonDocument>("listingsAndReviews");
    public List<BsonDocument>? GetDocuments()
    {
    // Method details...
    }
    public async Task<long> AddEmbeddings(Dictionary<string, float[]> embeddings)
    {
    // Method details...
    }
    public void CreateVectorIndex()
    {
    // Method details...
    }
    public List<BsonDocument>? PerformVectorQuery(float[] vector)
    {
    var vectorSearchStage = new BsonDocument
    {
    {
    "$vectorSearch",
    new BsonDocument
    {
    { "index", "vector_index" },
    { "path", "embeddings" },
    { "queryVector", new BsonArray(vector) },
    { "exact", true },
    { "limit", 5 }
    }
    }
    };
    var projectStage = new BsonDocument
    {
    {
    "$project",
    new BsonDocument
    {
    { "_id", 0 },
    { "summary", 1 },
    { "score",
    new BsonDocument
    {
    { "$meta", "vectorSearchScore"}
    }
    }
    }
    }
    };
    var pipeline = new[] { vectorSearchStage, projectStage };
    return Collection.Aggregate<BsonDocument>(pipeline).ToList();
    }
    }
  2. Update the code in your Program.cs.

    Remove the code that created the vector index and add code to perform a query:

    Program.cs
    using MongoDB.Bson;
    using MyCompany.Embeddings;
    var aiService = new AIService();
    var queryString = "beach house";
    var queryEmbedding = await aiService.GetEmbeddingsAsync([queryString]);
    if (!queryEmbedding.Any())
    {
    Console.WriteLine("No embeddings found.");
    }
    else
    {
    var dataService = new DataService();
    var matchingDocuments = dataService.PerformVectorQuery(queryEmbedding[queryString]);
    if (matchingDocuments == null)
    {
    Console.WriteLine("No documents matched the query.");
    }
    else
    {
    foreach (var document in matchingDocuments)
    {
    Console.WriteLine(document.ToJson());
    }
    }
    }
  3. Save the file, then compile and run your project to perform the query:

    dotnet run MyCompany.Embeddings.csproj
    { "summary" : "Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.", "score" : 88.884147644042969 }
    { "summary" : "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score" : 86.136398315429688 }
    { "summary" : "Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.", "score" : 86.087783813476562 }
    { "summary" : "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score" : 85.689559936523438 }
    { "summary" : "Fully furnished 3+1 flat decorated with vintage style. Located at the heart of Moda/Kadıköy, close to seaside and also to the public transportation (tram, metro, ferry, bus stations) 10 minutes walk.", "score" : 85.614166259765625 }
1

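To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and the similarity measure as dotProduct.

  1. Create a file named create-index.go and paste the following code.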

    create-index.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "os"
    "time"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    func main() {
    ctx := context.Background()
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")
    type vectorDefinitionField struct {
    Type string `bson:"type"`
    Path string `bson:"path"`
    NumDimensions int `bson:"numDimensions"`
    Similarity string `bson:"similarity"`
    }
    type vectorDefinition struct {
    Fields []vectorDefinitionField `bson:"fields"`
    }
    indexModel := mongo.SearchIndexModel{
    Definition: vectorDefinition{
    Fields: []vectorDefinitionField{{
    Type: "vector",
    Path: "embedding",
    NumDimensions: <dimensions>,
    Similarity: "dotProduct"}},
    },
    Options: opts,
    }
    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
    log.Fatalf("failed to create the search index: %v", err)
    }
    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
    cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
    if err != nil {
    log.Fatal(fmt.Errorf("failed to list search indexes: %w", err))
    }
    if !cursor.Next(ctx) {
    break
    }
    name := cursor.Current.Lookup("name").StringValue()
    queryable := cursor.Current.Lookup("queryable").Boolean()
    if name == searchIndexName && queryable {
    doc = cursor.Current
    } else {
    time.Sleep(5 * time.Second)
    }
    }
    log.Println("Name of Index Created: " + searchIndexName)
    }
  2. Replace the <dimensions> placeholder value with 1024 if you used the open-source model, or with 1536 if you used the OpenAI model.

  3. Save the file, then run the following command:

    go run create-index.go
    2024/10/09 17:38:51 Creating the index.
    2024/10/09 17:38:52 Polling to confirm successful index creation.
    2024/10/09 17:39:22 Name of Index Created: vector_index

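Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.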
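
The polling loop in the code above keeps checking until the index reports that it is queryable. If you prefer the wait to be bounded, one option is to add a timeout. The sketch below shows the pattern in Python with an injected list_indexes callable so that it runs without a cluster. The wait_until_queryable and fake_list_indexes names are hypothetical, and the name and queryable keys are the same ones that the Go code reads. With PyMongo, the callable could wrap collection.list_search_indexes().

```python
import time
from typing import Callable, Iterable


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[dict]],
    name: str,
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the named index is queryable; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        for index in list_indexes():
            if index.get("name") == name and index.get("queryable"):
                return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in that reports the index as queryable on the third call.
calls = {"count": 0}


def fake_list_indexes():
    calls["count"] += 1
    return [{"name": "vector_index", "queryable": calls["count"] >= 3}]


ready = wait_until_queryable(fake_list_indexes, "vector_index", sleep=lambda _: None)
print("ready" if ready else "timed out")
```

Returning a boolean instead of looping forever lets the caller decide whether to retry, log a warning, or stop.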

2
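  1. Create a file named vector-query.go and paste the following code.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    The following code creates a query embedding for the string "ocean tragedy" by calling the GetEmbeddings function, passes it to the queryVector field of a $vectorSearch stage, and prints the text and score of each matching document: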

    vector-query.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    type TextAndScore struct {
    Text string `bson:"text"`
    Score float32 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Load environment variables from a .env file, if present
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    query := "ocean tragedy"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embedding"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"text", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []TextAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
    }
    }
    vector-query.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    type TextAndScore struct {
    Text string `bson:"text"`
    Score float64 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Load environment variables from a .env file, if present
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_db").Collection("embeddings")
    query := "ocean tragedy"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embedding"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"text", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []TextAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to TextAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Text: %v\nScore: %v\n", doc.Text, doc.Score)
    }
    }
  2. Save the file, then run the following command:

    go run vector-query.go
    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.0042472864
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.0031167597
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.0024476869
    go run vector-query.go
    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.4552372694015503
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.4050072133541107
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.35942140221595764
1

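To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type and the similarity measure as dotProduct.

  1. Create a file named create-index.go and paste the following code.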

    create-index.go
    package main
    import (
    "context"
    "fmt"
    "log"
    "os"
    "time"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    func main() {
    ctx := context.Background()
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    indexName := "vector_index"
    opts := options.SearchIndexes().SetName(indexName).SetType("vectorSearch")
    type vectorDefinitionField struct {
    Type string `bson:"type"`
    Path string `bson:"path"`
    NumDimensions int `bson:"numDimensions"`
    Similarity string `bson:"similarity"`
    Quantization string `bson:"quantization"`
    }
    type vectorDefinition struct {
    Fields []vectorDefinitionField `bson:"fields"`
    }
    indexModel := mongo.SearchIndexModel{
    Definition: vectorDefinition{
    Fields: []vectorDefinitionField{{
    Type: "vector",
    Path: "embeddings",
    NumDimensions: <dimensions>,
    Similarity: "dotProduct",
    Quantization: "scalar"}},
    },
    Options: opts,
    }
    log.Println("Creating the index.")
    searchIndexName, err := coll.SearchIndexes().CreateOne(ctx, indexModel)
    if err != nil {
    log.Fatalf("failed to create the search index: %v", err)
    }
    // Await the creation of the index.
    log.Println("Polling to confirm successful index creation.")
    searchIndexes := coll.SearchIndexes()
    var doc bson.Raw
    for doc == nil {
    cursor, err := searchIndexes.List(ctx, options.SearchIndexes().SetName(searchIndexName))
    if err != nil {
    log.Fatal(fmt.Errorf("failed to list search indexes: %w", err))
    }
    if !cursor.Next(ctx) {
    break
    }
    name := cursor.Current.Lookup("name").StringValue()
    queryable := cursor.Current.Lookup("queryable").Boolean()
    if name == searchIndexName && queryable {
    doc = cursor.Current
    } else {
    time.Sleep(5 * time.Second)
    }
    }
    log.Println("Name of Index Created: " + searchIndexName)
    }
  2. Replace the <dimensions> placeholder value with 1024 if you use the open-source model, or with 1536 if you use the OpenAI model.

  3. Save the file, then run the following command:

    go run create-index.go
    2024/10/10 10:03:12 Creating the index.
    2024/10/10 10:03:13 Polling to confirm successful index creation.
    2024/10/10 10:03:44 Name of Index Created: vector_index

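Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.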

2
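  1. Create a file named vector-query.go and paste the following code.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    For example, this code calls GetEmbeddings to create an embedding for the string "beach house", passes it as the queryVector of a $vectorSearch stage that runs an exact search on the embeddings field, and returns the summary and score of up to five matching documents.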

    vector-query.go (open-source model variant)
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    type SummaryAndScore struct {
    Summary string `bson:"summary"`
    Score float32 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Load environment variables from a .env file, if one exists
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    query := "beach house"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embeddings"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"summary", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []SummaryAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
    }
    }
    vector-query.go (OpenAI model variant)
    package main
    import (
    "context"
    "fmt"
    "log"
    "my-embeddings-project/common"
    "os"
    "github.com/joho/godotenv"
    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    )
    type SummaryAndScore struct {
    Summary string `bson:"summary"`
    Score float64 `bson:"score"`
    }
    func main() {
    ctx := context.Background()
    // Load environment variables from a .env file, if one exists
    if err := godotenv.Load(); err != nil {
    log.Println("no .env file found")
    }
    // Connect to your Atlas cluster
    uri := os.Getenv("ATLAS_CONNECTION_STRING")
    if uri == "" {
    log.Fatal("set your 'ATLAS_CONNECTION_STRING' environment variable.")
    }
    clientOptions := options.Client().ApplyURI(uri)
    client, err := mongo.Connect(ctx, clientOptions)
    if err != nil {
    log.Fatalf("failed to connect to the server: %v", err)
    }
    defer func() { _ = client.Disconnect(ctx) }()
    // Set the namespace
    coll := client.Database("sample_airbnb").Collection("listingsAndReviews")
    query := "beach house"
    queryEmbedding := common.GetEmbeddings([]string{query})
    pipeline := mongo.Pipeline{
    bson.D{
    {"$vectorSearch", bson.D{
    {"queryVector", queryEmbedding[0]},
    {"index", "vector_index"},
    {"path", "embeddings"},
    {"exact", true},
    {"limit", 5},
    }},
    },
    bson.D{
    {"$project", bson.D{
    {"_id", 0},
    {"summary", 1},
    {"score", bson.D{
    {"$meta", "vectorSearchScore"},
    }},
    }},
    },
    }
    // Run the pipeline
    cursor, err := coll.Aggregate(ctx, pipeline)
    if err != nil {
    log.Fatalf("failed to run aggregation: %v", err)
    }
    defer func() { _ = cursor.Close(ctx) }()
    var matchingDocs []SummaryAndScore
    if err = cursor.All(ctx, &matchingDocs); err != nil {
    log.Fatalf("failed to unmarshal results to SummaryAndScore objects: %v", err)
    }
    for _, doc := range matchingDocs {
    fmt.Printf("Summary: %v\nScore: %v\n", doc.Summary, doc.Score)
    }
    }
  2. Save the file, then run the following command:

    go run vector-query.go
    Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.
    Score: 0.0045180833
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.004480799
    Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.
    Score: 0.0042421296
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.004227752
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.0042201905
    go run vector-query.go
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.4832950830459595
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.48093676567077637
    Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!
    Score: 0.4629695415496826
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.45800843834877014
    Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.
    Score: 0.45398443937301636
1

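To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and dotProduct as the similarity measure.

  1. Create a file named CreateIndex.java and paste the following code: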

    CreateIndex.java
    import com.mongodb.MongoException;
    import com.mongodb.client.ListSearchIndexesIterable;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.SearchIndexModel;
    import com.mongodb.client.model.SearchIndexType;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.Collections;
    import java.util.List;
    public class CreateIndex {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_db");
    MongoCollection<Document> collection = database.getCollection("embeddings");
    // define the index details
    String indexName = "vector_index";
    int dimensionsHuggingFaceModel = 1024;
    int dimensionsOpenAiModel = 1536;
    Bson definition = new Document(
    "fields",
    Collections.singletonList(
    new Document("type", "vector")
    .append("path", "embedding")
    .append("numDimensions", <dimensions>) // replace with var for the model used
    .append("similarity", "dotProduct")));
    // define the index model using the specified details
    SearchIndexModel indexModel = new SearchIndexModel(
    indexName,
    definition,
    SearchIndexType.vectorSearch());
    // Create the index using the model
    try {
    List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
    System.out.println("Successfully created a vector index named: " + result);
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    // Wait for Atlas to build the index and make it queryable
    System.out.println("Polling to confirm the index has completed building.");
    System.out.println("It may take up to a minute for the index to build before you can query using it.");
    ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
    Document doc = null;
    while (doc == null) {
    try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
    if (!cursor.hasNext()) {
    break;
    }
    Document current = cursor.next();
    String name = current.getString("name");
    boolean queryable = current.getBoolean("queryable");
    if (name.equals(indexName) && queryable) {
    doc = current;
    } else {
    Thread.sleep(500);
    }
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    }
    System.out.println(indexName + " index is ready to query");
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
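  2. Replace the <dimensions> placeholder value with the appropriate variable for the model you use:

    • dimensionsHuggingFaceModel: 1024 dimensions ("mixedbread-ai/mxbai-embed-large-v1" model)

    • dimensionsOpenAiModel: 1536 dimensions ("text-embedding-3-small" model)

    Note

    The number of dimensions is determined by the model that generates the embeddings. If you adapt this code to use a different model, make sure that you pass the correct value to numDimensions. See also the "Choose an Embedding Model" section.

  3. Save and run the file. The output resembles the following: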

    Successfully created a vector index named: [vector_index]
    Polling to confirm the index has completed building.
    It may take up to a minute for the index to build before you can query using it.
    vector_index index is ready to query

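Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.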

2
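  1. Create a file named VectorQuery.java and paste the following code.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    For example, this code uses EmbeddingProvider to create an embedding for the string "ocean tragedy", passes it to a vectorSearch stage that runs an exact search on the embedding field, and returns the text and score of up to five matching documents.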

    VectorQuery.java
    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.search.FieldSearchPath;
    import org.bson.BsonArray;
    import org.bson.BsonValue;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.ArrayList;
    import java.util.List;
    import static com.mongodb.client.model.Aggregates.project;
    import static com.mongodb.client.model.Aggregates.vectorSearch;
    import static com.mongodb.client.model.Projections.exclude;
    import static com.mongodb.client.model.Projections.fields;
    import static com.mongodb.client.model.Projections.include;
    import static com.mongodb.client.model.Projections.metaVectorSearchScore;
    import static com.mongodb.client.model.search.SearchPath.fieldPath;
    import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
    import static java.util.Arrays.asList;
    public class VectorQuery {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_db");
    MongoCollection<Document> collection = database.getCollection("embeddings");
    // define $vectorSearch query options
    String query = "ocean tragedy";
    EmbeddingProvider embeddingProvider = new EmbeddingProvider();
    BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
    List<Double> embedding = new ArrayList<>();
    for (BsonValue value : embeddingBsonArray.stream().toList()) {
    embedding.add(value.asDouble().getValue());
    }
    // define $vectorSearch pipeline
    String indexName = "vector_index";
    FieldSearchPath fieldSearchPath = fieldPath("embedding");
    int limit = 5;
    List<Bson> pipeline = asList(
    vectorSearch(
    fieldSearchPath,
    embedding,
    indexName,
    limit,
    exactVectorSearchOptions()
    ),
    project(
    fields(exclude("_id"), include("text"),
    metaVectorSearchScore("score"))));
    // run query and print results
    List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
    if (results.isEmpty()) {
    System.out.println("No results found.");
    } else {
    results.forEach(doc -> {
    System.out.println("Text: " + doc.getString("text"));
    System.out.println("Score: " + doc.getDouble("score"));
    });
    }
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
  2. Save and run the file. The output resembles one of the following, depending on the model you use:

    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.004247286356985569
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.003116759704425931
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.002447686856612563
    Text: Titanic: The story of the 1912 sinking of the largest luxury liner ever built
    Score: 0.45522359013557434
    Text: Avatar: A marine is dispatched to the moon Pandora on a unique mission
    Score: 0.4049977660179138
    Text: The Lion King: Lion cub and future king Simba searches for his identity
    Score: 0.35942474007606506
1

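To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embeddings field as the vector type, dotProduct as the similarity measure, and scalar quantization.

  1. Create a file named CreateIndex.java and paste the following code: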

    CreateIndex.java
    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.ListSearchIndexesIterable;
    import com.mongodb.client.MongoCursor;
    import com.mongodb.client.model.SearchIndexModel;
    import com.mongodb.client.model.SearchIndexType;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.Collections;
    import java.util.List;
    public class CreateIndex {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
    MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
    // define the index details
    String indexName = "vector_index";
    int dimensionsHuggingFaceModel = 1024;
    int dimensionsOpenAiModel = 1536;
    Bson definition = new Document(
    "fields",
    Collections.singletonList(
    new Document("type", "vector")
    .append("path", "embeddings")
    .append("numDimensions", <dimensions>) // replace with var for the model used
    .append("similarity", "dotProduct")
    .append("quantization", "scalar")));
    // define the index model using the specified details
    SearchIndexModel indexModel = new SearchIndexModel(
    indexName,
    definition,
    SearchIndexType.vectorSearch());
    // create the index using the model
    try {
    List<String> result = collection.createSearchIndexes(Collections.singletonList(indexModel));
    System.out.println("Successfully created a vector index named: " + result);
    System.out.println("It may take up to a minute for the index to build before you can query using it.");
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    // wait for Atlas to build the index and make it queryable
    System.out.println("Polling to confirm the index has completed building.");
    ListSearchIndexesIterable<Document> searchIndexes = collection.listSearchIndexes();
    Document doc = null;
    while (doc == null) {
    try (MongoCursor<Document> cursor = searchIndexes.iterator()) {
    if (!cursor.hasNext()) {
    break;
    }
    Document current = cursor.next();
    String name = current.getString("name");
    boolean queryable = current.getBoolean("queryable");
    if (name.equals(indexName) && queryable) {
    doc = current;
    } else {
    Thread.sleep(500);
    }
    } catch (Exception e) {
    throw new RuntimeException(e);
    }
    }
    System.out.println(indexName + " index is ready to query");
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
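  2. Replace the <dimensions> placeholder value with the appropriate variable for the model you use:

    • dimensionsHuggingFaceModel: 1024 dimensions (open-source model)

    • dimensionsOpenAiModel: 1536 dimensions

    Note

    The number of dimensions is determined by the model that generates the embeddings. If you use a different model, make sure that you pass the correct value to numDimensions. See also the "Choose an Embedding Model" section.

  3. Save and run the file. The output resembles the following: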

    Successfully created a vector index named: [vector_index]
    It may take up to a minute for the index to build before you can query using it.
    Polling to confirm the index has completed building.
    vector_index index is ready to query

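Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.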

2
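  1. Create a file named VectorQuery.java and paste the following code.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    For example, this code uses EmbeddingProvider to create an embedding for the string "beach house", passes it to a vectorSearch stage that runs an exact search on the embeddings field, and returns the summary and score of up to five matching documents.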

    VectorQuery.java
    import com.mongodb.MongoException;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.search.FieldSearchPath;
    import org.bson.BsonArray;
    import org.bson.BsonValue;
    import org.bson.Document;
    import org.bson.conversions.Bson;
    import java.util.ArrayList;
    import java.util.List;
    import static com.mongodb.client.model.Aggregates.project;
    import static com.mongodb.client.model.Aggregates.vectorSearch;
    import static com.mongodb.client.model.Projections.exclude;
    import static com.mongodb.client.model.Projections.fields;
    import static com.mongodb.client.model.Projections.include;
    import static com.mongodb.client.model.Projections.metaVectorSearchScore;
    import static com.mongodb.client.model.search.SearchPath.fieldPath;
    import static com.mongodb.client.model.search.VectorSearchOptions.exactVectorSearchOptions;
    import static java.util.Arrays.asList;
    public class VectorQuery {
    public static void main(String[] args) {
    String uri = System.getenv("ATLAS_CONNECTION_STRING");
    if (uri == null || uri.isEmpty()) {
    throw new IllegalStateException("ATLAS_CONNECTION_STRING env variable is not set or is empty.");
    }
    // establish connection and set namespace
    try (MongoClient mongoClient = MongoClients.create(uri)) {
    MongoDatabase database = mongoClient.getDatabase("sample_airbnb");
    MongoCollection<Document> collection = database.getCollection("listingsAndReviews");
    // define the query and get the embedding
    String query = "beach house";
    EmbeddingProvider embeddingProvider = new EmbeddingProvider();
    BsonArray embeddingBsonArray = embeddingProvider.getEmbedding(query);
    List<Double> embedding = new ArrayList<>();
    for (BsonValue value : embeddingBsonArray.stream().toList()) {
    embedding.add(value.asDouble().getValue());
    }
    // define $vectorSearch pipeline
    String indexName = "vector_index";
    FieldSearchPath fieldSearchPath = fieldPath("embeddings");
    int limit = 5;
    List<Bson> pipeline = asList(
    vectorSearch(
    fieldSearchPath,
    embedding,
    indexName,
    limit,
    exactVectorSearchOptions()),
    project(
    fields(exclude("_id"), include("summary"),
    metaVectorSearchScore("score"))));
    // run query and print results
    List<Document> results = collection.aggregate(pipeline).into(new ArrayList<>());
    if (results.isEmpty()) {
    System.out.println("No results found.");
    } else {
    results.forEach(doc -> {
    System.out.println("Summary: " + doc.getString("summary"));
    System.out.println("Score: " + doc.getDouble("score"));
    });
    }
    } catch (MongoException me) {
    throw new RuntimeException("Failed to connect to MongoDB ", me);
    } catch (Exception e) {
    throw new RuntimeException("Operation failed: ", e);
    }
    }
    }
  2. Save and run the file. The output resembles one of the following, depending on the model you use:

    Summary: Near to underground metro station. Walking distance to seaside. 2 floors 1 entry. Husband, wife, girl and boy is living.
    Score: 0.004518083296716213
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.0044807991944253445
    Summary: Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.
    Score: 0.004242129623889923
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.004227751865983009
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.004220190457999706
    Summary: A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.
    Score: 0.4832950830459595
    Summary: Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.
    Score: 0.48092085123062134
    Summary: THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!
    Score: 0.4629460275173187
    Summary: A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.
    Score: 0.4581468403339386
    Summary: The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.
    Score: 0.45398443937301636
1

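To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_db.embeddings collection that specifies the embedding field as the vector type and dotProduct as the similarity measure.

  1. Create a file named create-index.js and paste the following code.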

    create-index.js
    import { MongoClient } from 'mongodb';
    // connect to your Atlas deployment
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    const database = client.db("sample_db");
    const collection = database.collection("embeddings");
    // define your Atlas Vector Search index
    const index = {
    name: "vector_index",
    type: "vectorSearch",
    definition: {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": <dimensions>
    }
    ]
    }
    }
    // run the helper method
    const result = await collection.createSearchIndex(index);
    console.log(result);
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. Replace the <dimensions> placeholder value with 768 if you use the open-source model, or with 1536 if you use the OpenAI model.

  3. Save the file, then run the following command:

    node --env-file=.env create-index.js

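Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.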
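Unlike the Go and Java examples, create-index.js returns as soon as the index is requested and does not wait for the build to finish. A minimal sketch of a polling helper follows. The name waitForQueryable, its options, and the injected listIndexes callback are hypothetical and not part of the tutorial; the helper only assumes that each listed index document has name and queryable fields, as the Go and Java examples on this page read them.

```javascript
// Poll until the named search index reports queryable: true, or time out.
// `listIndexes` is any async function that resolves to an array of index
// documents. It is injected so the helper can run without a live cluster.
async function waitForQueryable(
  listIndexes,
  indexName,
  { intervalMs = 5000, timeoutMs = 120000 } = {}
) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const indexes = await listIndexes();
    const match = indexes.find((idx) => idx.name === indexName);
    if (match && match.queryable === true) {
      return match;
    }
    if (Date.now() >= deadline) {
      throw new Error(
        `Index "${indexName}" was not queryable after ${timeoutMs} ms`
      );
    }
    // Wait before asking again so the loop does not hammer the server.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In create-index.js, you could call it after createSearchIndex, for example as `await waitForQueryable(() => collection.listSearchIndexes("vector_index").toArray(), "vector_index")`. The timeout keeps a failed build from blocking the script indefinitely.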

2
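  1. Create a file named vector-query.js and paste the following code.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    For example, this code calls getEmbedding to create an embedding for the string "ocean tragedy", passes it as the queryVector of a $vectorSearch stage that runs an exact search on the embedding field, and returns the text and score of up to five matching documents.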

    vector-query.js
    import { MongoClient } from 'mongodb';
    import { getEmbedding } from './get-embeddings.js';
    // MongoDB connection URI and options
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    // Connect to the MongoDB client
    await client.connect();
    // Specify the database and collection
    const database = client.db("sample_db");
    const collection = database.collection("embeddings");
    // Generate embedding for the search query
    const queryEmbedding = await getEmbedding("ocean tragedy");
    // Define the sample vector search pipeline
    const pipeline = [
    {
    $vectorSearch: {
    index: "vector_index",
    queryVector: queryEmbedding,
    path: "embedding",
    exact: true,
    limit: 5
    }
    },
    {
    $project: {
    _id: 0,
    text: 1,
    score: {
    $meta: "vectorSearchScore"
    }
    }
    }
    ];
    // run pipeline
    const result = collection.aggregate(pipeline);
    // print results
    for await (const doc of result) {
    console.dir(JSON.stringify(doc));
    }
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. Save the file, then run the following command:

    node --env-file=.env vector-query.js
    '{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.5103757977485657}'
    '{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4616812467575073}'
    '{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.4115804433822632}'
    node --env-file=.env vector-query.js
    {"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
    {"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
    {"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
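The query above sets exact: true, which compares the query vector against every indexed vector. On larger collections you may prefer an approximate search, which omits exact and sets numCandidates instead. The following sketch builds either form of the pipeline. The function buildVectorSearchPipeline and its parameter names are hypothetical helpers for illustration; only the $vectorSearch and $project field names come from the query shown above, plus numCandidates, which the stage requires for approximate search and which cannot be less than limit.

```javascript
// Build a two-stage pipeline: $vectorSearch followed by $project.
// exact = true runs an exact search; exact = false runs an approximate
// search and requires numCandidates.
function buildVectorSearchPipeline({
  index,
  path,
  queryVector,
  limit,
  exact = true,
  numCandidates,
  fields = [],
}) {
  if (!Array.isArray(queryVector) || queryVector.length === 0) {
    throw new TypeError("queryVector must be a non-empty array of numbers");
  }
  const stage = { index, path, queryVector, limit };
  if (exact) {
    stage.exact = true;
  } else {
    if (!Number.isInteger(numCandidates) || numCandidates < limit) {
      throw new RangeError("numCandidates must be an integer >= limit");
    }
    stage.numCandidates = numCandidates;
  }
  // Return the requested fields plus the similarity score, without _id.
  const projection = { _id: 0, score: { $meta: "vectorSearchScore" } };
  for (const field of fields) {
    projection[field] = 1;
  }
  return [{ $vectorSearch: stage }, { $project: projection }];
}
```

For example, `buildVectorSearchPipeline({ index: "vector_index", path: "embedding", queryVector: queryEmbedding, limit: 5, exact: false, numCandidates: 100, fields: ["text"] })` returns a pipeline that you can pass to collection.aggregate. The value 100 is only a placeholder; a larger numCandidates generally trades latency for recall.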
1

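To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

Complete the following steps to create an index on the sample_airbnb.listingsAndReviews collection that specifies the embedding field as the vector type, dotProduct as the similarity measure, and scalar quantization.

  1. Create a file named create-index.js and paste the following code.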

    create-index.js
    import { MongoClient } from 'mongodb';
    // connect to your Atlas deployment
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    const database = client.db("sample_airbnb");
    const collection = database.collection("listingsAndReviews");
    // Define your Atlas Vector Search index
    const index = {
    name: "vector_index",
    type: "vectorSearch",
    definition: {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": <dimensions>,
    "quantization": "scalar"
    }
    ]
    }
    }
    // Call the method to create the index
    const result = await collection.createSearchIndex(index);
    console.log(result);
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. Replace the <dimensions> placeholder value with 768 if you use the open-source model, or with 1536 if you use the OpenAI model.

  3. Save the file, then run the following command:

    node --env-file=.env create-index.js

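Note

The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.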
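The numDimensions value in the index must match the length of the vectors your model produces, so a hardcoded 768 or 1536 becomes wrong if you switch models. One way to avoid the mismatch is to derive the value from a sample embedding. The sketch below assumes a hypothetical helper named buildVectorIndex; the shape of the object it returns mirrors the index definition in create-index.js, without the optional quantization field.

```javascript
// Derive numDimensions from a sample embedding instead of hardcoding it.
// Returns an index description in the same shape that create-index.js
// passes to collection.createSearchIndex.
function buildVectorIndex(
  sampleEmbedding,
  { name = "vector_index", path = "embedding", similarity = "dotProduct" } = {}
) {
  const isNumericVector =
    Array.isArray(sampleEmbedding) &&
    sampleEmbedding.length > 0 &&
    sampleEmbedding.every((value) => Number.isFinite(value));
  if (!isNumericVector) {
    throw new TypeError("sampleEmbedding must be a non-empty array of finite numbers");
  }
  return {
    name,
    type: "vectorSearch",
    definition: {
      fields: [
        {
          type: "vector",
          path,
          similarity,
          numDimensions: sampleEmbedding.length,
        },
      ],
    },
  };
}
```

In create-index.js, you could build the index with `buildVectorIndex(await getEmbedding("sample text"))`, using the getEmbedding function that vector-query.js imports from get-embeddings.js. This costs one extra embedding call but keeps the index and the model in agreement.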

2
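  1. Create a file named vector-query.js and paste the following code.

    To run a vector search query, generate a query vector to pass into your aggregation pipeline.

    For example, the following code generates an embedding for the query string beach house, passes it as the queryVector of an exact nearest neighbor (ENN) $vectorSearch on the embedding field, and returns the summary and search score of the five closest documents.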

    vector-query.js
    import { MongoClient } from 'mongodb';
    import { getEmbedding } from './get-embeddings.js';
    // MongoDB connection URI and options
    const client = new MongoClient(process.env.ATLAS_CONNECTION_STRING);
    async function run() {
    try {
    // Connect to the MongoDB client
    await client.connect();
    // Specify the database and collection
    const database = client.db("sample_airbnb");
    const collection = database.collection("listingsAndReviews");
    // Generate embedding for the search query
    const queryEmbedding = await getEmbedding("beach house");
    // Define the sample vector search pipeline
    const pipeline = [
    {
    $vectorSearch: {
    index: "vector_index",
    queryVector: queryEmbedding,
    path: "embedding",
    exact: true,
    limit: 5
    }
    },
    {
    $project: {
    _id: 0,
    summary: 1,
    score: {
    $meta: "vectorSearchScore"
    }
    }
    }
    ];
    // run pipeline
    const result = collection.aggregate(pipeline);
    // print results
    for await (const doc of result) {
    console.dir(JSON.stringify(doc));
    }
    } finally {
    await client.close();
    }
    }
    run().catch(console.dir);
  2. Save the file, then run the following command:

    node --env-file=.env vector-query.js
    '{"summary":"Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.","score":0.5334879159927368}'
    '{"summary":"A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.","score":0.5240535736083984}'
    '{"summary":"The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.","score":0.5232879519462585}'
    '{"summary":"Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.","score":0.5186381340026855}'
    '{"summary":"A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.","score":0.5078228116035461}'
    node --env-file=.env vector-query.js
    {"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
    {"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
    {"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
    {"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
    {"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
1

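To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

  1. Paste the following code into your notebook.

    This code creates an index on your collection and specifies the following:

    • The BSON-Float32-Embedding, BSON-Int8-Embedding, and BSON-Int1-Embedding fields as the vector type fields.

    • euclidean as the similarity function for the int1 embeddings, and dotProduct as the similarity function for the float32 and int8 embeddings.

    • 768 as the number of dimensions in the embeddings.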

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "BSON-Float32-Embedding",
    "similarity": "dotProduct",
    "numDimensions": 768
    },
    {
    "type": "vector",
    "path": "BSON-Int8-Embedding",
    "similarity": "dotProduct",
    "numDimensions": 768
    },
    {
    "type": "vector",
    "path": "BSON-Int1-Embedding",
    "similarity": "euclidean",
    "numDimensions": 768
    }
    ]
    },
    name="vector_index",
    type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)
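  2. Run the code.

    The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.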
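Because the index is not queryable until the build finishes, it can help to wait for it before running queries. The following is a minimal sketch, not part of the original tutorial. It assumes that `collection` is a PyMongo collection exposing `list_search_indexes()` and that each returned index document carries a boolean `queryable` field. The function name, timeout, and polling interval are illustrative choices.

```python
import time


def wait_until_queryable(collection, index_name, timeout_s=300.0, poll_interval_s=5.0):
    """Poll a search index until it is reported as queryable.

    Assumes `collection` exposes list_search_indexes(name), as PyMongo
    collections do. Returns True once the index is queryable, or False
    if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

For example, you might call `wait_until_queryable(collection, "vector_index")` right after `create_search_index` and only run the query step when it returns True.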

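To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

  1. Paste the following code into your notebook.

    This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 1536 as the number of dimensions. It also enables automatic scalar quantization for the vectors in the embedding field so that vector data is processed efficiently.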

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": 1536,
    "quantization": "scalar"
    }
    ]
    },
    name="vector_index",
    type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)
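  2. Run the code.

    The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.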
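The index above uses dotProduct similarity. A dot product matches cosine similarity only when both vectors have unit length, so it may be worth confirming that your embeddings are normalized before you rely on the scores. The following is a minimal sketch in plain Python. The helper names are illustrative and are not part of the tutorial or of any MongoDB library.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector given as a sequence of numbers."""
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector):
    """Return a unit-length copy of the vector."""
    norm = l2_norm(vector)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def is_unit_length(vector, tolerance=1e-6):
    """True if the vector's length is within `tolerance` of 1.0."""
    return math.isclose(l2_norm(vector), 1.0, abs_tol=tolerance)
```

If `is_unit_length(embedding)` is False for the output of your embedding function, you could apply `normalize` to both the stored embeddings and the query embedding, or choose a different similarity function for the index.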

2

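To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, the following code generates float32, int8, and int1 embeddings for the query string, converts each embedding to a BSON vector, and runs an exact nearest neighbor (ENN) search against the matching field. Each search returns the five closest documents with their search scores.

Note

The query might take some time to complete.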

import pymongo
from bson.binary import Binary, BinaryVectorDtype
# get_embedding and generate_bson_vector are assumed to be the helper
# functions defined in the earlier steps of this tutorial
# Prepare your query
query_text = "ocean tragedy"
# Generate embedding for the search query
query_float32_embeddings = get_embedding(query_text, precision="float32")
query_int8_embeddings = get_embedding(query_text, precision="int8")
query_int1_embeddings = get_embedding(query_text, precision="ubinary")
# Convert each embedding to BSON format
query_bson_float32_embeddings = generate_bson_vector(query_float32_embeddings, BinaryVectorDtype.FLOAT32)
query_bson_int8_embeddings = generate_bson_vector(query_int8_embeddings, BinaryVectorDtype.INT8)
query_bson_int1_embeddings = generate_bson_vector(query_int1_embeddings, BinaryVectorDtype.PACKED_BIT)
# Define vector search pipeline for each precision
pipelines = []
for query_embedding, path in zip(
    [query_bson_float32_embeddings, query_bson_int8_embeddings, query_bson_int1_embeddings],
    ["BSON-Float32-Embedding", "BSON-Int8-Embedding", "BSON-Int1-Embedding"]
):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",  # Adjust if necessary
                "queryVector": query_embedding,
                "path": path,
                "exact": True,
                "limit": 5
            }
        },
        {
            "$project": {
                "_id": 0,
                "data": 1,
                "score": {
                    "$meta": "vectorSearchScore"
                }
            }
        }
    ]
    pipelines.append(pipeline)
# Execute the search for each precision
for pipeline in pipelines:
    print(f"Results for {pipeline[0]['$vectorSearch']['path']}:")
    results = collection.aggregate(pipeline)
    # Print results
    for i in results:
        print(i)
Results for BSON-Float32-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.7661113739013672}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.7050272822380066}
{'data': 'The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.', 'score': 0.7024770379066467}
{'data': 'Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.', 'score': 0.7011005282402039}
{'data': 'E.T. the Extra-Terrestrial: A young boy befriends an alien stranded on Earth and helps him return home.', 'score': 0.6877288818359375}
Results for BSON-Int8-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.5}
{'data': 'The Lion King: Lion cub and future king Simba searches for his identity', 'score': 0.5}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.5}
{'data': "Inception: A skilled thief is given a chance at redemption if he can successfully implant an idea into a person's subconscious.", 'score': 0.5}
{'data': 'The Godfather: The aging patriarch of a powerful crime family transfers control of his empire to his reluctant son.', 'score': 0.5}
Results for BSON-Int1-Embedding:
{'data': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.671875}
{'data': 'The Shawshank Redemption: A banker is sentenced to life in Shawshank State Penitentiary for the murders of his wife and her lover.', 'score': 0.6484375}
{'data': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.640625}
{'data': 'Back to the Future: A teenager accidentally travels back in time and must ensure his parents fall in love.', 'score': 0.6145833134651184}
{'data': 'Jurassic Park: Scientists clone dinosaurs to populate an island theme park, which soon goes awry.', 'score': 0.61328125}
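One way to compare the three result sets above is to measure how many of the float32 results also appear in the lower-precision results. The following is a minimal sketch. The `overlap_at_k` helper is hypothetical, and the titles are abbreviated from the sample output above.

```python
def overlap_at_k(baseline, candidate):
    """Fraction of the baseline results that also appear in the candidate results."""
    if not baseline:
        raise ValueError("baseline must not be empty")
    return len(set(baseline) & set(candidate)) / len(baseline)


# Titles abbreviated from the sample output above.
float32_top5 = ["Titanic", "Avatar", "The Shawshank Redemption",
                "Jurassic Park", "E.T. the Extra-Terrestrial"]
int8_top5 = ["Titanic", "The Lion King", "Avatar", "Inception", "The Godfather"]
int1_top5 = ["Titanic", "The Shawshank Redemption", "Avatar",
             "Back to the Future", "Jurassic Park"]

print(overlap_at_k(float32_top5, int8_top5))  # prints 0.4
print(overlap_at_k(float32_top5, int1_top5))  # prints 0.8
```

A single query is too small a sample to draw conclusions from. Averaging this measure over a set of representative queries would give a more useful picture of how a given precision affects your results.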
# Generate embedding for the search query
query_embedding = get_embedding("ocean tragedy")
# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "exact": True,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "text": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
    print(i)
{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}
{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}
{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}
1

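To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

  1. Paste the following code into your notebook.

    This code creates an index on your collection and specifies the following:

    • The BSON-Float32-Embedding, BSON-Int8-Embedding, and BSON-Int1-Embedding fields as the vector type fields.

    • euclidean as the similarity function for the int1 embeddings, and dotProduct as the similarity function for the float32 and int8 embeddings.

    • 768 as the number of dimensions in the embeddings.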

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "BSON-Float32-Embedding",
    "similarity": "dotProduct",
    "numDimensions": 768
    },
    {
    "type": "vector",
    "path": "BSON-Int8-Embedding",
    "similarity": "dotProduct",
    "numDimensions": 768
    },
    {
    "type": "vector",
    "path": "BSON-Int1-Embedding",
    "similarity": "euclidean",
    "numDimensions": 768
    }
    ]
    },
    name="vector_index",
    type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)
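  2. Run the code.

    The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.

To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.

  1. Paste the following code into your notebook.

    This code creates an index on your collection that specifies the embedding field as the vector type, dotProduct as the similarity function, and 1536 as the number of dimensions. It also enables automatic scalar quantization for the vectors in the embedding field so that vector data is processed efficiently.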

    from pymongo.operations import SearchIndexModel
    # Create your index model, then create the search index
    search_index_model = SearchIndexModel(
    definition = {
    "fields": [
    {
    "type": "vector",
    "path": "embedding",
    "similarity": "dotProduct",
    "numDimensions": 1536,
    "quantization": "scalar"
    }
    ]
    },
    name="vector_index",
    type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)
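  2. Run the code.

    The index takes about one minute to build. While it builds, the index is in an initial sync state. When the build finishes, you can start querying the data in your collection.

To learn more, see Create an Atlas Vector Search Index.

2

To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, the following code generates float32, int8, and int1 embeddings for the query string, converts each embedding to a BSON vector, and runs an exact nearest neighbor (ENN) search against the matching field. Each search returns the five closest documents with their search scores.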

import pymongo
from bson.binary import Binary, BinaryVectorDtype
# get_embedding and generate_bson_vector are assumed to be the helper
# functions defined in the earlier steps of this tutorial
# Prepare your query
query_text = "beach house"
# Generate embedding for the search query
query_float32_embeddings = get_embedding(query_text, precision="float32")
query_int8_embeddings = get_embedding(query_text, precision="int8")
query_int1_embeddings = get_embedding(query_text, precision="ubinary")
# Convert each embedding to BSON format
query_bson_float32_embeddings = generate_bson_vector(query_float32_embeddings, BinaryVectorDtype.FLOAT32)
query_bson_int8_embeddings = generate_bson_vector(query_int8_embeddings, BinaryVectorDtype.INT8)
query_bson_int1_embeddings = generate_bson_vector(query_int1_embeddings, BinaryVectorDtype.PACKED_BIT)
# Define vector search pipeline for each precision
pipelines = []
for query_embedding, path in zip(
    [query_bson_float32_embeddings, query_bson_int8_embeddings, query_bson_int1_embeddings],
    ["BSON-Float32-Embedding", "BSON-Int8-Embedding", "BSON-Int1-Embedding"]
):
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",  # Adjust if necessary
                "queryVector": query_embedding,
                "path": path,
                "exact": True,
                "limit": 5
            }
        },
        {
            "$project": {
                "_id": 0,
                "summary": 1,
                "score": {
                    "$meta": "vectorSearchScore"
                }
            }
        }
    ]
    pipelines.append(pipeline)
# Execute the search for each precision
for pipeline in pipelines:
    print(f"Results for {pipeline[0]['$vectorSearch']['path']}:")
    results = collection.aggregate(pipeline)
    # Print results
    for i in results:
        print(i)
Results for BSON-Float32-Embedding:
{'summary': 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.', 'score': 0.7847104072570801}
{'summary': 'The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.', 'score': 0.7780507802963257}
{'summary': "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", 'score': 0.7723637223243713}
{'summary': 'Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.', 'score': 0.7665778398513794}
{'summary': 'A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.', 'score': 0.7593404650688171}
Results for BSON-Int8-Embedding:
{'summary': 'Fantastic duplex apartment with three bedrooms, located in the historic area of Porto, Ribeira (Cube) - UNESCO World Heritage Site. Centenary building fully rehabilitated, without losing their original character.', 'score': 0.5}
{'summary': 'One bedroom + sofa-bed in quiet and bucolic neighbourhood right next to the Botanical Garden. Small garden, outside shower, well equipped kitchen and bathroom with shower and tub. Easy for transport with many restaurants and basic facilities in the area.', 'score': 0.5}
{'summary': "A short distance from Honolulu's billion dollar mall, and the same distance to Waikiki. Parking included. A great location that work perfectly for business, education, or simple visit. Experience Yacht Harbor views and 5 Star Hilton Hawaiian Village.", 'score': 0.5}
{'summary': 'Here exists a very cozy room for rent in a shared 4-bedroom apartment. It is located one block off of the JMZ at Myrtle Broadway. The neighborhood is diverse and appeals to a variety of people.', 'score': 0.5}
{'summary': 'Quarto com vista para a Lagoa Rodrigo de Freitas, cartão postal do Rio de Janeiro. Linda Vista. 1 Quarto e 1 banheiro Amplo, arejado, vaga na garagem. Prédio com piscina, sauna e playground. Fácil acesso, próximo da praia e shoppings.', 'score': 0.5}
Results for BSON-Int1-Embedding:
{'summary': 'Having a large airy living room. The apartment is well divided. Fully furnished and cozy. The building has a 24h doorman and camera services in the corridors. It is very well located, close to the beach, restaurants, pubs and several shops and supermarkets. And it offers a good mobility being close to the subway.', 'score': 0.6901041865348816}
{'summary': 'Cozy and comfortable apartment. Ideal for families and vacations. 3 bedrooms, 2 of them suites. Located 20-min walk to the beach and close to the Rio 2016 Olympics Venues. Situated in a modern and secure condominium, with many entertainment available options around.', 'score': 0.6731770634651184}
{'summary': "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", 'score': 0.6731770634651184}
{'summary': 'The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.', 'score': 0.671875}
{'summary': 'THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!', 'score': 0.6705729365348816}

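To run a vector search query, generate a query vector to pass into your aggregation pipeline.

For example, the following code generates an embedding for the query string beach house, passes it as the queryVector of an exact nearest neighbor (ENN) $vectorSearch on the embedding field, and returns the summary and search score of the five closest documents.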

# Generate embedding for the search query
query_embedding = get_embedding("beach house")
# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "exact": True,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "summary": 1,
            "score": {
                "$meta": "vectorSearchScore"
            }
        }
    }
]
# Execute the search
results = collection.aggregate(pipeline)
# Print results
for i in results:
    print(i)
{"summary": "A friendly apartment block where everyone knows each other and there is a strong communal vibe. Property has a huge backyard with vege garden and skate ramp. 7min walk to the beach and 2min to buses.", "score": 0.483333021402359}
{"summary": "Room 2 Private room in charming recently renovated federation guest house at Coogee Beach. Prices are per room for 2 People only. A queen and a single bed. Not suitable for group booking All rooms have TV, desk, wardrobe, beds, unlimited wifi 2 mins from the beach, cafes and transport. This is not a party house but a safe and clean place to stay. Share bathrooms and kitchen... All common areas are cleaned daily.", "score": 0.48092877864837646}
{"summary": "THIS IS A VERY SPACIOUS 1 BEDROOM FULL CONDO (SLEEPS 4) AT THE BEAUTIFUL VALLEY ISLE RESORT ON THE BEACH IN LAHAINA, MAUI!! YOU WILL LOVE THE PERFECT LOCATION OF THIS VERY NICE HIGH RISE! ALSO THIS SPACIOUS FULL CONDO, FULL KITCHEN, BIG BALCONY!!", "score": 0.46294474601745605}
{"summary": "A beautiful and comfortable 1 Bedroom Air Conditioned Condo in Makaha Valley - stunning Ocean & Mountain views All the amenities of home, suited for longer stays. Full kitchen & large bathroom. Several gas BBQ's for all guests to use & a large heated pool surrounded by reclining chairs to sunbathe. The Ocean you see in the pictures is not even a mile away, known as the famous Makaha Surfing Beach. Golfing, hiking,snorkeling paddle boarding, surfing are all just minutes from the front door.", "score": 0.4580020606517792}
{"summary": "The Apartment has a living room, toilet, bedroom (suite) and American kitchen. Well located, on the Copacabana beach block a 05 Min. walk from Ipanema beach (Arpoador). Internet wifi, cable tv, air conditioning in the bedroom, ceiling fans in the bedroom and living room, kitchen with microwave, cooker, Blender, dishes, cutlery and service area with fridge, washing machine, clothesline for drying clothes and closet with several utensils for use. The property boasts 45 m2.", "score": 0.45400717854499817}
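The query pipelines in this section all share the same shape, so they can be produced by a small helper. The following is a minimal sketch under stated assumptions. The function name, its parameters, and the default of 20 candidates per returned document are illustrative choices rather than part of the tutorial. It also shows the approximate nearest neighbor (ANN) variant, which uses numCandidates in place of exact.

```python
def build_vector_search_pipeline(query_vector, path, index="vector_index",
                                 limit=5, exact=True, num_candidates=None,
                                 project_field="summary"):
    """Return an aggregation pipeline for a $vectorSearch query.

    exact=True requests an exact nearest neighbor (ENN) search.
    exact=False requests an approximate (ANN) search, which needs
    numCandidates to be set to at least `limit`.
    """
    stage = {
        "index": index,
        "queryVector": query_vector,
        "path": path,
        "limit": limit,
    }
    if exact:
        stage["exact"] = True
    else:
        if num_candidates is None:
            # Illustrative default for this sketch; tune it for your data.
            num_candidates = limit * 20
        if num_candidates < limit:
            raise ValueError("num_candidates must be >= limit")
        stage["numCandidates"] = num_candidates
    return [
        {"$vectorSearch": stage},
        {"$project": {"_id": 0, project_field: 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
```

For example, `collection.aggregate(build_vector_search_pipeline(query_embedding, "embedding"))` would run the same ENN query as the sample above, assuming `query_embedding` and `collection` are defined as in the tutorial.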

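Consider the following factors when you create vector embeddings:

To create vector embeddings, you must use an embedding model. Embedding models are algorithms that convert data into embeddings. You can choose one of the following methods to connect to an embedding model and create vector embeddings: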

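  • Load an open-source model: If you don't have an API key for a proprietary embedding model, load an open-source embedding model locally from your application.

  • Use a proprietary model: Most AI providers offer APIs for their proprietary embedding models that you can use to create vector embeddings.

  • Use an integration: You can integrate Atlas Vector Search with open-source frameworks and AI services to quickly connect to both open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search.

    To learn more, see Integrate Vector Search with AI Technologies.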

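The embedding model that you choose affects your query results and determines the number of dimensions that you specify in your Atlas Vector Search index. Each model offers different advantages depending on your data and use case.

For a list of popular embedding models, see the Massive Text Embedding Benchmark (MTEB). This list provides insights into various open-source and proprietary text embedding models and lets you filter models by use case, model type, and specific model metrics.

When you choose an embedding model for Atlas Vector Search, consider the following metrics:

  • Embedding Dimensions: The length of the vector embedding.

    Smaller embeddings are more storage efficient, while larger embeddings can capture more nuanced relationships in your data. The model that you choose should strike a balance between efficiency and complexity.

  • Max Tokens: The number of tokens that can be compressed into a single embedding.

  • Model Size: The size of the model in gigabytes.

    Larger models generally perform better, but they require more computational resources as you scale Atlas Vector Search to production.

  • Retrieval Average: A score that measures the performance of retrieval systems.

    A higher score indicates that the model is better at ranking relevant documents higher in the list of retrieved results. This score is important when you choose a model for RAG applications.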
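To make the storage trade-off behind the embedding dimensions concrete, the following sketch estimates the raw size of the vector data for the dimensions and precisions used on this page. It is a back-of-the-envelope calculation that counts 4 bytes per float32 dimension, 1 byte per int8 dimension, and 1 bit per int1 dimension. It ignores BSON and index overhead, and the function name is an illustrative choice.

```python
# Bytes needed to store one dimension at each precision.
BYTES_PER_DIMENSION = {"float32": 4.0, "int8": 1.0, "int1": 1 / 8}


def raw_vector_bytes(num_dimensions, num_vectors, precision="float32"):
    """Approximate raw payload size of the vectors, ignoring BSON and index overhead."""
    return int(num_dimensions * BYTES_PER_DIMENSION[precision] * num_vectors)


for dims in (768, 1536):
    for precision in ("float32", "int8", "int1"):
        mib = raw_vector_bytes(dims, 1_000_000, precision) / 2**20
        print(f"{dims:>5} dims, {precision:<7}: {mib:8.1f} MiB per million vectors")
```

Doubling the number of dimensions doubles the raw size at any precision, while moving from float32 to int8 or int1 shrinks it by a factor of 4 or 32.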

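Consider the following strategies to ensure that your embeddings are correct and optimal:

If you experience issues with your embeddings, consider the following strategies:

  • Verify your environment.

    Check that the necessary dependencies are installed and up to date. Conflicting library versions can cause unexpected behavior. To rule out conflicts, create a new environment and install only the required packages.

    Note

    If you use Colab, ensure that your notebook session's IP address is included in your Atlas project's access list.

  • Monitor memory usage.

    If you experience performance issues, check your RAM, CPU, and disk usage to identify any potential bottlenecks. For hosted environments such as Colab or Jupyter Notebooks, ensure that your instance is provisioned with sufficient resources, and upgrade the instance if necessary.

  • Ensure consistent dimensions.

    Verify that the Atlas Vector Search index definition matches the dimensions of the embeddings stored in Atlas, and that your query embeddings match the dimensions of the indexed embeddings. Otherwise, you might encounter errors when you run vector search queries.

To troubleshoot specific issues, see Troubleshooting.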
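The dimension check described above can be automated before you run a query. The following is a minimal sketch. It assumes the index definition has the same shape as the definitions on this page and that sample vectors are plain lists of numbers, not BSON binary vectors. The function name and the `vectors_by_path` argument are illustrative.

```python
def find_dimension_mismatches(index_definition, vectors_by_path):
    """Compare each indexed vector field's numDimensions with a sample vector.

    `vectors_by_path` maps a field path to one sample vector, taken either
    from a stored document or from a query embedding.
    Returns a list of human-readable problems (empty if everything matches).
    """
    problems = []
    for field in index_definition.get("fields", []):
        if field.get("type") != "vector":
            continue
        path = field["path"]
        expected = field["numDimensions"]
        vector = vectors_by_path.get(path)
        if vector is None:
            problems.append(f"{path}: no sample vector provided")
        elif len(vector) != expected:
            problems.append(
                f"{path}: index expects {expected} dimensions, got {len(vector)}"
            )
    return problems


definition = {
    "fields": [
        {"type": "vector", "path": "embedding",
         "similarity": "dotProduct", "numDimensions": 1536}
    ]
}
# A 768-dimension vector checked against a 1536-dimension index definition.
print(find_dimension_mismatches(definition, {"embedding": [0.0] * 768}))
```

Running the check once against a stored embedding and once against a query embedding covers both halves of the requirement.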

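Now that you've learned how to create embeddings and query them with Atlas Vector Search, you can start building generative AI applications by implementing retrieval-augmented generation (RAG).

You can also convert your embeddings to BSON vectors for efficient storage and ingestion of vectors in Atlas. To learn more, see How to Ingest Pre-Quantized Vectors.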
