Using MongoDB Atlas Vector Search with LangChain

Arek Borucki • 6 min read • Published Sep 18, 2024 • Updated Sep 18, 2024

Introduction to vector search in MongoDB Atlas

Vector search engines, also referred to as vector databases, semantic search, or cosine search, locate the entries closest to a given vectorized query. While conventional search methods rely on keyword references, lexical matching, and word frequency, vector search engines measure similarity by distance in the embedding space. Finding related data becomes a search for the nearest neighbors of your query.
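To make the nearest-neighbor idea concrete, here is a minimal sketch of an exhaustive (brute-force) similarity search in plain Python. The three-dimensional vectors and their names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a production engine uses an index instead of scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=2):
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings (illustrative values only)
toy_vectors = {
    "auditing": [0.9, 0.1, 0.0],
    "backups": [0.1, 0.9, 0.1],
    "security": [0.8, 0.2, 0.1],
}

top = nearest_neighbors([1.0, 0.0, 0.0], toy_vectors, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

With these toy values, "auditing" and "security" rank above "backups" because their vectors point in nearly the same direction as the query.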

Vector embeddings act as the numeric representation of data and its accompanying context, stored in dense, high-dimensional vectors. Various models, both proprietary (such as those from OpenAI) and open source (such as FastText and many of the models hosted on Hugging Face), are designed to generate these embeddings. These models can be trained on millions of samples to deliver more relevant and accurate results. In some situations, the numeric data you have collected or engineered to capture the key features of your documents can serve as embeddings. What matters is having an efficient search engine, such as MongoDB Atlas.

MongoDB Atlas is a fully managed cloud database available on AWS, Azure, and GCP. It recently added native vector search capabilities for MongoDB document data. Atlas Vector Search uses the Hierarchical Navigable Small Worlds (HNSW) algorithm to perform semantic search. Because Atlas Vector Search supports approximate nearest neighbor (ANN) queries, you can use it to find results similar to a specific product, run image searches, and more.

The $vectorSearch operator

Atlas Vector Search queries take the form of an aggregation pipeline stage and use the $vectorSearch operator. The $vectorSearch stage performs an approximate nearest neighbor (ANN) search on a vector in the specified field. The field you intend to search must be indexed with the Atlas Vector Search vector type, and $vectorSearch must be the first stage of any pipeline in which it appears.
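As a rough illustration of what such a pipeline looks like, the sketch below builds one as plain Python data. The index name "default" and the path "embedding" match the values used later in this article; the numCandidates value, the projected "text" field, and the helper name build_vector_search_pipeline are assumptions made for this example.

```python
def build_vector_search_pipeline(query_vector, index_name="default",
                                 path="embedding", num_candidates=100, limit=3):
    """Build an aggregation pipeline whose first stage is $vectorSearch."""
    # Atlas considers numCandidates neighbors and returns at most `limit` of them,
    # so the limit cannot be larger than the candidate pool.
    if limit > num_candidates:
        raise ValueError("limit must not exceed num_candidates")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep only the text (assumed field name) and the similarity score
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# A placeholder query vector; a real one comes from an embedding model
pipeline = build_vector_search_pipeline([0.0] * 1536)
print(list(pipeline[0].keys()))
```

With pymongo, a pipeline like this would be passed to collection.aggregate(pipeline). LangChain builds an equivalent stage for you, so the rest of this tutorial never writes it by hand.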

Introduction to LangChain

LangChain is an open-source framework that simplifies building applications powered by language models, with a primary focus on large language models (LLMs). It goes beyond standard API calls by being data-aware and agentic, which makes it easier to connect to various data sources and deliver richer, more personalized experiences.

This capability improves an application's ability to interact with different data sets and to adapt its behavior based on the data it retrieves. For example, a developer could use LangChain to build an application in which a user's query is processed by a large language model, which then generates a vector representation of the query. That vector representation can be used to search vector data stored in MongoDB Atlas through its vector search feature. The results from MongoDB Atlas can then be shown to the user or processed further by the language model to produce more detailed or personalized answers.

Now, let's bring all of these elements together in an architectural view.

With the theoretical groundwork laid, it is time to move from conceptual understanding to practical application. Let's dig into the implementation process to see these concepts in action.

Environment setup

Create a search index

The first step is to create the vector search index. In the Atlas UI, choose Search and then Create Search Index. (You can also use Atlas Vector Search with local Atlas deployments that you create with the Atlas CLI.) See the official MongoDB documentation to learn more.

Next, use the JSON Editor to configure fields of type vector. I named the field that contains the embedding vector "embedding".

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}

Specify the namespace (database and collection) in which the vector search index should be created. I chose the namespace langchain.vectorSearch. The code later in this tutorial refers to the index by the name default.

The similarity field in the vector definition specifies the function to use for searching the top K-nearest neighbors. The values can be:

euclidean: measures the distance between the ends of the vectors, which lets you measure similarity across varying dimensions.
cosine: measures similarity based on the angle between vectors, independent of their magnitude; it cannot be used with zero-magnitude vectors. To measure cosine similarity, it is recommended that you normalize your vectors and use dotProduct instead.
dotProduct: similar to cosine, but takes the magnitude of the vector into account, allowing an efficient measure of similarity based on both angle and magnitude. To use it, normalize the vectors to unit length at both index time and query time.
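The following sketch spells out the three measures in plain Python and shows why normalizing makes dotProduct behave like cosine. These are the textbook formulas only; the score that Atlas reports for a match may be scaled differently. The sample vectors are arbitrary.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between the end points of two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product divided by both magnitudes; undefined for zero vectors."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot_product(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero-magnitude vector")
    return [x / length for x in v]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

print("euclidean:", euclidean_distance(a, b))
print("cosine:", cosine_similarity(a, b))
print("dot product (raw):", dot_product(a, b))
print("dot product (normalized):", dot_product(normalize(a), normalize(b)))
```

The two vectors point in the same direction, so cosine reports a perfect match even though their end points are 5 units apart and the raw dot product is 50. After normalization, the dot product agrees with cosine (up to floating-point rounding), which is why unit-length vectors are required for dotProduct.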

You can run semantic searches on an Atlas cluster running MongoDB version 6.0.11 or later. Atlas Vector Search lets you store vector embeddings for any kind of data alongside the other data in your collection on the Atlas cluster. It supports embeddings of up to 4096 dimensions.
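If you generate index definitions from code, it can help to check them against the constraints mentioned above before pasting them into the JSON Editor. The sketch below is one way to do that; the helper name build_vector_index_definition is hypothetical, and the 4096 limit is simply the figure quoted in this article.

```python
import json

ALLOWED_SIMILARITY = {"euclidean", "cosine", "dotProduct"}
MAX_DIMENSIONS = 4096  # limit quoted in this article


def build_vector_index_definition(path, num_dimensions, similarity="cosine"):
    """Return a vector search index definition after basic sanity checks."""
    if similarity not in ALLOWED_SIMILARITY:
        raise ValueError(f"unsupported similarity function: {similarity!r}")
    if not 1 <= num_dimensions <= MAX_DIMENSIONS:
        raise ValueError(f"numDimensions must be between 1 and {MAX_DIMENSIONS}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# Reproduces the definition shown earlier in this section
definition = build_vector_index_definition("embedding", 1536)
print(json.dumps(definition, indent=2))
```

The numDimensions value has to match the length of the vectors your embedding model produces, which is why the example uses 1536 here.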

Now that Atlas Vector Search is configured, let's move on to setting up LangChain.

LangChain and OpenAI

In this article, we use OpenAI to generate vector embeddings. First, you need an OpenAI API key: create an account, then locate your secret API key in your user settings.

To install LangChain, first upgrade pip (for Python) or npm (for JavaScript), then run the corresponding install command.

For the Python version, use:

pip3 install pip --upgrade
pip3 install langchain

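We will also need other Python modules: pymongo for communication with MongoDB Atlas, openai for communication with the OpenAI API, pypdf and tiktoken for PDF parsing and token counting, and the LangChain integration packages (langchain-community, langchain-openai, langchain-text-splitters, and langchain-mongodb) that provide the PDF loader, the OpenAI embeddings, the text splitter, and the MongoDB vector store classes used in this tutorial.

pip3 install pymongo openai pypdf tiktoken langchain-community langchain-openai langchain-text-splitters langchain-mongodb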

Getting started with Atlas Vector Search

In our exercise, we utilize a publicly accessible PDF document titled "MongoDB Atlas Best Practices" as a data source for constructing a text-searchable vector space. The implemented Python script employs several modules to process, vectorize, and index the document's content into a MongoDB Atlas collection.

To implement it, let's start by setting and exporting the environment variables. We need the Atlas connection string and the OpenAI API key.

export OPENAI_API_KEY="xxxxxxxxxxx"
export ATLAS_CONNECTION_STRING="mongodb+srv://user:passwd@vectorsearch.abc.mongodb.net/?retryWrites=true"

Next, we can execute the code provided below. This script retrieves a PDF from a specified URL, splits the text into segments, and stores their embeddings in MongoDB Atlas for semantic search, using LangChain's embedding and vector search features. The full code is available on GitHub.

import os

from pymongo import MongoClient
# These imports assume the langchain-community, langchain-openai,
# langchain-text-splitters, and langchain-mongodb packages are installed.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Define the URL of the PDF MongoDB Atlas Best Practices document
pdf_url = "https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP"

# Retrieve environment variables for sensitive information
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
if not OPENAI_API_KEY:
    raise ValueError("The OPENAI_API_KEY environment variable is not set.")

ATLAS_CONNECTION_STRING = os.getenv('ATLAS_CONNECTION_STRING')
if not ATLAS_CONNECTION_STRING:
    raise ValueError("The ATLAS_CONNECTION_STRING environment variable is not set.")

# Connect to MongoDB Atlas cluster using the connection string
cluster = MongoClient(ATLAS_CONNECTION_STRING)

# Define the MongoDB database and collection names
DB_NAME = "langchain"
COLLECTION_NAME = "vectorSearch"

# Connect to the specific collection in the database
MONGODB_COLLECTION = cluster[DB_NAME][COLLECTION_NAME]

# Initialize the PDF loader with the defined URL and load the pages
loader = PyPDFLoader(pdf_url)
data = loader.load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# Split the document into manageable segments
docs = text_splitter.split_documents(data)

# Embed the segments and store them in the MongoDB Atlas collection
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),  # reads OPENAI_API_KEY from the environment
    collection=MONGODB_COLLECTION,
    index_name="default"  # Name of the vector search index created earlier
)
# At this point, the segments are embedded and stored in MongoDB Atlas, ready for vector search.

Upon completion of the script, the PDF has been segmented and its vector representations are now stored within the langchain.vectorSearch namespace in MongoDB Atlas.
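To see what chunk_size=1000 and chunk_overlap=150 mean in practice, here is a deliberately simplified sketch of overlapping chunking. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and lines and treats chunk_size as an upper bound); the function name split_with_overlap is hypothetical and only illustrates the sliding-window idea.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 2,500-character dummy text made of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(chunk) for chunk in chunks])  # prints [1000, 1000, 800]
```

The last 150 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary still appears intact in at least one chunk.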

Run a similarity search query in Atlas Vector Search

"MongoDB Atlas auditing" serves as our search statement for the similarity search. Using the OpenAIEmbeddings class, we generate a vector embedding for this phrase. A similarity search then finds and returns the three documents in our MongoDB Atlas collection that are most semantically related to it.

In the first step, we need to create a MongoDBAtlasVectorSearch object:

def create_vector_search():
    """
    Creates a MongoDBAtlasVectorSearch object using the connection string, database, and collection names, along with the OpenAI embeddings and index configuration.

    :return: MongoDBAtlasVectorSearch object
    """
    # Reuses ATLAS_CONNECTION_STRING, DB_NAME, and COLLECTION_NAME from the script above
    vector_search = MongoDBAtlasVectorSearch.from_connection_string(
        ATLAS_CONNECTION_STRING,
        f"{DB_NAME}.{COLLECTION_NAME}",
        OpenAIEmbeddings(),
        index_name="default"
    )
    return vector_search

Next, we can perform a similarity search.

def perform_similarity_search(query, top_k=3):
    """
    Performs a similarity search within a MongoDB Atlas collection. It relies on Atlas Vector Search, which uses the `$vectorSearch` stage under the hood, to find and return the top `k` documents that semantically match the provided query.

    :param query: The search query string.
    :param top_k: Number of top matches to return.
    :return: A list of the top `k` matching documents with their similarity scores.
    """

    # Get the MongoDBAtlasVectorSearch object
    vector_search = create_vector_search()

    # Execute the similarity search with the given query
    results = vector_search.similarity_search_with_score(
        query=query,
        k=top_k,
    )

    return results

# Example of calling the function directly
search_results = perform_similarity_search("MongoDB Atlas auditing")

The function returns the documents in a MongoDB Atlas collection that are most semantically relevant to a given search query. When executed, it provides a list of the documents most similar to the query "MongoDB Atlas auditing". Each entry in the list pairs the matching document's content with a similarity score that reflects how closely the document aligns with the intent of the query. The function returns the top k matches; k defaults to 3 in the code above (the top_k parameter) but can be set to any number of results. The code is available on GitHub.
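Each result is a pair of a document and a score, so a small helper can turn the list into readable output. The sketch below uses a stand-in class, FakeDocument, with made-up text, page numbers, and scores so that it runs without a database; real results would be LangChain Document objects, which expose the same page_content and metadata attributes.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (hypothetical, for illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_results(results, preview_chars=80):
    """Turn (document, score) pairs into one readable line per match."""
    lines = []
    for rank, (doc, score) in enumerate(results, start=1):
        page = doc.metadata.get("page", "?")
        # Collapse line breaks and extra spaces, then keep a short preview
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{rank}. score={score:.3f} page={page} {preview}")
    return lines


# Made-up results shaped like the output of similarity_search_with_score
sample_results = [
    (FakeDocument("Auditing lets you track\nsystem activity.", {"page": 12}), 0.9312),
    (FakeDocument("Backups can be scheduled.", {"page": 30}), 0.8841),
]

for line in format_results(sample_results):
    print(line)
```

In the tutorial's script, you would pass search_results to format_results instead of the sample list.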

Summary

MongoDB Atlas Vector Search enhances AI applications by letting you store vector embeddings alongside your MongoDB documents. It simplifies creating search indexes and running approximate nearest neighbor (ANN) searches through the $vectorSearch aggregation stage, using the Hierarchical Navigable Small Worlds algorithm for efficient nearest neighbor lookups. The integration with LangChain builds on this functionality, contributing to more streamlined and powerful semantic search capabilities. Harness the potential of MongoDB Atlas Vector Search and LangChain to meet your semantic search needs today!

In the next blog post, we will dig into LangChain Templates, a new set of features for enhancing the functionality of MongoDB Atlas Vector Search. We will also examine the role of retrieval-augmented generation (RAG) in semantic search and AI development. Don't miss the detailed exploration in our next article!

Questions? Comments? We’d love to continue the conversation over in the Developer Community forum.
