Part #3: Semantically Search Your Data With MongoDB Atlas Vector Search

Dominic Frei • 6 min read • Published Feb 09, 2024 • Updated Sep 18, 2024


This final part of the series will show you how to use the Amazon SageMaker endpoint created in the previous part and perform a semantic search on your data using MongoDB Atlas Vector Search.

This article is the third part of a three-part series.

The two parts shown in this tutorial will be:

Creating and updating embeddings/vectors for your data.
Creating vectors for a search query and sending them via Atlas Vector Search.

Creating a MongoDB cluster and loading the sample data

If you haven’t done so, create a new cluster in your MongoDB Atlas account. Make sure to check [Add sample dataset](https://www.mongodb.com/pt-br/developer/products/atlas/atlas-sample-datasets/) to get the sample data we will be working with right away into your cluster.

If you are using a cluster that has already been set up, click on the three dots in your cluster and then Load Sample Dataset.

Creating a Vector Search index

There is one more step we need to take in Atlas, which is creating a search index, specifically for Vector Search.

In your database overview, click on Create Index.

The Search page will be shown. Click on Create Search Index here.

Then choose Atlas Vector Search -> JSON Editor.

Open the sample_mflix database and choose the embedded_movies collection (not movies). Name your index (here, VectorSearchIndex).

The configuration for the index needs to state the number of dimensions. That depends on the model and in our case, it’s 384. The path tells the index which field will be used to store the vectors. We’ll call it embedding here. The similarity for text is usually done with the cosine function.

{
  "fields": [
    {
      "numDimensions": 384,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
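To make the cosine setting more concrete, here is a minimal sketch of what cosine similarity measures. It is plain Python for illustration only and is not how Atlas computes the score internally; the function name cosine_similarity and the tiny two-dimensional vectors are assumptions made for this example (our real vectors have 384 dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")

    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")

    return dot_product / (norm_a * norm_b)


# Vectors pointing in the same direction score 1.0, regardless of length.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because only the direction of the vectors matters and not their length, cosine similarity is a common choice for text embeddings.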

Hit Create and you’re good to go.

If you want to, you can test your new index before continuing.

Preparing embeddings

Are you ready for the final part?

Let’s have a look at the code (here, in Python)!

You can find the full repository on GitHub.

In the following section, we will look at the three relevant files that show you how you can implement a server app that uses the Amazon SageMaker endpoint.

Accessing the endpoint: sagemaker.py

The sagemaker.py module is the wrapper around the Lambda/Gateway endpoint that we created in the previous part.

Make sure to create a .env file with the URL saved in EMBEDDING_SERVICE.

It should look like this:

MONGODB_CONNECTION_STRING="mongodb+srv://<username>:<password>@<cluster_id>.mongodb.net/?retryWrites=true&w=majority"
EMBEDDING_SERVICE="https://<your_endpoint>.amazonaws.com/TEST/sageMakerResource"

The following function will then attach the query that we want to search for to the URL and execute it.

import os
from typing import Optional
from urllib.parse import quote

import requests
from dotenv import load_dotenv

load_dotenv()

EMBEDDING_SERVICE = os.environ.get("EMBEDDING_SERVICE")

As a result, we expect to find the vector in a JSON field called embedding.

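def create_embedding(plot: str) -> list[float]:
    # URL-encode the text so it is safe to pass as a query parameter.
    encoded_plot = quote(plot)
    embedding_url = f"{EMBEDDING_SERVICE}?query={encoded_plot}"

    embedding_response = requests.get(embedding_url)
    # Fail early on HTTP errors instead of on a missing JSON field.
    embedding_response.raise_for_status()
    embedding_vector = embedding_response.json()["embedding"]

    return embedding_vector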
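To see what the URL encoding does to a query, here is a small sketch that only builds the request URL and does not call the service. The helper name build_embedding_url and the example.invalid host are assumptions made for this illustration; they are not part of the repository.

```python
from urllib.parse import quote


def build_embedding_url(service_url: str, text: str) -> str:
    """Attach the URL-encoded text as the `query` parameter."""
    return f"{service_url}?query={quote(text)}"


# Hypothetical endpoint, used only to show the encoding.
url = build_embedding_url(
    "https://example.invalid/TEST/sageMakerResource",
    "Earth, Mars & an invasion",
)
print(url)
```

Spaces, commas, and ampersands are percent-encoded, so a plot containing an & is not misread as the start of a second query parameter.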

Accessing and searching the data: atlas.py

The module atlas.py is the wrapper around everything MongoDB Atlas.

Similar to sagemaker.py, we first grab the MONGODB_CONNECTION_STRING that you can retrieve in Atlas by clicking on Connect in your cluster. It’s the authenticated URL to your cluster. We need to save MONGODB_CONNECTION_STRING to our .env file too.

We then define a few constants for the names we set up in earlier steps, like VectorSearchIndex and embedding, along with the names of the automatically created sample_mflix demo data.

Using the MongoDB driver for Python (called PyMongo), we then create a MongoClient which holds the connection to the Atlas cluster.

import os

from dotenv import load_dotenv
from pymongo import MongoClient, UpdateOne

from sagemaker import create_embedding

load_dotenv()

MONGODB_CONNECTION_STRING = os.environ.get("MONGODB_CONNECTION_STRING")
DATABASE_NAME = "sample_mflix"
COLLECTION_NAME = "embedded_movies"
VECTOR_SEARCH_INDEX_NAME = "VectorSearchIndex"
EMBEDDING_PATH = "embedding"
mongo_client = MongoClient(MONGODB_CONNECTION_STRING)
database = mongo_client[DATABASE_NAME]
movies_collection = database[COLLECTION_NAME]
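Both modules read their settings with os.environ.get, which returns None when a variable is missing, so a typo in the .env file only surfaces later as a less obvious error. As an optional safeguard, a sketch like the following fails fast instead. The helper name require_env is an assumption made for this example and is not part of the repository.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file"
        )
    return value
```

With this in place, MONGODB_CONNECTION_STRING = require_env("MONGODB_CONNECTION_STRING") would stop the app at startup with a readable message when the variable is missing.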

The first step will be to actually prepare the already existing data with embeddings.

This is the sole purpose of the add_missing_embeddings function.

We’ll create a filter for the documents with missing embeddings and retrieve those from the database, only showing their plot, which is the only field we’re interested in for now.

Assuming we will only find a couple every time, we can then go through them and call the create_embedding endpoint for each, creating an embedding for the plot of the movie.

We’ll then add those new embeddings to the movies_to_update array so that we eventually only need one bulk_write to the database, which makes the call more efficient.

Note that for huge datasets with many embeddings to create, you might want to adjust the Lambda function to take an array of queries instead of just a single query. For this simple example, one request per plot will do.

def add_missing_embeddings():
    # Match movies that have a non-empty plot but no embedding yet.
    movies_with_a_plot_without_embedding_filter = {
        "$and": [
            {"plot": {"$exists": True, "$ne": ""}},
            {"embedding": {"$exists": False}},
        ]
    }
    # Only fetch the plot (the _id is included by default).
    only_show_plot_projection = {"plot": 1}

    movies = movies_collection.find(
        movies_with_a_plot_without_embedding_filter,
        only_show_plot_projection,
    )

    movies_to_update = []

    for movie in movies:
        # One call to the embedding endpoint per movie.
        embedding = create_embedding(movie["plot"])
        update_operation = UpdateOne(
            {"_id": movie["_id"]},
            {"$set": {"embedding": embedding}},
        )
        movies_to_update.append(update_operation)

    # Send all updates to the database in a single round trip.
    if movies_to_update:
        result = movies_collection.bulk_write(movies_to_update)
        print(f"Updated {result.modified_count} movies")
    else:
        print("No movies to update")
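For the huge-dataset case mentioned above, the work would usually be split into batches, both for a Lambda function that accepts several queries at once and for the bulk_write calls. The following is a minimal sketch of the batching part only. The helper name chunked is an assumption made for this example, and a Lambda function that accepts an array of queries is hypothetical, since the one built in this series takes a single query.

```python
from typing import Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Five items in batches of two: the last batch holds the remainder.
print(list(chunked(["a", "b", "c", "d", "e"], 2)))
```

In add_missing_embeddings, a loop such as for batch in chunked(movies_to_update, 500) could then call movies_collection.bulk_write(batch) once per batch. The batch size of 500 is an arbitrary value chosen for illustration.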

Now that the data is prepared, we add two more functions that we need to offer a nice REST service for our client application.

First, we want to be able to update the plot, which inherently means we need to update the embeddings again.

The update_plot function is similar to the initial add_missing_embeddings function but a bit simpler since we only need to update one document.

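def update_plot(title: str, plot: str) -> dict:
    # A changed plot needs a fresh embedding to stay searchable.
    embedding = create_embedding(plot)

    result = movies_collection.find_one_and_update(
        {"title": title},
        {"$set": {"plot": plot, "embedding": embedding}},
        # Leave out the ObjectId, which Flask's jsonify cannot serialize.
        projection={"_id": 0},
        # True (ReturnDocument.AFTER) returns the updated document.
        return_document=True,
    )

    # None when no movie matches the title.
    return result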

The other function we need to offer is the actual vector search. This can be done using the MongoDB aggregation pipeline, which can be accessed via PyMongo.

The $vectorSearch stage needs to include the index name we want to use, the path to the embedding, the query vector, and how many candidates to consider and results to return. This time, we only want to retrieve the title, so we add a $project stage to the pipeline. Make sure to use list to turn the cursor that the search returns into a Python list.

def execute_vector_search(vector: list[float]) -> list[dict]:
    vector_search_query = {
        "$vectorSearch": {
            "index": VECTOR_SEARCH_INDEX_NAME,
            "path": EMBEDDING_PATH,
            "queryVector": vector,
            # Consider 10 nearest candidates, return the best 5.
            "numCandidates": 10,
            "limit": 5,
        }
    }
    # Only return the title of each match.
    projection = {"$project": {"_id": 0, "title": 1}}
    results = movies_collection.aggregate([vector_search_query, projection])
    # Turn the cursor into a plain list.
    results_list = list(results)

    return results_list
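If you also want to see how close each match is, the $project stage can include the relevance score through the vectorSearchScore metadata. The sketch below only builds the pipeline and leaves the database call out. The function name build_vector_search_pipeline, its parameters, and the default of 100 candidates are assumptions made for this illustration.

```python
def build_vector_search_pipeline(
    query_vector: list[float],
    index_name: str = "VectorSearchIndex",
    path: str = "embedding",
    limit: int = 5,
    num_candidates: int = 100,
) -> list[dict]:
    """Build a $vectorSearch pipeline that also returns the match score."""
    # Atlas rejects a limit that is larger than numCandidates.
    if limit > num_candidates:
        raise ValueError("limit must not exceed num_candidates")

    vector_search_stage = {
        "$vectorSearch": {
            "index": index_name,
            "path": path,
            "queryVector": query_vector,
            "numCandidates": num_candidates,
            "limit": limit,
        }
    }
    project_stage = {
        "$project": {
            "_id": 0,
            "title": 1,
            # Expose the similarity score computed by the search stage.
            "score": {"$meta": "vectorSearchScore"},
        }
    }

    return [vector_search_stage, project_stage]
```

The result can be passed straight to movies_collection.aggregate(...). A larger num_candidates generally trades some latency for better recall.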

Putting it all together: main.py

Now, we can put it all together. Let’s use Flask to expose a REST service for our client application.

from flask import Flask, request, jsonify

from atlas import execute_vector_search, update_plot
from sagemaker import create_embedding

app = Flask(__name__)

One route we want to expose is /movies/<title>, which can be called with a PUT operation to update the plot of a movie given its title. The title is a path parameter, while the plot is passed in via the JSON body. This function uses the update_plot function that we created before in atlas.py and returns the movie with its new plot on success.

@app.route("/movies/<title>", methods=["PUT"])
def update_movie(title: str):
    try:
        request_json = request.get_json()
        plot = request_json["plot"]
        updated_movie = update_plot(title, plot)

        if updated_movie:
            return jsonify(
                {
                    "message": "Movie updated successfully",
                    "updated_movie": updated_movie,
                }
            )
        else:
            return jsonify({"error": f"Movie with title {title} not found"}), 404

    except Exception as e:
        return jsonify({"error": str(e)}), 500
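One detail of the route above: a body without a plot field raises a KeyError, which the handler reports as a 500 even though the mistake is on the client side. If you prefer to answer such requests with a 400, a small validation helper is one option. This is a sketch only; the helper name get_required_string is an assumption and is not part of the repository.

```python
def get_required_string(payload: object, field: str) -> str:
    """Return a non-empty string field from a JSON body or raise ValueError."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")

    value = payload.get(field)
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"'{field}' must be a non-empty string")

    return value
```

Inside the route, plot = get_required_string(request_json, "plot") could then be wrapped in its own except ValueError branch that returns jsonify({"error": str(e)}), 400.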

The other endpoint, finally, is the vector search: /movies/search.

A query is POSTed to this endpoint, which first uses create_embedding to create a vector from the query. Note that we also need to create a vector for the query because that is what the vector search compares against the actual data (or rather, its embeddings).

We then call execute_vector_search with this embedding to retrieve the results, which will be returned on success.

@app.route("/movies/search", methods=["POST"])
def search_movies():
    try:
        request_json = request.get_json()
        query = request_json["query"]
        embedding = create_embedding(query)

        results = execute_vector_search(embedding)

        jsonified_results = jsonify(
            {
                "message": "Movies searched successfully",
                "results": results,
            }
        )

        return jsonified_results

    except Exception as e:
        return jsonify({"error": str(e)}), 500


if __name__ == "__main__":
    app.run(debug=True)

And that’s about all you have to do. Easy, wasn’t it?

Go ahead and run the Flask app (main.py) and, when it is ready, send a cURL request to see Atlas Vector Search in action. Here is an example when running it locally:

curl -X POST -H "Content-Type: application/json" -d '{"query": "A movie about the Earth, Mars and an invasion."}' http://127.0.0.1:5000/movies/search

This should lead to the following result:

{
  "message": "Movies searched successfully",
  "results": [
    {
      "title": "The War of the Worlds"
    },
    {
      "title": "The 6th Day"
    },
    {
      "title": "Pixels"
    },
    {
      "title": "Journey to Saturn"
    },
    {
      "title": "Moonraker"
    }
  ]
}

War of the Worlds — a movie about Earth, Mars, and an invasion. And what a great one, right?
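If you would rather call the search endpoint from Python than from cURL, the same request can be sketched with the standard library. The function names build_search_request and search_movies_remote are assumptions made for this example, and the base URL assumes the Flask app is running locally on its default port.

```python
import json
import urllib.request


def build_search_request(
    query: str, base_url: str = "http://127.0.0.1:5000"
) -> urllib.request.Request:
    """Build the POST request for /movies/search without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/movies/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_movies_remote(query: str) -> dict:
    """Send the request and decode the JSON response (needs a running app)."""
    with urllib.request.urlopen(build_search_request(query)) as response:
        return json.load(response)
```

Calling search_movies_remote("A movie about the Earth, Mars and an invasion.") while main.py is running should return the same JSON structure as the cURL call above.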

That’s a wrap!

Of course, this is just a quick and short overview of how to use Amazon SageMaker to create vectors and then search via Vector Search.

We do have a full workshop for you to learn about all those parts in detail. Please visit the Search Lab GitHub page to learn more.

✅ Sign up for a free cluster.

✅ Already have an AWS account? Atlas supports pay-as-you-go billing through the AWS Marketplace (AWS MP) with no upfront commitment. Simply sign up for MongoDB Atlas via the AWS Marketplace.

✅ Get help on our community forums.


Part of a series

Vector Search with MongoDB Atlas and Amazon SageMaker

More in this series

Part #1: Build your own Vector Search with MongoDB Atlas and Amazon SageMaker
Part #2: Create a model endpoint with Amazon SageMaker, AWS Lambda, and AWS API Gateway
