Quick Start 2: Vector Search With MongoDB and OpenAI
This quick start will guide you through performing vector search using MongoDB Atlas and the OpenAI API.
Code (Python notebook): View on GitHub or Open in Colab
What you will learn:
- Creating a vector index on Atlas
- Performing vector search using OpenAI embeddings

What you will need:
- A free Atlas account — create one now!
- A Python Jupyter notebook environment — we recommend Google Colab. It is a free, cloud-based environment and very easy to get up and running.
You may find this quick start helpful in getting Atlas and a Python client running:
Getting Started with MongoDB Atlas and Python.
In the realm of information retrieval, keyword search has long been the standard. This method involves matching exact words within texts to find relevant information. For instance, if you're trying to locate a film but can only recall that its title includes the word "battle," a keyword search enables you to filter through content to find matches.
However, what if your memory of a movie is vague, limited to a general plot or theme rather than specific titles or keywords? This is where vector search steps in, revolutionizing how we find information. Unlike keyword search, vector search delves into the realm of semantics, allowing for the retrieval of content based on the meanings behind the words.
Consider you're trying to find a movie again, but this time, all you remember is a broad plot description like "humans fight aliens." Traditional search methods might leave you combing through endless irrelevant results. Vector search, however, uses advanced algorithms to understand the contextual meaning of your query, capable of guiding you to movies that align with your description — such as "Terminator" — even if the exact words aren't used in your search terms.
Let's understand how all the pieces fit together.
We are going to use the embedded_movies collection in the Atlas sample data. This one already has embeddings calculated for plots, making our lives easier.
Here is how it all works. When a semantic search query is issued (e.g., "fatalistic sci-fi movies"):
- Steps 1 and 2: We call the OpenAI API to get embeddings for the query text.
- Step 3: Send the embedding to Atlas to perform a vector search.
- Step 4: Atlas returns relevant search results using Vector Search.
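The four steps above can be sketched as a short Python flow. This is purely illustrative: the helper functions here are stand-ins for the OpenAI and Atlas clients we build later in this guide, not real API calls.

```python
# High-level flow of a semantic search query.
# The two helpers are illustrative stand-ins, not real client code.

def get_openai_embedding(text):
    # Stand-in for Steps 1-2: a real call returns a 1536-dimensional
    # vector from the OpenAI embeddings API.
    return [0.1, 0.2, 0.3]

def atlas_vector_search(embedding):
    # Stand-in for Steps 3-4: a real call runs a $vectorSearch
    # aggregation on Atlas and returns matching documents.
    return [{"title": "Terminator", "search_score": 0.95}]

def semantic_search(query_text):
    query_embedding = get_openai_embedding(query_text)  # Steps 1-2
    return atlas_vector_search(query_embedding)         # Steps 3-4

print(semantic_search("fatalistic sci-fi movies")[0]["title"])
```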
Here is a visual:
Embeddings are an interesting way of transforming different types of data — whether it's text, images, audio, or video — into a numerical format, specifically, into an array known as a “vector.” This conversion allows the data to be processed and understood by machines.
Take text data as an example: Words can be converted into numbers, with each unique word assigned its own distinct numerical value. These numerical representations can vary in size, ranging anywhere from 128 to 4096 elements.
However, what sets embeddings apart is their ability to capture more than just random sequences of numbers. They actually preserve some of the inherent meaning of the original data. For instance, words that share similar meanings tend to have embeddings that are closer together in the numerical space.
To illustrate, consider a simplified scenario where we plot the embeddings of several words on a two-dimensional graph for easier visualization. Though in practice, embeddings can span many dimensions (from 128 to 4096), this example helps clarify the concept. On the graph, you'll notice that items with similar contexts or meanings — like different types of fruits or various pets — are positioned closer together. This clustering is a key strength of embeddings, highlighting their ability to capture and reflect the nuances of meaning and similarity within the data.
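To make this concrete, here is a toy example with made-up two-dimensional vectors (real embeddings have hundreds or thousands of dimensions). Words with related meanings, like two fruits, score a higher cosine similarity than unrelated words.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2D "embeddings", purely for illustration.
apple  = [0.9, 0.1]
banana = [0.8, 0.2]
dog    = [0.1, 0.9]

print(cosine_similarity(apple, banana))  # fruits cluster together: high similarity
print(cosine_similarity(apple, dog))     # unrelated concepts: much lower similarity
```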
So, how do we go about creating these useful embeddings? Thankfully, there's a variety of embedding models out there designed to transform your text, audio, or video data into meaningful numerical representations.
Some of these models are proprietary, meaning they are owned by certain companies and accessible mainly through their APIs. OpenAI is a notable example of a provider offering such models.
There are also open-source models available. These can be freely downloaded and operated on your own computer. Whether you opt for a proprietary model or an open-source option depends on your specific needs and resources.
Hugging Face's embedding model leaderboard is a great place to start looking for embedding models. They periodically test available embedding models and rank them according to various criteria.
You can read more about embeddings:
- Explore some of the embedding choices: RAG Series Part 1: How to Choose the Right Embedding Model for Your Application, by Apoorva Joshi
Here is a quick guide adapted from the official documentation. Refer to the documentation for full details.
Sign up for Atlas and log into your account.
- You can choose any cloud instance.
- Choose the “FREE” tier, so you won't incur any costs.
- Follow the setup wizard and give your instance a name.
- Note your username and password to connect to the instance.
- Configuring IP access: Add 0.0.0.0/0 to the IP access list so you can connect from Google Colab. (Note: This makes the instance reachable from any IP address, which is fine for a test instance.) See the screenshot below for how to add the IP:
Next, we'll load the default sample datasets in Atlas, which may take a few minutes.
In the Atlas UI, explore the embedded_movies collection within the sample_mflix database to view document details like title, year, and plot.
Fortunately, the sample_mflix.embedded_movies dataset already includes vector embeddings for plots, generated with OpenAI's text-embedding-ada-002 model. By inspecting the plot_embedding attribute in the Atlas UI, as shown in the screenshot below, you'll find it comprises an array of 1536 numbers.
Congrats! You now have an Atlas cluster, with some sample data. 👏
Before we can run a vector search, we need to create a vector index. Creating an index allows Atlas to execute queries faster. Here is how to create a vector index.
Let's define a vector index as below. Here is what the parameters mean.
- "type": "vector" — This indicates we are defining a vector index.
- "path": "plot_embedding" — This is the attribute we are indexing — in our case, the embedding data of plot.
- "numDimensions": 1536 — This indicates the dimension of the embedding field. This has to match the embedding model we have used (in our case, the OpenAI model).
- "similarity": "dotProduct" — Finally, we are defining the matching algorithm to be used by the vector index. The choices are euclidean, cosine, and dotProduct. You can read more about these choices in How to Index Fields for Vector Search.
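For intuition, here is how the three similarity metrics compare on a pair of small example vectors. This is purely illustrative; Atlas computes these internally when it runs a vector search.

```python
import math

# Two small example vectors (real embeddings have 1536 dimensions here).
a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 4.0]

# dotProduct: sum of element-wise products.
dot_product = sum(x * y for x, y in zip(a, b))

# euclidean: straight-line distance between the two points.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# cosine: dot product normalized by the vector magnitudes.
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
cosine = dot_product / (norm_a * norm_b)

print(f"dotProduct: {dot_product}")   # 20.0
print(f"euclidean:  {euclidean:.4f}") # sqrt(3) ≈ 1.7321
print(f"cosine:     {cosine:.4f}")
```

OpenAI's text-embedding-ada-002 returns normalized (unit-length) vectors, which is why dotProduct is a suitable choice for this dataset.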
Index name: idx_plot_embedding
Index definition:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding",
      "numDimensions": 1536,
      "similarity": "dotProduct"
    }
  ]
}
```
Wait until the index is ready to be used
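If you prefer to check readiness programmatically rather than watching the Atlas UI, recent pymongo versions (4.5+) expose list_search_indexes. The sketch below assumes the returned index document reports a status field that becomes "READY", per the Atlas documentation; treat it as a starting point, not the definitive API.

```python
import time

def wait_for_index_ready(collection, index_name, poll_seconds=5):
    # Poll Atlas until the named search index reports READY.
    # Assumes pymongo >= 4.5 and a 'status' field on the index document.
    while True:
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("status") == "READY":
            return True
        time.sleep(poll_seconds)

# Usage (with a connected pymongo collection):
# wait_for_index_ready(db["embedded_movies"], "idx_plot_embedding")
```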
We will start by setting the following configuration parameters:
- Atlas connection credentials — see below for a step-by-step guide.
- OpenAI API key — get it from the OpenAI dashboard.
Here is how you get the ATLAS_URI setting.
- Navigate to the Atlas UI.
- Select your database.
- Choose the “Connect” option to proceed.
- Within the connect section, click on “Drivers” to view connection details.
- Finally, copy the displayed ATLAS_URI value for use in your application's configuration.
See these screenshots as guidance.
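The copied value will look something like the placeholder below (the username, password, and cluster host are yours to substitute; never commit real credentials to source control).

```python
# Placeholder only -- substitute your own username, password, and cluster host.
ATLAS_URI = "mongodb+srv://<username>:<password>@<cluster-host>.mongodb.net/?retryWrites=true&w=majority"
```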
Now, let's look at the code. We will walk through and execute the code step by step. You can also access the fully functional Python notebook at the beginning of this guide.
Start by setting up configurations for ATLAS_URI and OPENAI_API_KEY.
(Run this code block in your Google Colab under Step 3.)
```python
# We will keep all global variables in an object to not pollute the global namespace.
class MyConfig(object):
    pass

MY_CONFIG = MyConfig()

MY_CONFIG.ATLAS_URI = "Enter your Atlas URI value here"  ## TODO
MY_CONFIG.OPENAI_API_KEY = "Enter your OpenAI API Key here"  ## TODO
```
Pro tip 💡
We will keep all global variables in an object called MY_CONFIG so as not to pollute the global namespace. MyConfig is just a placeholder class to hold our variables and settings.
Let's install the dependencies required. We are installing two packages:
- pymongo: Python library to connect to MongoDB Atlas instances
- openai: For calling the OpenAI API
(Run this code block in your Google Colab under Step 4.)
```shell
!pip install openai==1.13.3 pymongo==4.6.2
```
Pro tip 💡
You will notice that we are specifying a version (openai==1.13.3) for packages we are installing. This ensures the versions we are installing are compatible with our code. This is a good practice and is called version pinning or freezing.
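An alternative to pinning versions inline is a requirements.txt file, which makes the same pins easy to share and reinstall. A minimal shell sketch, using the same versions as above:

```shell
# Write the pinned dependencies to requirements.txt.
printf 'openai==1.13.3\npymongo==4.6.2\n' > requirements.txt
cat requirements.txt

# Install them all at once with:
#   pip install -r requirements.txt
```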
AtlasClient
This class handles establishing connections, running queries, and performing a vector search on MongoDB Atlas.
(Run this code block in your Google Colab under Step 5.)
```python
from pymongo import MongoClient


class AtlasClient():

    def __init__(self, atlas_uri, dbname):
        self.mongodb_client = MongoClient(atlas_uri)
        self.database = self.mongodb_client[dbname]

    ## A quick way to test if we can connect to Atlas instance
    def ping(self):
        self.mongodb_client.admin.command('ping')

    def get_collection(self, collection_name):
        collection = self.database[collection_name]
        return collection

    def find(self, collection_name, filter={}, limit=10):
        collection = self.database[collection_name]
        items = list(collection.find(filter=filter, limit=limit))
        return items

    # https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/
    def vector_search(self, collection_name, index_name, attr_name, embedding_vector, limit=5):
        collection = self.database[collection_name]
        results = collection.aggregate([
            {
                '$vectorSearch': {
                    "index": index_name,
                    "path": attr_name,
                    "queryVector": embedding_vector,
                    "numCandidates": 50,
                    "limit": limit,
                }
            },
            ## We are extracting 'vectorSearchScore' here
            ## columns with 1 are included, columns with 0 are excluded
            {
                "$project": {
                    '_id': 1,
                    'title': 1,
                    'plot': 1,
                    'year': 1,
                    "search_score": {"$meta": "vectorSearchScore"}
                }
            }
        ])
        return list(results)

    def close_connection(self):
        self.mongodb_client.close()
```
Initializing class:
The constructor (__init__) function takes two arguments:
- Atlas URI (that we obtained from settings)
- Database to connect to
Ping:
This is a handy method to test if we can connect to Atlas.
find
This is the “search” function. We specify the collection to search and any search criteria using filters.
vector_search
This is a key function that performs vector search on MongoDB Atlas. It takes the following parameters:
- collection_name: embedded_movies
- index_name: idx_plot_embedding
- attr_name: "plot_embedding"
- embedding_vector: Embeddings returned from the OpenAI API call
- limit: How many results to return
The $project section extracts the attributes we want to return as search results.
(This code block is for review purposes. No need to execute.)
```python
results = collection.aggregate([
    {
        '$vectorSearch': {
            "index": index_name,
            "path": attr_name,
            "queryVector": embedding_vector,
            "numCandidates": 50,
            "limit": limit,
        }
    },
    ## We are extracting 'vectorSearchScore' here
    ## columns with 1 are included, columns with 0 are excluded
    {
        "$project": {
            '_id': 1,
            'title': 1,
            'plot': 1,
            'year': 1,
            "search_score": {"$meta": "vectorSearchScore"}
        }
    }
])
```
Also, note this line:
```python
"search_score": { "$meta": "vectorSearchScore" }
```
This particular line extracts the search score of the vector search. The search score ranges from 0.0 to 1.0, and scores close to 1.0 indicate a strong match.
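For example, you can use the score to keep only strong matches. The threshold and the sample documents below are arbitrary, chosen purely for illustration.

```python
# Sample results shaped like the documents our $project stage returns.
movies = [
    {"title": "V: The Final Battle", "search_score": 0.957},
    {"title": "Starship Troopers", "search_score": 0.952},
    {"title": "Some Weak Match", "search_score": 0.80},
]

# Keep only strong matches (0.95 is an arbitrary illustrative cutoff).
strong_matches = [m for m in movies if m["search_score"] >= 0.95]
print([m["title"] for m in strong_matches])
```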
This is a handy class for OpenAI interaction.
(Run this code block in your Google Colab under Step 5.)
```python
from openai import OpenAI


class OpenAIClient():
    def __init__(self, api_key) -> None:
        self.client = OpenAI(
            api_key=api_key,  # defaults to os.environ.get("OPENAI_API_KEY")
        )
        # print("OpenAI Client initialized!")

    def get_embedding(self, text: str, model="text-embedding-ada-002") -> list[float]:
        text = text.replace("\n", " ")
        resp = self.client.embeddings.create(
            input=[text],
            model=model)
        return resp.data[0].embedding
```
Initializing class:
This class is initialized with the OpenAI API key.
get_embedding method:
- text: This is the text we are trying to get embeddings for.
- model: This is the embedding model. Here we are specifying the model text-embedding-ada-002 because this is the model that is used to create embeddings in our sample data. So we want to use the same model to encode our query string.
Initialize the Atlas client and do a quick connectivity test. We are connecting to the sample_mflix database and the embedded_movies collection. This dataset is loaded as part of the setup (Step 1).
If everything goes well, the connection will succeed.
(Run this code block in your Google Colab under Step 6.)
```python
MY_CONFIG.DB_NAME = 'sample_mflix'
MY_CONFIG.COLLECTION_NAME = 'embedded_movies'
MY_CONFIG.INDEX_NAME = 'idx_plot_embedding'

atlas_client = AtlasClient(MY_CONFIG.ATLAS_URI, MY_CONFIG.DB_NAME)
atlas_client.ping()
print('Connected to Atlas instance! We are good to go!')
```
Troubleshooting
If you get a “connection failed” error, make sure 0.0.0.0/0 is added as an allowed IP address to connect (see Step 1).
Initialize the OpenAI client with the OpenAI API key.
(Run this code block in your Google Colab under Step 7.)
```python
openAI_client = OpenAIClient(api_key=MY_CONFIG.OPENAI_API_KEY)
print("OpenAI client initialized")
```
Now that we have everything set up, let's do a vector search! We are going to query movie plots, not just based on keywords but also meaning. For example, we will search for movies where the plot is "humans fighting aliens."
This function takes one argument: the query string.
- We convert the query into embeddings. We do this by calling the OpenAI API. We also time the API call (t1b - t1a) so we understand the network latencies.
- We send the embeddings (we just got back from OpenAI) to Atlas to perform a vector search and get the results.
- We are printing out the results returned by the vector search.
(Run this code block in your Google Colab under Step 8.)
```python
import time

# Handy function
def do_vector_search(query: str) -> None:
    query = query.lower().strip()  # cleanup query string
    print('query: ', query)

    # call OpenAI API to convert text into embedding
    t1a = time.perf_counter()
    embedding = openAI_client.get_embedding(query)
    t1b = time.perf_counter()
    print(f"Getting embeddings from OpenAI took {(t1b-t1a)*1000:,.0f} ms")

    # perform a vector search on Atlas
    # using embeddings (returned from OpenAI above)
    t2a = time.perf_counter()
    movies = atlas_client.vector_search(collection_name=MY_CONFIG.COLLECTION_NAME,
                                        index_name=MY_CONFIG.INDEX_NAME,
                                        attr_name='plot_embedding',
                                        embedding_vector=embedding,
                                        limit=10)
    t2b = time.perf_counter()

    # and printing out the results
    print(f"Atlas query returned {len(movies)} movies in {(t2b-t2a)*1000:,.0f} ms")
    print()

    for idx, movie in enumerate(movies):
        print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}' +
              f'\nsearch_score(meta):{movie["search_score"]}\nplot: {movie["plot"]}\n')
```
Here is our first query. We want to find movies where the plot is about "humans fighting aliens."
(Run this code block in your Google Colab under Step 8.)
```python
query = "humans fighting aliens"
do_vector_search(query=query)
```
We will see search results like this:
```
query:  humans fighting aliens
Atlas query returned 10 movies in 138 ms

1
id: 573a1398f29313caabce8f83
title: V: The Final Battle,
year: 1984
search_score(meta):0.9573556184768677
plot: A small group of human resistance fighters fight a desperate guerilla war against the genocidal extra-terrestrials who dominate Earth.

2
id: 573a13c7f29313caabd75324
title: Falling Skies,
year: 2011
search_score(meta):0.9550596475601196
plot: Survivors of an alien attack on earth gather together to fight for their lives and fight back.

3
id: 573a139af29313caabcf0cff
title: Starship Troopers,
year: 1997
search_score(meta):0.9523435831069946
plot: Humans in a fascistic, militaristic future do battle with giant alien bugs in a fight for survival.

...
year: 2002
search_score(meta):0.9372057914733887
plot: A young woman from the future forces a local gunman to help her stop an impending alien invasion which will wipe out the human race.
```
Note the score
In addition to movie attributes (title, year, plot, etc.), we are also displaying search_score. This is a meta attribute — not really part of the movies collection but generated as a result of the vector search.
This is a number between 0 and 1. Values closer to 1 represent a better match. The results are sorted from best match down (closer to 1 first). Read more about search score.
Troubleshooting
No search results?
Make sure the vector search index is defined and active (Step 2)!
(Run this code block in your Google Colab under Step 8.)
```python
query = "relationship drama between two good friends"
do_vector_search(query=query)
```
Sample results will look like the following:
```
query:  relationship drama between two good friends
Atlas query returned 10 movies in 71 ms

1
id: 573a13a3f29313caabd0dfe2
title: Dark Blue World,
year: 2001
search_score(meta):0.9380425214767456
plot: The friendship of two men becomes tested when they both fall for the same woman.

2
id: 573a13a3f29313caabd0e14b
title: Dark Blue World,
year: 2001
search_score(meta):0.9380425214767456
plot: The friendship of two men becomes tested when they both fall for the same woman.

3
id: 573a1399f29313caabcec488
title: Once a Thief,
year: 1991
search_score(meta):0.9260045289993286
plot: A romantic and action packed story of three best friends, a group of high end art thieves, who come into trouble when a love-triangle forms between them.

...
year: 1987
search_score(meta):0.9181452989578247
plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
```
There we go! We have successfully performed a vector search combining Atlas and the OpenAI API.
To summarize, in this quick start, we have accomplished the following:
- Set up Atlas in the cloud
- Loaded sample data into our Atlas cluster
- Set up a vector search index
- Performed a vector search using OpenAI embeddings and Atlas
As we can see, vector search is very powerful as it can fetch results based on the semantic meaning of search terms instead of just keyword matching. Vector search allows us to build more powerful applications.
Here are some suggested resources for you to explore: