Confessions of a PyMongoArrowholic: Using Atlas Vector Search and PyMongoArrow to Semantically Search Through Luxury Fashion Items

Anaiya Raisinghani • 9 min read • Published Aug 09, 2024 • Updated Aug 09, 2024
AI, Pandas, Python, MongoDB
As a twenty-something-year-old living in New York City, online shopping is my second favorite hobby. What is my first, you ask? Figuring out ways to optimize my shopping addiction so I can spend fewer hours scrolling.
Anyone who fantasizes about luxury (it’s called manifestation, people!) knows about Net-A-Porter and all the incredible pieces the website offers. While my normal approach is sorting by price from low to high, I’d be lying if I said it isn’t incredibly fun to see the full scope of what’s out there. So, let’s use a fun dataset that holds last season's Net-A-Porter items, and semantically search to explore some of the most expensive items, from any brand, any category, and with natural language queries.
In this tutorial, we are going to be using MongoDB Atlas, the PyMongoArrow library, MongoDB Atlas Vector Search, and a luxury fashion dataset from Kaggle.
Before we dive in, let’s first cover some of the important aspects of what will help us achieve our overall outcome.

What is PyMongoArrow?

PyMongoArrow is a Python library for data analysis with MongoDB. Because our dataset is a .csv file, we will read it with the Pandas library, which loads it as a Pandas dataframe. With the pymongoarrow library, we can then export that data to MongoDB Atlas in a handful of easy steps. It’s built on top of pymongo, so it lets us work with MongoDB data in a simple and performant manner. As you work through this tutorial, you’ll see how PyMongoArrow simplifies transferring and configuring your data, a task that has often been tedious for data developers.

What is MongoDB Atlas Vector Search?

MongoDB Atlas Vector Search lets you search semantically through your database while keeping your vector embeddings in the same place as your source data. Searching semantically means searching by meaning: instead of matching exact keywords, we can query and receive results that convey the same idea without the precise wording.
For example, instead of searching through our sample with simple queries such as “dress,” we can use phrases or generalities, like “summer beach tropical” or even just “summer.” We are going to use the $vectorSearch aggregation stage in this tutorial, which makes Atlas Vector Search even simpler to use.
Let’s get started!

Pre-requisites

  1. IDE of your choosing — this tutorial uses Google Colab. Feel free to run the commands directly in the notebook.
  2. A MongoDB Atlas account
  3. A MongoDB Atlas cluster — the free tier works perfectly for this tutorial.
  4. Dataset from Kaggle — please ensure you’re downloading the correct .csv file.
  5. An OpenAI API key — this is how we will be embedding our data prior to uploading it into MongoDB Atlas.
Once your cluster has been created and you’ve downloaded the dataset locally, you’re ready to begin!

Upload our .csv file

Our first step is to upload our .csv file into Google Colab. On the left-hand side of Google Colab, access the "Files" section. Select the downloaded net-a-porter.csv file and upload it.
(Image: Uploaded file in Google Colab)
Once your file is uploaded, we need to do two important things:
  1. We need to use OpenAI to create embeddings on each item in our file.
  2. We need to clean up our dataset and reconfigure it into a format best suited for our end goal, which is to ensure we can use semantic search to find items in our database.

Configure OpenAI

If you take a look at your .csv file, you’ll notice it consists of four columns (brand, description, price_usd, type) and a multitude of rows.
(Image: Our .csv file)
We need to add a column in our dataframe that contains the embeddings for our item descriptions. To embed each item description, we will use the “text-embedding-3-small” embedding model, and this embedding function:
def get_embedding(text):
    embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
    return embedding
We also need to use the pandas library to work with the data we have on hand. Pandas is a Python library used for working with data sets, and it’s super crucial when analyzing, cleaning, exploring, and manipulating data.
To start off, we want to install our dependencies. This means installing openai and importing pandas.
!pip install openai
import pandas as pd
import openai
Now, we need to grab our OpenAI secret key. Make sure to save your key somewhere safe and do not share it anywhere, as it’s very sensitive. In this tutorial, to keep things simple, we are hard-coding our API key, but anywhere outside of a throwaway demo, it’s important to store sensitive values in environment variables or a .env file that stays out of version control.
Copy in your key and the embedding function from above:
# we need our openai secret key
openai.api_key = 'OPENAI-APIKEY'

# we are going to use this embedding model for text
EMBEDDING_MODEL = "text-embedding-3-small"

# this is the embedding function we will use to create our embeddings for the descriptions
def get_embedding(text):
    embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
    return embedding
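If you would rather not hard-code the key, one option is to read it from an environment variable. The following is a minimal sketch under stated assumptions: the variable name `OPENAI_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the original notebook.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises a clear error instead of silently continuing with an empty key.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key


# Hypothetical usage, once the variable is set in your environment:
# openai.api_key = load_api_key()
```

Failing early with a readable message tends to be easier to debug than an authentication error surfacing later in the embedding step.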
We’ve set ourselves up for success with processing our embeddings, so let’s go ahead and configure our dataframe!
First, we want to read in the file that was just uploaded:
# we want to read in our file that we just uploaded
df = pd.read_csv('/content/net-a-porter.csv')
We only want to focus on the first three columns, since we don’t necessarily need the type column, so let’s drop it:
# drop the 'type' column
df.drop(columns=['type'], inplace=True)
It’s important to make sure that the columns we are dealing with are clean and don’t have any null values. This is crucial because null values can mess up our data in the long run, and it’s a good practice to always ensure you’re working with a clean dataset. To do this, use dropna:
# drop rows that have null values in any of the columns we care about: 'brand', 'price_usd', and 'description'
df.dropna(subset=['brand', 'price_usd', 'description'], inplace=True)
To ensure we are not spending a ton of money and time embedding each and every description in our large dataset, let’s slice it down to 100 rows. This will still provide us with an interesting sample size, but it won’t take up too many resources:
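# cut our entire file down to just 100 rows so that we don't spend a million dollars embedding the file
# .copy() gives us an independent dataframe, so adding a column later won't trigger a SettingWithCopyWarning
df = df.head(100).copy()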
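To see what these cleaning steps do before running them on the real file, here is a small sketch on a made-up four-row dataframe. The rows and the name `toy` are invented for illustration; only the column names come from the dataset described above.

```python
import pandas as pd

# Toy stand-in for the Kaggle file; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "brand": ["A", "B", None, "D"],
        "description": ["Silk dress", None, "Wool coat", "Linen shirt"],
        "price_usd": [1200.0, 450.0, 980.0, 300.0],
        "type": ["clothing", "clothing", "clothing", "clothing"],
    }
)

# The same three steps, chained so that each one returns a new dataframe.
cleaned = (
    toy.drop(columns=["type"])
    .dropna(subset=["brand", "price_usd", "description"])
    .head(2)
    .copy()  # explicit copy, so adding columns later does not raise a warning
)

print(cleaned)
```

Rows with a missing brand or description are dropped, and only the first two surviving rows are kept.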
Now, we’re ready to create a new column for where our embeddings will go, and then we can print out our first 20 rows just to ensure we’re on the right track:
# this is creating a new column with the embeddings
df["description_embedding"] = df['description'].apply(get_embedding)

# this is going to give us the first twenty rows from our file
print(df.head(20))
This should be your output, with the new description_embedding column:
(Image: Our data and newly included description_embedding column)
As you can see, we have a dataframe with the columns we need, and specifically, our newly included description_embedding column! Let’s make sure we can save this into our cluster so we can use MongoDB Atlas Vector Search when we’re ready to do so.
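Calling the embedding function through `apply` sends one API request per row. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so for larger samples you might prefer to send descriptions in batches. The sketch below is one possible way to do that, not the tutorial's method: `embed_in_batches`, `embed_batch`, `openai_embed_batch`, and the batch size of 50 are all hypothetical names and values.

```python
def embed_in_batches(texts, embed_batch, batch_size=50):
    """Embed texts in chunks, preserving input order.

    embed_batch is any callable that takes a list of strings and returns
    one vector per string, in the same order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        chunk_vectors = embed_batch(chunk)
        if len(chunk_vectors) != len(chunk):
            raise ValueError("embed_batch returned a different number of vectors than texts")
        vectors.extend(chunk_vectors)
    return vectors


# Hypothetical wiring to OpenAI (not executed here; assumes `openai` and
# EMBEDDING_MODEL are configured as in the earlier cells):
#
# def openai_embed_batch(chunk):
#     response = openai.embeddings.create(input=chunk, model=EMBEDDING_MODEL)
#     # sort by index defensively so vectors line up with the input order
#     return [item.embedding for item in sorted(response.data, key=lambda item: item.index)]
#
# df["description_embedding"] = embed_in_batches(df["description"].tolist(), openai_embed_batch)
```

Passing the embedding call in as a function keeps the batching logic easy to try out with a stand-in before spending any API credits.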

Import data into Atlas using PyMongoArrow

Since pymongoarrow uses Apache Arrow behind the scenes, to move our data into MongoDB Atlas, we need to convert our Pandas dataframe into an Arrow table. The great part of Arrow tables is that they allow for nested columns, so if we had a more complicated dataset, we wouldn’t need to jump through too many hoops to accommodate nesting.
Now that we have all our items and embeddings, let’s use pymongoarrow to import all of our data into MongoDB Atlas. Use a pip command to install pymongo, pymongoarrow, and pyarrow.
!pip install pymongo pymongoarrow pyarrow
Once that succeeds, we can sort our items from most expensive to least expensive (just for fun), and then we can import all our items into our cluster. Please ensure you have your MongoDB connection string on hand so you can connect to your cluster for this step. While we are hard-coding the connection string in this tutorial, please keep in mind that this is not secure; outside of a demo, keep credentials out of your code, for example in environment variables.
Copy the code below to do this:
from pymongo import MongoClient
from pymongoarrow.api import write
import pyarrow as pa

# I want to sort by most expensive item to least expensive
df = df.sort_values(by=['price_usd'], ascending=False)

# this is your connection to your cluster
connection_string = "MONGODB-CONNECTION-STRING"
client = MongoClient(connection_string)

# you can name your database and collection anything you like
database = client['net-a-porter']
collection = database['average_prices_descending']

# in order to save our data, we need to first convert our Pandas DataFrame to an Arrow Table using pyarrow
# preserve_index=False keeps the pandas row index out of the stored documents
arrow_table = pa.Table.from_pandas(df, preserve_index=False)

write(collection, arrow_table)

print("Successful")
print(arrow_table)
Once you run this code block, be sure to double-check in MongoDB Atlas that everything looks as expected. The rows from your .csv file will have been transformed into separate documents, with each column as a new field. Make sure that your new description_embedding field is included as well!
(Image: Our data has been properly imported into MongoDB Atlas)
Now that we have our embedded documents in place, we can set up MongoDB Atlas Vector Search.
Let’s start searching semantically through our newly imported data. We first need to create a Vector Search index. To do this, head into your Atlas account and follow the steps for creating an Atlas Vector Search index.
Once finished, your index definition should look like this:
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "description_embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
The path we are using is description_embedding since we want our Vector Search index to be used against our newly incorporated embedding field. The numDimensions value of 1536 matches the length of the vectors produced by the “text-embedding-3-small” model. For the similarity field, we are choosing “cosine,” but depending on your use case, you can use “euclidean” or “dotProduct” instead.
Keep your “Index Name” as “vector_index,” or change it to something that you’ll remember, but make sure you’ve selected the correct database and collection. If you do change the name, use the same name in the index field of the aggregation pipeline later on. Once you’ve saved your index and it has finished building, you’ll know it’s active when the status looks like this.
(Image: Active Vector Search index)
Keep in mind that your Vector Search index is isolated to MongoDB Atlas. It is not a part of your overall Python script and you should not be running the index in your script.
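If you want to double-check the definition before pasting it into the Atlas JSON editor, a small helper can generate it and catch typos in the similarity name. This is only a sketch: `build_vector_index_definition` and `ALLOWED_SIMILARITIES` are hypothetical names, and the script only prints JSON; it does not create the index.

```python
import json

# Similarity functions accepted by an Atlas Vector Search index definition.
ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def build_vector_index_definition(path, num_dimensions, similarity="cosine"):
    """Return an index definition dict for a single vector field."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"similarity must be one of {sorted(ALLOWED_SIMILARITIES)}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = build_vector_index_definition("description_embedding", 1536)
print(json.dumps(definition, indent=2))
```

A quick sanity check in the notebook is to compare `numDimensions` against the length of one stored embedding, since the two need to match.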
Now, go back to the Google Colab file. To search semantically, we need to embed our queries. This is a very important part: When we are using semantic search, we are not comparing vectors to text — we are comparing vectors to vectors! Do this with these couple of lines:
# my query
query_description = "summer"

# we need to embed the query as well, since our documents are embedded
query_vector = get_embedding(query_description)
Since we’ve already used the embedding model above, it doesn’t take much work to embed our queries as well.
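To make the idea of comparing vectors to vectors concrete, here is a conceptual sketch that ranks a few documents by cosine similarity to a query vector. Everything in it is made up for illustration: the three-dimensional vectors, the descriptions, and the helpers `cosine_similarity` and `top_k`. Real embeddings from the model have 1536 dimensions, and Atlas uses an approximate nearest-neighbor index rather than a full scan like this one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Return the k (score, description) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, doc["embedding"]), doc["description"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors: the first axis loosely stands for "warm weather",
# the last one for "cold weather".
toy_documents = [
    {"description": "linen beach dress", "embedding": [0.9, 0.1, 0.0]},
    {"description": "wool ski jacket", "embedding": [0.0, 0.2, 0.9]},
    {"description": "straw sun hat", "embedding": [0.8, 0.3, 0.1]},
]
toy_query = [1.0, 0.2, 0.0]

for score, description in top_k(toy_query, toy_documents):
    print(f"{score:.3f}  {description}")
```

The query never contains the words "dress" or "hat"; the two warm-weather items rank highest only because their vectors point in a similar direction to the query vector.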
Now, we need to define the aggregation pipeline so that we can semantically search. We can do this using $vectorSearch. The pipeline looks like this:
# write the aggregation pipeline
pipeline = [
    {
        '$vectorSearch': {
            'index': 'vector_index',
            'path': 'description_embedding',
            'queryVector': query_vector,
            # I only had 100 rows saved so that it's easier to use OpenAI
            'numCandidates': 100,
            'limit': 5
        }
    },
    {
        '$project': {
            # I do not need to see the ID, but I do want to see my other columns.
            '_id': 0,
            'brand': 1,
            'description': 1,
            'price_usd': 1,
            'score': {
                '$meta': 'vectorSearchScore'
            }
        }
    },
    {
        '$sort': {
            'price_usd': -1  # sort by most expensive to least expensive
        }
    }
]
As you can see, we have used the $project stage to only show the fields that we want. We’ve also used $vectorSearch to define the index, the path, and our query vector. Double-check that the index name and path match what you configured in Atlas before you proceed. Otherwise, the query may fail or return no results.
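If you plan to try several queries, it can help to wrap the pipeline in a function. The sketch below is one way to do that; `build_vector_search_pipeline` and its parameters are hypothetical names, and the defaults simply mirror the values used above. The check that `numCandidates` is at least `limit` reflects my understanding of the `$vectorSearch` requirements.

```python
def build_vector_search_pipeline(
    query_vector,
    limit=5,
    num_candidates=100,
    index="vector_index",
    path="description_embedding",
    sort_by_price=True,
):
    """Build the aggregation pipeline for a single semantic query."""
    if num_candidates < limit:
        raise ValueError("num_candidates should be greater than or equal to limit")
    pipeline = [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "brand": 1,
                "description": 1,
                "price_usd": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    if sort_by_price:
        # Reorders the matches by price; without this stage the results
        # come back ordered by similarity score instead.
        pipeline.append({"$sort": {"price_usd": -1}})
    return pipeline


# Hypothetical usage with the query vector from the previous step:
# pipeline = build_vector_search_pipeline(query_vector)
```

Setting `sort_by_price=False` is a quick way to see which items the search considers the closest matches, since the price sort otherwise hides the relevance order.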
Once your pipeline has been written, define which database and collection you want it to run on, and then print your results:
# the pipeline is run on this database and collection
database = client['net-a-porter']
collection = database['average_prices_descending']
result = collection.aggregate(pipeline)

for clothing in result:
    print(clothing)
In this tutorial, we used the simple query “summer,” and these are our results:
(Image: Our results with the query “summer”)
It’s interesting here because when I queried on “summer,” items that included summer months showed up, such as the month of August.
Let’s change our query to “winter” and see the results. As you can see, out of our sample, we are pulling up results that are oriented toward colder weather, such as coats, ski jackets, and wool pants.
(Image: Output of our query “winter”)
They are also sorted in descending order from most expensive to least (to dream!) and we can search through the items with limited scrolling. So, if you’re ever in ultimate lounging mode and need a cashmere-hoodie-and-sweatpants ‘fit that’ll set you back almost $1300 (before tax), you know where to look.

Next steps

While this tutorial was done using a flat dataset, once you truly understand the concepts around how to incorporate the platforms and libraries introduced, feel free to create a web scraper and try this same method on live data.

Conclusion

This tutorial gives you a great overview of what is possible with PyMongoArrow and MongoDB Atlas Vector Search. We were able to take a dataset, process it using Pandas, generate necessary embeddings with OpenAI, store our newly developed Arrow table into MongoDB Atlas using PyMongoArrow, and then semantically query on our database.
For more information on PyMongoArrow, please visit the documentation, and for more information on MongoDB Atlas Vector Search, explore the tutorial. If you have questions or want to share your work, join us in the MongoDB Developer Community.