How to Create Vector Embeddings
Vector embeddings represent your data as points in multi-dimensional space. These embeddings capture meaningful relationships in your data and enable tasks such as semantic search and retrieval. You can store vector embeddings along with your other data in Atlas and use Atlas Vector Search to query your vectorized data.
To perform vector search over your embeddings, do the following:
Choose a method to create vector embeddings.
Create vector embeddings from your data and store them in Atlas.
Create a vector embedding that represents your search query and run the query.
Atlas Vector Search returns documents whose vector embeddings are closest in distance to the embedding that represents your query. This indicates that they are similar in meaning.
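As a minimal illustration of distance-based similarity, the following Python sketch compares toy three-dimensional vectors by Euclidean distance. The vectors and names here are hypothetical; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def euclidean_distance(a, b):
    """Returns the Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional "embeddings" (real models produce far more dimensions)
query = [0.9, 0.1, 0.0]
doc_about_ships = [0.8, 0.2, 0.1]   # close to the query
doc_about_lions = [0.1, 0.9, 0.3]   # far from the query

# A smaller distance means the vectors, and the data they
# represent, are more similar in meaning
print(euclidean_distance(query, doc_about_ships))
print(euclidean_distance(query, doc_about_lions))
```

Euclidean distance is one of several similarity measures that Atlas Vector Search supports; the indexes later on this page use it as well.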
Choose a Method to Create Embeddings
To create vector embeddings, you must use an embedding model. To connect to an embedding model and create embeddings for Atlas Vector Search, use one of the following methods:
| Method | Description |
| --- | --- |
| Load an open-source model | If you don't have API keys or credits with an embedding service, use an open-source embedding model by loading it locally from your application. |
| Call an embedding service | Most AI providers offer APIs for their proprietary embedding models that you can use to create vector embeddings. |
| Use an integration | You can integrate Atlas Vector Search with open-source frameworks like LangChain and LlamaIndex, services like Amazon Bedrock, and more. These integrations include built-in libraries and tools to help you quickly connect to open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search. |
Create and Store Embeddings From Data
The following procedure shows how to create vector embeddings and store them in Atlas by using an open-source or proprietary embedding model.
Prerequisites
To run these examples, you must have the following:

- An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later. Ensure that your IP address is included in your Atlas project's access list.
- For the Node.js examples: a terminal and code editor to run your Node.js project, with npm and Node.js installed.
- For the Python examples: an environment to run interactive Python notebooks, such as Colab.
Procedure
Complete the following steps to create vector embeddings from custom data and store them in a collection in Atlas. You can also modify the following code to generate embeddings from your own data of any type.
Note
If you want to create embeddings for an existing collection, you must add a new field that contains the embedding and update each document in the collection.
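As a sketch of that backfill pattern, the following hypothetical Python function (the field names are illustrative, and it assumes a `get_embedding` helper like the ones defined later on this page) adds an `embedding` field to each document that lacks one:

```python
def add_embeddings_to_collection(collection, get_embedding, text_field="text"):
    """Backfills an 'embedding' field for documents that don't have one yet."""
    # Find only the documents that are missing an embedding
    for doc in collection.find({"embedding": {"$exists": False}}):
        embedding = get_embedding(doc[text_field])
        # Store the embedding alongside the document's existing fields
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {"embedding": embedding}},
        )
```

This issues one update per document; for a large collection you would typically batch the updates, for example with PyMongo's `bulk_write`.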
Update your `package.json` file.

Configure your project to use ES modules by adding `"type": "module"` to your `package.json` file and then saving it.

```json
{
  "type": "module",
  // other fields...
}
```
Create vector embeddings and store them in Atlas.
Create a file named `create-embeddings.js` and paste the following code. This code does the following:

- Creates and exports an asynchronous function named `getEmbedding` to generate an embedding for a given input. This function also specifies:
  - The `feature-extraction` task from Hugging Face's Transformers.js library. To learn more, see Tasks.
  - The nomic-embed-text-v1 embedding model.
- Generates embeddings from the sample text data and ingests them into the `sample_db.embeddings` collection in Atlas by using the MongoDB Node.js Driver.

```javascript
// create-embeddings.js
import { MongoClient } from 'mongodb';
import { pipeline } from '@xenova/transformers';

// Function to generate embeddings for a given data source
export async function getEmbedding(data) {
    const embedder = await pipeline(
        'feature-extraction',
        'Xenova/nomic-embed-text-v1');
    const response = await embedder(data, { pooling: 'mean', normalize: true });
    return Array.from(response.data);
}

// Data to embed
const data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

async function run() {
    // Connect to your Atlas cluster
    const uri = "<connectionString>";
    const client = new MongoClient(uri);

    try {
        await client.connect();
        const db = client.db("sample_db");
        const collection = db.collection("embeddings");

        // Ingest data and embeddings into Atlas
        await Promise.all(data.map(async text => {
            const embedding = await getEmbedding(text);

            // Check if the document already exists before inserting
            const existingDoc = await collection.findOne({ text: text });
            if (!existingDoc) {
                await collection.insertOne({
                    text: text,
                    embedding: embedding
                });
                console.log(embedding);
            }
        }));
    } catch (err) {
        console.log(err.stack);
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
```

Replace `<connectionString>` with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

```
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
```

Save the file, then run the following command:

```shell
node create-embeddings.js
```

```
[ -0.04323853924870491, -0.008460805751383305, 0.012494648806750774, -0.013014335185289383, ... ]
[ -0.017400473356246948, 0.04922063276171684, -0.002836339408531785, -0.030395228415727615, ... ]
[ -0.016950927674770355, 0.013881809078156948, -0.022074559703469276, -0.02838018536567688, ... ]
```

Note

The number of dimensions in the output has been truncated for readability.
You can also view your vector embeddings in the Atlas UI by navigating to the `sample_db.embeddings` collection in your cluster.
Update your `package.json` file.

Configure your project to use ES modules by adding `"type": "module"` to your `package.json` file and then saving it.

```json
{
  "type": "module",
  // other fields...
}
```
Create vector embeddings and store them in Atlas.
Create a file named `create-embeddings.js` and paste the following code. This code does the following:

- Creates and exports an asynchronous function named `getEmbedding` that uses OpenAI's `text-embedding-3-small` model to generate an embedding for a given input.
- Generates embeddings from the sample text data and ingests them into the `sample_db.embeddings` collection in Atlas by using the MongoDB Node.js Driver.

```javascript
// create-embeddings.js
import { MongoClient } from 'mongodb';
import OpenAI from 'openai';

// Set up the OpenAI configuration
const openai = new OpenAI({
    apiKey: "<apiKey>",
});

// Function to get the embeddings using the OpenAI API
export async function getEmbedding(text) {
    const response = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: text,
        encoding_format: "float",
    });
    return response.data[0].embedding;
}

// Data to embed
const data = [
    "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
    "The Lion King: Lion cub and future king Simba searches for his identity",
    "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

async function run() {
    // Connect to your Atlas cluster
    const uri = "<connectionString>";
    const client = new MongoClient(uri);

    try {
        await client.connect();
        const db = client.db("sample_db");
        const collection = db.collection("embeddings");

        // Ingest data and embeddings into Atlas
        await Promise.all(data.map(async text => {
            const embedding = await getEmbedding(text);

            // Check if the document already exists before inserting
            const existingDoc = await collection.findOne({ text: text });
            if (!existingDoc) {
                await collection.insertOne({
                    text: text,
                    embedding: embedding
                });
                console.log(embedding);
            }
        }));
    } catch (err) {
        console.log(err.stack);
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
```

Replace the following placeholder values:

- `<apiKey>` with your OpenAI API key.
- `<connectionString>` with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

```
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
```

Save the file, then run the following command:

```shell
node create-embeddings.js
```

```
[ 0.031927742, -0.014192767, -0.021851597, 0.045498233, -0.0077904654, ... ]
[ -0.01664538, 0.013198251, 0.048684783, 0.014485021, -0.018121032, ... ]
[ 0.030449908, 0.046782598, 0.02126599, 0.025799986, -0.015830345, ... ]
```

Note

The number of dimensions in the output has been truncated for readability.
You can also view your vector embeddings in the Atlas UI by navigating to the `sample_db.embeddings` collection in your cluster.
Create a function to generate vector embeddings.
Paste and run the following code in your notebook to create a function that generates vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:
- Loads the nomic-embed-text-v1 embedding model.
- Creates a function named `get_embedding` that uses the model to generate an embedding for a given text input.
- Generates a single embedding for the string `foo`.

```python
from sentence_transformers import SentenceTransformer

# Load the embedding model (https://huggingface.co/nomic-ai/nomic-embed-text-v1)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Define a function to generate embeddings
def get_embedding(data):
   """Generates vector embeddings for the given data."""
   embedding = model.encode(data)
   return embedding.tolist()

# Generate an embedding
get_embedding("foo")
```

```
[-0.029808253049850464, 0.03841473162174225, -0.02561120130121708, -0.06707508116960526, 0.03867151960730553, ... ]
```
Create embeddings and store them in Atlas.
Paste and run the following code in your notebook to connect to your Atlas cluster, generate embeddings from custom data, and ingest them into the `sample_db.embeddings` collection by using the MongoDB PyMongo Driver. Replace `<connection-string>` with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

```
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
```

```python
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]

# Sample data
data = [
   "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
   "The Lion King: Lion cub and future king Simba searches for his identity",
   "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

# Ingest data into Atlas
for text in data:
   embedding = get_embedding(text)
   collection.insert_one({ "text": text, "embedding": embedding })
```
After running the sample code, you can view your vector embeddings in the Atlas UI by navigating to the `sample_db.embeddings` collection in your cluster.
Create a function to generate vector embeddings.
Paste and run the following code in your notebook to create a function that generates vector embeddings by using a proprietary embedding model from OpenAI. Replace `<api-key>` with your OpenAI API key. This code does the following:

- Specifies the `text-embedding-3-small` embedding model.
- Creates a function named `get_embedding` that calls the model's API to generate an embedding for a given text input.
- Generates a single embedding for the string `foo`.

```python
import os
from openai import OpenAI

# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()

# Define a function to generate embeddings
def get_embedding(text):
   """Generates vector embeddings for the given text."""
   embedding = openai_client.embeddings.create(input=[text], model=model).data[0].embedding
   return embedding

# Generate an embedding
get_embedding("foo")
```

```
[-0.005843308754265308, -0.013111298903822899, -0.014585349708795547, 0.03580040484666824, 0.02671629749238491, ... ]
```
Create embeddings and store them in Atlas.
Paste and run the following code in your notebook to connect to your Atlas cluster, generate embeddings from custom data, and ingest them into the `sample_db.embeddings` collection by using the MongoDB PyMongo Driver. Replace `<connection-string>` with your Atlas cluster's SRV connection string.

Note

Your connection string should use the following format:

```
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
```

```python
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_db"]
collection = db["embeddings"]

# Sample data
data = [
   "Titanic: The story of the 1912 sinking of the largest luxury liner ever built",
   "The Lion King: Lion cub and future king Simba searches for his identity",
   "Avatar: A marine is dispatched to the moon Pandora on a unique mission"
]

# Ingest data into Atlas
for text in data:
   embedding = get_embedding(text)
   collection.insert_one({ "text": text, "embedding": embedding })
```
After running the sample code, you can view your vector embeddings in the Atlas UI by navigating to the `sample_db.embeddings` collection in your cluster.
Create Embeddings for Queries
The following procedure shows how to create an embedding for Atlas Vector Search queries by using an open-source or proprietary embedding model.
Procedure
Once you've created embeddings from the sample data, complete the following steps to create an Atlas Vector Search index on your collection, generate an embedding for your search query, and run a vector search query.
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the `sample_db.embeddings` collection that specifies the `embedding` field as the `vector` type, `768` vector dimensions, and `euclidean` as the similarity measure.
Create a file named `create-index.js` and paste the following code.

```javascript
// create-index.js
import { MongoClient } from 'mongodb';

// Connect to your Atlas deployment
const uri = "<connectionString>";
const client = new MongoClient(uri);

async function run() {
    try {
        const database = client.db("sample_db");
        const collection = database.collection("embeddings");

        // Define your Atlas Vector Search index
        const index = {
            name: "vector_index",
            type: "vectorSearch",
            definition: {
                "fields": [
                    {
                        "type": "vector",
                        "numDimensions": 768,
                        "path": "embedding",
                        "similarity": "euclidean"
                    }
                ]
            }
        }

        // Run the helper method
        const result = await collection.createSearchIndex(index);
        console.log(result);
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
```

Replace `<connectionString>` with your Atlas cluster's SRV connection string.

Save the file, then run the following command:

```shell
node create-index.js
```
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named `vector-query.js` and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline. For example, the following code does the following:

- Creates an embedding for the string `ocean tragedy` by calling the embedding function that you defined in the previous example.
- Passes the embedding into the `queryVector` field in your aggregation pipeline.
- Runs the vector search query. It returns semantically similar documents in order of relevance, along with their vector search scores.

```javascript
// vector-query.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './create-embeddings.js';

// MongoDB connection URI and options
const uri = "<connectionString>";
const client = new MongoClient(uri);

async function run() {
    try {
        // Connect to the MongoDB client
        await client.connect();

        // Specify the database and collection
        const database = client.db("sample_db");
        const collection = database.collection("embeddings");

        // Generate embedding for the search query
        const queryEmbedding = await getEmbedding("ocean tragedy");

        // Define the sample vector search pipeline
        const pipeline = [
            {
                $vectorSearch: {
                    index: "vector_index",
                    queryVector: queryEmbedding,
                    path: "embedding",
                    exact: true,
                    limit: 5
                }
            },
            {
                $project: {
                    _id: 0,
                    text: 1,
                    score: {
                        $meta: "vectorSearchScore"
                    }
                }
            }
        ];

        // Run the pipeline
        const result = collection.aggregate(pipeline);

        // Print results
        for await (const doc of result) {
            console.dir(JSON.stringify(doc));
        }
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
```

Replace `<connectionString>` with your Atlas cluster's SRV connection string.

Save the file, then run the following command:

```shell
node vector-query.js
```

```
'{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.5103757977485657}'
'{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4616812467575073}'
'{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.4115804433822632}'
```
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Complete the following steps to create an index on the `sample_db.embeddings` collection that specifies the `embedding` field as the `vector` type, `1536` vector dimensions, and `euclidean` as the similarity measure.
Create a file named `create-index.js` and paste the following code.

```javascript
// create-index.js
import { MongoClient } from 'mongodb';

// Connect to your Atlas deployment
const uri = "<connectionString>";
const client = new MongoClient(uri);

async function run() {
    try {
        const database = client.db("sample_db");
        const collection = database.collection("embeddings");

        // Define your Atlas Vector Search index
        const index = {
            name: "vector_index",
            type: "vectorSearch",
            definition: {
                "fields": [
                    {
                        "type": "vector",
                        "numDimensions": 1536,
                        "path": "embedding",
                        "similarity": "euclidean"
                    }
                ]
            }
        }

        // Run the helper method
        const result = await collection.createSearchIndex(index);
        console.log(result);
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
```

Replace `<connectionString>` with your Atlas cluster's SRV connection string.

Save the file, then run the following command:

```shell
node create-index.js
```
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
Create a file named `vector-query.js` and paste the following code.

To run a vector search query, generate a query vector to pass into your aggregation pipeline. For example, the following code does the following:

- Creates an embedding for the string `ocean tragedy` by calling the embedding function that you defined in the previous example.
- Passes the embedding into the `queryVector` field in your aggregation pipeline.
- Runs the vector search query. It returns semantically similar documents in order of relevance, along with their vector search scores.

```javascript
// vector-query.js
import { MongoClient } from 'mongodb';
import { getEmbedding } from './create-embeddings.js';

// MongoDB connection URI and options
const uri = "<connectionString>";
const client = new MongoClient(uri);

async function run() {
    try {
        // Connect to the MongoDB client
        await client.connect();

        // Specify the database and collection
        const database = client.db("sample_db");
        const collection = database.collection("embeddings");

        // Generate embedding for the search query
        const queryEmbedding = await getEmbedding("ocean tragedy");

        // Define the sample vector search pipeline
        const pipeline = [
            {
                $vectorSearch: {
                    index: "vector_index",
                    queryVector: queryEmbedding,
                    path: "embedding",
                    exact: true,
                    limit: 5
                }
            },
            {
                $project: {
                    _id: 0,
                    text: 1,
                    score: {
                        $meta: "vectorSearchScore"
                    }
                }
            }
        ];

        // Run the pipeline
        const result = collection.aggregate(pipeline);

        // Print results
        for await (const doc of result) {
            console.dir(JSON.stringify(doc));
        }
    } finally {
        await client.close();
    }
}
run().catch(console.dir);
```

Replace `<connectionString>` with your Atlas cluster's SRV connection string.

Save the file, then run the following command:

```shell
node vector-query.js
```

```
'{"text":"Titanic: The story of the 1912 sinking of the largest luxury liner ever built","score":0.4551968574523926}'
'{"text":"Avatar: A marine is dispatched to the moon Pandora on a unique mission","score":0.4050074517726898}'
'{"text":"The Lion King: Lion cub and future king Simba searches for his identity","score":0.3594386577606201}'
```
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Run the following code to create an index on the `sample_db.embeddings` collection that specifies the `embedding` field as the `vector` type, `768` vector dimensions, and `euclidean` as the similarity measure.
```python
from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
   definition={
      "fields": [
         {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 768,
            "similarity": "euclidean"
         }
      ]
   },
   name="vector_index",
   type="vectorSearch",
)
collection.create_search_index(model=search_index_model)
```
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline. For example, the following code does the following:

- Creates an embedding for the string `ocean tragedy` by calling the embedding function that you defined in the previous example.
- Passes the embedding into the `queryVector` field in your aggregation pipeline.
- Runs the vector search query. It returns semantically similar documents in order of relevance, along with their vector search scores.

```python
# Generate embedding for the search query
query_embedding = get_embedding("ocean tragedy")

# Sample vector search pipeline
pipeline = [
   {
      "$vectorSearch": {
         "index": "vector_index",
         "queryVector": query_embedding,
         "path": "embedding",
         "exact": True,
         "limit": 5
      }
   },
   {
      "$project": {
         "_id": 0,
         "text": 1,
         "score": {
            "$meta": "vectorSearchScore"
         }
      }
   }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
   print(i)
```

```
{'text': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.5103757977485657}
{'text': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.4616812467575073}
{'text': 'The Lion King: Lion cub and future king Simba searches for his identity', 'score': 0.4115804433822632}
```
Create the Atlas Vector Search index.
To enable vector search queries on your data, you must create an Atlas Vector Search index on your collection.
Run the following code to create an index on the `sample_db.embeddings` collection that specifies the `embedding` field as the `vector` type, `1536` vector dimensions, and `euclidean` as the similarity measure.
```python
from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
   definition={
      "fields": [
         {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1536,
            "similarity": "euclidean"
         }
      ]
   },
   name="vector_index",
   type="vectorSearch",
)
collection.create_search_index(model=search_index_model)
```
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
To run a vector search query, generate a query vector to pass into your aggregation pipeline. For example, the following code does the following:

- Creates an embedding for the string `ocean tragedy` by calling the embedding function that you defined in the previous example.
- Passes the embedding into the `queryVector` field in your aggregation pipeline.
- Runs the vector search query. It returns semantically similar documents in order of relevance, along with their vector search scores.

```python
# Generate embedding for the search query
query_embedding = get_embedding("ocean tragedy")

# Sample vector search pipeline
pipeline = [
   {
      "$vectorSearch": {
         "index": "vector_index",
         "queryVector": query_embedding,
         "path": "embedding",
         "exact": True,
         "limit": 5
      }
   },
   {
      "$project": {
         "_id": 0,
         "text": 1,
         "score": {
            "$meta": "vectorSearchScore"
         }
      }
   }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
   print(i)
```

```
{'text': 'Titanic: The story of the 1912 sinking of the largest luxury liner ever built', 'score': 0.4551968574523926}
{'text': 'Avatar: A marine is dispatched to the moon Pandora on a unique mission', 'score': 0.4050074517726898}
{'text': 'The Lion King: Lion cub and future king Simba searches for his identity', 'score': 0.3594386577606201}
```
Tip
You can also create embeddings by calling the API endpoint directly. To learn more, see OpenAI API Reference.
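As a minimal sketch, here is one way to call the endpoint with Python's standard library alone. The request shape matches the OpenAI embeddings API (`POST https://api.openai.com/v1/embeddings` with a bearer token); the helper names are illustrative.

```python
import json
import urllib.request

def build_embedding_request(text, api_key, model="text-embedding-3-small"):
    """Builds an HTTP request for the OpenAI embeddings endpoint."""
    return urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=json.dumps({"input": text, "model": model}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

def get_embedding_via_rest(text, api_key):
    """Calls the endpoint and returns the embedding vector from the response."""
    with urllib.request.urlopen(build_embedding_request(text, api_key)) as response:
        body = json.load(response)
    return body["data"][0]["embedding"]
```

In practice the official SDK shown earlier on this page handles retries and errors for you; calling the endpoint directly is mainly useful in environments where you can't install the SDK.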
To learn more about running vector search queries, see Run Vector Search Queries.