How to Use Cohere's Quantized Vectors to Build Cost-effective AI Apps With MongoDB
Richmond Alake, Mihir Patil • 23 min read • Published Oct 02, 2024 • Updated Oct 03, 2024
Gen AI works. So the real work for developers begins.
We’ve entered a phase in the generative AI era where many proof-of-concept projects are getting promoted to production environments. These applications, both internal and external facing, are now being used by hundreds, thousands, and, in some cases, millions of users. As a result, developers are now tasked with implementing operational strategies to optimize both computation and storage for the AI applications they develop.
This tutorial focuses on an operational approach to reducing data storage costs. It presents methods showing how MongoDB can help developers build and maintain cost-effective AI applications optimized for scale in production.
Now that we’ve set the stage, we can introduce the topic of quantization.
Quantization is a technique prominently used in digital signal processing and machine learning. It is leveraged as a method for reducing the storage footprint required for a set of numerical values by creating a representation of the original set that requires less memory and storage.
Within the context of AI and machine learning, the creation of a memory-efficient subset of an original set of values typically involves reducing the precision of the original values. This process converts the numerical values, which are usually stored using 32-bit floating-point numbers, into lower-precision values stored with data types such as 8-bit integers.
This technique allows for a significant reduction in memory storage use while preserving much of the information of the original data. It is particularly useful in scenarios where storage or computational resources are limited and when dealing with large-scale datasets that require the storage of 1,000,000+ vector embeddings. The key takeaway here is that the ability of MongoDB Atlas to support the ingestion of quantized embeddings offers developers greater flexibility, scalability, and storage efficiency.
In this article, we cover the following:
- What are quantization and vector quantization?
- Example of scalar quantization
- Efficient quantized vector ingestion with Cohere and MongoDB
- Comparison of float32, BSON float32, and BSON int8 embeddings in vector search operations
- Practical considerations for choosing the right embedding format for your AI application
So far, we understand that quantization is a technique that reduces the memory storage requirement of numerical values by reducing the precision of the data format used to store the numerical values. In more practical terms, here’s what this means for calculating the memory usage of 1536-dimensional embeddings.
Within an embedding with a dimension size of 1536, there are 1536 individual numerical values, each representing a distinct feature or aspect of the input data. Each numerical value is typically represented by the 32-bit floating-point data format (float32).
Float32 can store very large (3.402823e+38) and small (1.175494e-38) numbers. The general characteristics of the float32 data format are:
1. Each number uses 32 bits (4 bytes) of memory storage.
2. The 32 bits are divided into three parts:
- 1 bit for the sign (positive or negative)
- 8 bits for the exponent
- 23 bits for the fraction (also called mantissa or significand)
3. It provides about seven decimal digits of precision.
For a 1536-dimensional embedding, using float32 for each value means the total memory required is:
- 1536 * 4 bytes = 6,144 bytes ≈ 6 KB per embedding
Let’s calculate this at scale. Consider the memory requirements to store 1,000 and 1,000,000 embeddings:
| Number of Embeddings | Memory Calculation | Total Storage Requirements |
|---|---|---|
| 1,000 | 6 KB * 1,000 = 6,000 KB | ≈ 5.86 MB |
| 1,000,000 | 6 KB * 1,000,000 = 6,000,000 KB | ≈ 5.72 GB |
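A quick, illustrative way to verify these numbers in Python:

```python
# Illustrative arithmetic: raw storage for N embeddings of 1536 float32 values each.
dims = 1536
bytes_per_float32 = 4

for n in (1_000, 1_000_000):
    total_bytes = n * dims * bytes_per_float32
    print(f"{n:>9,} embeddings -> {total_bytes / 1024**2:,.2f} MB ({total_bytes / 1024**3:.2f} GB)")

#     1,000 embeddings -> 5.86 MB (0.01 GB)
# 1,000,000 embeddings -> 5,859.38 MB (5.72 GB)
```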
1,000 embeddings, requiring roughly 5.86 MB of storage, can easily be generated when processing documents such as academic research papers of a few pages, while 1,000,000 embeddings, requiring almost 6 GB, are common in use cases that encode a vast corpus of text, such as document retrieval and product recommendation.
As your AI application scales, storing 1,000,000+ embeddings can become a common requirement, which directly causes memory storage to increase, impacting infrastructure cost, application latency, and performance. This is how important quantization is when operating AI applications at scale.
Vector quantization is a lossy compression algorithm designed to reduce the memory requirements and storage capacity of high-dimensional vector data. It achieves this by mapping the elements of the original vectors to a downscaled set of representative vectors. This process allows for significant data compression while preserving much of the essential information contained in the original vectors. The vector quantization technique is particularly useful in AI applications where some loss of precision is acceptable in exchange for reduced storage needs and lower latency in data retrieval steps.
Let's demonstrate vector quantization by quantizing float32 values to 8-bit integers, i.e., int8. For clarity, we'll use a simplified example with 1D vectors. More specifically, the operation shown below leverages scalar quantization, which involves selecting the min and max values across all indexed vectors for each dimension and producing equally sized bins between them. Mapping each dimension to these bins yields the new quantized values (i.e., float vector values mapped to 255 bins become int8 vector values).
Step 1: Determine the range of values.
Min: 1.23
Max: 6.78
Step 2: Define the quantization scale.
Scale = (Max - Min) / 255 = (6.78 - 1.23) / 255 ≈ 0.0217
Step 3: Quantize to 8-bit integers using the formula below.
quantized = round((original - Min) / Scale)
In Python code, we can implement the steps above into a function `quantize_vector`:
```python
import numpy as np

def quantize_vector(vector, bits=8):
    """Quantize a float32 vector to n-bit integers."""

    # Convert the vector to a numpy array
    vector = np.array(vector, dtype=np.float32)

    # Step 1: Determine the range of values.
    min_val, max_val = vector.min(), vector.max()

    # Step 2: Define the quantization scale.
    scale = (max_val - min_val) / (2**bits - 1)

    # Step 3: Quantize using the formula: quantized = round((original - Min) / Scale)
    quantized = np.round((vector - min_val) / scale).astype(np.uint8)

    return quantized, min_val, scale
```
The function above implements the quantization process by first converting the vector into a numpy array, which is done to leverage numpy's efficient array operations and broadcasting capabilities. The next step finds the minimum and maximum elements in the array. After determining the range of values, the function calculates the quantization scale. This scale represents the step size between each quantized value and is crucial for mapping the continuous float32 values to discrete integer values.
The quantization formula, `quantized = round((original - Min) / Scale)`, is then applied to each element of the vector. The rest of the code for this process, and the result, is below:
```python
def dequantize_vector(quantized, min_val, scale, bits=8):
    """Dequantize the vector back to float32."""
    return (quantized.astype(np.float32) * scale) + min_val

# Original float32 vector
original = np.array([1.23, 4.56, 2.34, 5.67, 3.45, 6.78, 1.89, 4.90], dtype=np.float32)

# Run quantization
quantized, min_val, scale = quantize_vector(original)

# Dequantize (reconstruct)
reconstructed = dequantize_vector(quantized, min_val, scale)

# Print results
print("Original (float32):")
print(original)
print("\nQuantized (uint8):")
print(quantized)
print(f"\nQuantization parameters - Min: {min_val}, Scale: {scale}")
print("\nReconstructed (float32):")
print(reconstructed)
print("\nMean Absolute Error:")
print(np.mean(np.abs(original - reconstructed)))

# Memory usage
print(f"\nOriginal memory usage: {original.nbytes} bytes")
print(f"Quantized memory usage: {quantized.nbytes} bytes")
print(f"Compression ratio: {original.nbytes / quantized.nbytes:.2f}x")
```
The result of the quantization process is shown below:
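A representative run of the snippet produces output along the following lines (reconstructed float values and the mean absolute error depend on floating-point rounding, so the last digits may vary):

```
Quantized (uint8):
[  0 153  51 204 102 255  30 169]

Original memory usage: 32 bytes
Quantized memory usage: 8 bytes
Compression ratio: 4.00x
```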
There are three common types of vector quantization, as shown below.
| Considerations | Scalar Quantization | Binary Quantization | Product Quantization |
|---|---|---|---|
| Data Size | Decreased: Reduces original data size by mapping values to smaller scalars; typically achieves moderate compression | Minimized: Represents each dimension with a single bit, dramatically reducing storage needs; offers maximum compression compared to the other methods | Further decreased than scalar but less than binary: Divides vectors into subvectors and quantizes each separately, resulting in significant space savings compared to scalar methods |
| Accuracy | Slightly reduced: Introduces some loss of precision due to rounding or truncation of values; generally maintains reasonable accuracy for many applications | Least accurate: Reduces each dimension to a binary value, resulting in the most significant information loss, but can still be effective for tasks where rough similarity suffices | Better than scalar: Preserves more of the original vector structure by quantizing subspaces independently; often provides a good balance between compression and accuracy |
| Best Use Case | Balanced storage and accuracy needs: Suitable for applications that need moderate compression without sacrificing too much accuracy | Large-scale data where direction matters most: Excellent for extremely large datasets in applications like semantic search or clustering, where the general direction of vectors is more important than precise magnitudes | Optimizing storage with high accuracy: Ideal for large-scale similarity search, information retrieval, and recommendation systems where efficiency and precision are crucial |
When should AI developers consider vector quantization techniques?
- Large-scale datasets: Consider quantization when dealing with massive datasets containing millions of vector embeddings, such as in document retrieval or product recommendation systems. For instance, storing 1,000,000 embeddings, each with 1536 dimension size and in float32 format, requires approximately 5.72 GB.
- Resource-constrained environments: Consider quantization in situations where storage or computational resources are limited, such as edge devices or applications with strict memory constraints.
- Real-time applications: For AI applications running in real-time that prioritize optimal low-latency performance, quantized vectors can lead to faster data transfer and potentially quicker computations.
- Cost optimization: Use quantization when operational costs for AI applications in production need to be reduced, especially for applications serving hundreds, thousands, or millions of users.
- When precision loss is acceptable: Quantization is appropriate in use cases where a slight loss in precision doesn't significantly impact the overall quality of results. As later sections in this tutorial will demonstrate, quantized int8 embeddings showed negligible differences in similarity scores compared to float32 embeddings.
Now that we have explored the concept of vector quantization, we can proceed to apply this knowledge in a practical context. The next section generates quantized embeddings using the Cohere API and stores them in a MongoDB database. This process will allow us to compare and contrast the benefits of quantized vectors versus their non-quantized counterparts.
This section provides a comprehensive guide to implementing efficient quantized vector ingestion and search using Cohere and MongoDB. We'll walk through the entire process, from generating quantized and non-quantized embeddings with Cohere's API to converting them into BSON format, ingesting them into MongoDB, and performing vector searches on the stored embeddings.
The end result of this section is a fully functional system that demonstrates the practical implementation of quantized vector storage and retrieval. This setup will allow you to compare the performance and accuracy of different embedding types (float32, BSON float32, and BSON int8) in vector search operations.
For the code implementation in this tutorial, the following libraries are utilized:
- `pandas`: A data manipulation library for efficient handling of structured data. It's used for loading, cleaning, transforming, and analyzing data in various formats.
- `cohere`: The official Cohere Python library, providing access to advanced language models, embedding generation, and text generation.
- `pymongo`: The official Python driver for MongoDB, enabling interaction with MongoDB databases for storage and retrieval of data.
```
!pip install --quiet pandas cohere pymongo
```
Remember to obtain an API key from Cohere before executing the code block below. The code snippet below sets the Cohere API key in the local environment.
```python
import os
import cohere
import getpass

COHERE_API_KEY = getpass.getpass("Enter Cohere API Key: ")
os.environ["COHERE_API_KEY"] = COHERE_API_KEY
```
This step involves generating a small structured dataset of sentences with placeholder fields for various embedding types. The dataset is designed to facilitate the demonstration and comparison of different embedding techniques in natural language processing and database storage.
The data is structured as a list of dictionaries, where each dictionary represents a sentence and its associated embedding attributes. Each entry contains a `sentence` key with a factual statement as its value, along with four embedding-related keys: `float32_embedding`, `int8_embedding`, `bson_float32_embedding`, and `bson_int8_embedding`. These keys are initially set to `None`.

```python
# Each record pairs a sentence with placeholder fields for the embeddings generated later.
sentences = [
    {
        "sentence": text,
        "float32_embedding": None,
        "int8_embedding": None,
        "bson_float32_embedding": None,
        "bson_int8_embedding": None,
    }
    for text in [
        "Water boils at 100 degrees Celsius at standard atmospheric pressure.",
        "The Great Wall of China is visible from space.",
        "Photosynthesis converts light energy into chemical energy in plants.",
        "DNA contains the genetic instructions for all living organisms.",
        "Jupiter is the largest planet in our solar system.",
        "The Eiffel Tower was completed in Paris in 1889.",
        "Honey never spoils due to its unique chemical properties.",
        "The human body contains approximately 206 bones in adulthood.",
        "Mount Everest is the highest peak on Earth at 8,848m.",
        "Shakespeare wrote 37 plays and 154 sonnets during his lifetime.",
        "The speed of light in vacuum is 299,792,458 meters/second.",
        "Penguins are flightless birds found primarily in the Southern Hemisphere.",
        "The Mona Lisa was painted by Leonardo da Vinci.",
        "Oxygen makes up about 21% of Earth's atmosphere by volume.",
        "The American Civil War lasted from 1861 to 1865.",
        "Antibiotics are ineffective against viral infections like the common cold.",
        "The human heart beats approximately 100,000 times per day.",
        "Gold is one of the least reactive chemical elements.",
        "The first successful powered aircraft flew in 1903.",
        "There are 118 elements in the periodic table.",
    ]
]
```
The final step is to convert the list of sentence dictionaries into a pandas DataFrame for more efficient data visualization, manipulation, and analysis.
```python
import pandas as pd

# Convert the list of dictionaries into a pandas DataFrame
sentences_df = pd.DataFrame(sentences)

sentences_df.head()
```
This step demonstrates the process of generating embeddings for our sentence dataset using the Cohere API. We will define a custom function, `get_cohere_embeddings`, for efficient embedding generation.

```python
import cohere
from typing import List, Tuple

# Initialize Cohere client
co = cohere.Client(COHERE_API_KEY)

def get_cohere_embeddings(sentences: List[str]) -> Tuple[List[float], List[int]]:
    """
    Generates embeddings for the provided sentences using Cohere's embedding model.

    Args:
        sentences (list of str): List of sentences to generate embeddings for.

    Returns:
        Tuple[List[float], List[int]]: A tuple containing two lists of embeddings (float and int8).
    """
    generated_embedding = co.embed(
        texts=sentences,
        model="embed-english-v3.0",
        input_type="search_document",
        embedding_types=["float", "int8"]
    ).embeddings

    return generated_embedding.float, generated_embedding.int8
```
The code snippet above begins by importing the necessary modules: `cohere` for accessing the Cohere API, and `List` and `Tuple` from the `typing` module for type hinting. We then initialize the Cohere client using the previously set API key. The initialized client object `co` will be used to interact with Cohere's services.

The `get_cohere_embeddings` function takes a list of sentences and returns a tuple containing two lists of embeddings (float and int8 types). These are the two embedding data types we require for this tutorial. Key aspects of the function:
- It uses Cohere's `embed` method to generate embeddings.
- The `embed-english-v3.0` model, one of Cohere's powerful models, is used for embedding generation.
- The `input_type` parameter is set to `search_document`, which specifies the use of Cohere's embedding generation model optimized for search-related tasks. This setting is particularly effective for generating embeddings of text passages that will be used in information retrieval or document search applications.
- Both float and int8 embedding types are requested, allowing flexibility in the embedding data types used in downstream processes (a quick usage check follows this list).
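As an illustrative sanity check (assuming a valid API key), the function can be called on a single sentence to confirm the shape of what it returns:

```python
# Illustrative usage: embed one sentence and inspect the returned embeddings.
float_vecs, int8_vecs = get_cohere_embeddings(["Water boils at 100 degrees Celsius."])

print(len(float_vecs[0]))  # 1024 dimensions for embed-english-v3.0
print(len(int8_vecs[0]))   # same dimensionality, stored as 8-bit integers
print(float_vecs[0][:3], int8_vecs[0][:3])  # peek at the first few values of each
```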
The final operation in this step is to call the `get_cohere_embeddings` function on every data point in the dataset to generate the embeddings.

```python
# Iterate through each row in the DataFrame
for index, row in sentences_df.iterrows():
    # Fetch the embeddings for the sentence
    embeddings = get_cohere_embeddings([row["sentence"]])

    # Assign the embeddings to the corresponding columns in the DataFrame
    sentences_df.at[index, "float32_embedding"] = embeddings[0][0]  # float32 embedding
    sentences_df.at[index, "int8_embedding"] = embeddings[1][0]  # int8 embedding
```
The code snippet above iterates through each row in `sentences_df` using the `iterrows()` method. The key aspects of this operation are iteration, embedding generation, and data assignment.
- Iteration: The `iterrows()` method allows us to process each row individually, accessing both the index and the row's content.
- Embedding generation: For each sentence, we call the `get_cohere_embeddings` function. Note that we pass the sentence as a single-element list to match the function's expected input format.
- Data assignment: We use pandas' `at` accessor to efficiently assign the generated embeddings to their respective columns in the DataFrame. `embeddings[0][0]` refers to the first (and only) float32 embedding, and `embeddings[1][0]` refers to the first (and only) int8 embedding.

After completing the operation above, the final step is to view a segment of the `sentences_df` with the generated embeddings.
```python
sentences_df.head()
```
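As an aside, because `get_cohere_embeddings` already accepts a list of texts, the row-by-row loop above could optionally be replaced with a single batched call, which usually means fewer API requests and faster execution. A minimal sketch, assuming the dataset fits within the provider's per-request batch limit:

```python
# Optional batched alternative to the iterrows() loop above.
float_embeddings, int8_embeddings = get_cohere_embeddings(sentences_df["sentence"].tolist())

sentences_df["float32_embedding"] = float_embeddings
sentences_df["int8_embedding"] = int8_embeddings
```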
This step introduces a crucial data transformation process that, while potentially unfamiliar to some developers, is integral to building high-performance AI applications. When optimizing AI systems for scale and efficiency, certain techniques and data formats become essential to application development. The operations demonstrated here showcase the memory optimization benefits of BSON (Binary JSON) and MongoDB's storage capabilities. By converting our embeddings into BSON format, we can compare the storage efficiency of the BSON data format against the non-BSON format.
```python
from bson.binary import Binary, BinaryVectorDtype
import numpy as np

def generate_bson_vector(array, data_type):
    return Binary.from_vector(array, BinaryVectorDtype(data_type))

# For all data points in sentences_df, generate BSON vectors for the float32 and int8 embeddings
sentences_df["bson_float32_embedding"] = sentences_df["float32_embedding"].apply(
    lambda x: generate_bson_vector(np.array(x, dtype=np.float32), BinaryVectorDtype.FLOAT32)
)
sentences_df["bson_int8_embedding"] = sentences_df["int8_embedding"].apply(
    lambda x: generate_bson_vector(np.array(x, dtype=np.int8), BinaryVectorDtype.INT8)
)
```
The code snippet above defines a function, `generate_bson_vector`, that takes a NumPy array of a vector embedding in either float32 or int8 format and converts it into BSON. Note that BSON (Binary JSON) is MongoDB's binary-encoded serialization of JSON-like documents, and using it offers several benefits, especially in the context of MongoDB:
- Efficiency: BSON is designed for efficient storage and traversal, making it faster to process than JSON in many cases.
- Type preservation: BSON maintains data types, ensuring that numbers, strings, and binary data are stored and retrieved correctly.
- MongoDB compatibility: As MongoDB uses BSON internally, storing data in this format allows for seamless integration and optimal performance.

The `generate_bson_vector` function converts NumPy arrays into BSON binary format using the `Binary.from_vector` method, which is specifically designed for vector data.
The code snippet above also applies this function to each embedding in the DataFrame using pandas' `apply` method. This operation is performed for both the float32 and int8 embeddings and creates new columns with the BSON-formatted data. Benefits of using BSON over other data types include:
- Space efficiency: BSON can be more space-efficient than storing raw arrays, especially for large datasets (a rough per-vector size check is sketched after this list).
- Query performance: MongoDB can perform faster queries on BSON binary data, particularly for vector operations like similarity searches.
- Data integrity: BSON ensures that the binary data is stored and retrieved correctly, maintaining the precision of the original embeddings.
- Indexing support: MongoDB provides special indexing capabilities for BSON binary data, enabling efficient similarity searches on vector embeddings.
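As a rough, illustrative check of the space-efficiency point above (sizes are approximate and assume the 1024-dimensional Cohere embeddings used in this tutorial):

```python
import sys

# Compare the serialized size of a single embedding across representations (approximate).
row = sentences_df.iloc[0]

print(len(row["bson_float32_embedding"]))       # ~4 KB: 1024 float32 values packed as binary plus a small header
print(len(row["bson_int8_embedding"]))          # ~1 KB: 1024 int8 values packed as binary plus a small header
print(sys.getsizeof(row["float32_embedding"]))  # size of the Python list object only; its float elements add more
```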
We will remove the `int8_embedding` column from the dataset, as we now have an optimized BSON version that will be used for downstream processes such as vector search operations.
```python
# Remove the int8_embedding column from the DataFrame
sentences_df = sentences_df.drop(columns=["int8_embedding"])
```
The last operation in this step is to visualize a segment of the dataset.
```python
sentences_df.head()
```
For this step, you will need a MongoDB account to retrieve a connection string to a cluster.
Creating a database and collection within MongoDB is simple with MongoDB Atlas.
- Follow the MongoDB Atlas instructions to deploy your first cluster, selecting the Atlas UI as the deployment procedure.
- Follow MongoDB’s steps to get the connection string from the Atlas UI. Securely store the URI within your development environment after setting up the database and obtaining the Atlas cluster connection URI.
```python
MONGO_URI = getpass.getpass("Enter MongoDB URI: ")
os.environ["MONGO_URI"] = MONGO_URI
```
The next step is to create a simple function, `get_mongo_client`, that connects to our MongoDB cluster and returns an object of the `MongoClient` class. This function establishes a connection to MongoDB using the provided URI and validates the connection with a ping command. If successful, it returns the client object, allowing for further database operations.

```python
import pymongo

def get_mongo_client(mongo_uri):
    """Establish and validate a connection to MongoDB."""

    client = pymongo.MongoClient(mongo_uri, appname="devrel.showcase.quantized_cohere.python")

    # Validate the connection
    ping_result = client.admin.command("ping")
    if ping_result.get("ok") == 1.0:
        # Connection successful
        print("Connection to MongoDB successful")
        return client
    else:
        print("Connection to MongoDB failed")
        return None

if not MONGO_URI:
    print("MONGO_URI not set in environment variables")
```
Next, we will create a database named `quantized` and a collection named `data`.

```python
mongo_client = get_mongo_client(MONGO_URI)

db = mongo_client["quantized"]
col = db["data"]
```
To complete this step, we will ingest the `sentences_df` content into the `data` collection. First, clear out any existing documents so the collection starts empty:

```python
# Delete any existing documents in the collection
col.delete_many({})
```
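The ingestion itself can then be done by converting the DataFrame rows into dictionaries and inserting them as documents. A minimal sketch of this step:

```python
# Convert each DataFrame row into a dictionary and insert the documents into the collection
documents = sentences_df.to_dict("records")
col.insert_many(documents)

print(f"Ingested {len(documents)} documents into the 'data' collection")
```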
After this step, we have a new database in our MongoDB cluster containing a collection with documents ingested from the rows of our `sentences_df` DataFrame. Each document in this collection represents a sentence from our original dataset, along with its corresponding embeddings in float32 format and in both float32 and int8 BSON formats. This setup allows us to compare the storage efficiency and query performance of the standard float32 format against the BSON-encoded versions (float32 and int8).
In this step, we will create a critical component that facilitates vector search operations: a vector index. The `setup_vector_search_index` function creates a vector index, a specialized structure that significantly enhances the performance of similarity searches across embedding vectors.

```python
# Programmatically create a vector search index for the collection
import time
from pymongo.operations import SearchIndexModel

def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
    """
    Set up a vector search index for a MongoDB collection and wait for 30 seconds.

    Args:
        collection: MongoDB collection object
        index_definition: Dictionary containing the index definition
        index_name: Name of the index (default: "vector_index")
    """
    new_vector_search_index_model = SearchIndexModel(
        definition=index_definition,
        name=index_name,
        type="vectorSearch"
    )

    # Create the new index
    try:
        result = collection.create_search_index(model=new_vector_search_index_model)
        print(f"Creating index '{index_name}'...")

        # Sleep for 30 seconds
        print(f"Waiting for 30 seconds to allow index '{index_name}' to be created...")
        time.sleep(30)

        print(f"30-second wait completed for index '{index_name}'.")
        return result

    except Exception as e:
        print(f"Error creating new vector search index '{index_name}': {str(e)}")
        return None
```
The code snippet above utilizes MongoDB's `SearchIndexModel` to define a vector search index. One operation to highlight is the 30-second wait period after index creation. This ensures the index is fully built and optimized, preventing potential race conditions in subsequent query operations.

The next step is to initialize the variable `vector_search_index_definition`, which holds the actual vector index definition. The index is configured for multiple vector fields, each corresponding to a different embedding format (float32, BSON float32, and BSON int8). By setting up this index, we enable efficient cosine similarity searches across our 1024-dimensional embedding space; 1024 is the output dimensionality of embeddings generated by Cohere's `embed-english-v3.0` model.

```python
dimensions = 1024  # Output dimensions for vectors generated by Cohere's embed-english-v3.0 embedding model

vector_search_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "float32_embedding",
            "numDimensions": dimensions,
            "similarity": "cosine"
        },
        {
            "type": "vector",
            "path": "bson_float32_embedding",
            "numDimensions": dimensions,
            "similarity": "cosine"
        },
        {
            "type": "vector",
            "path": "bson_int8_embedding",
            "numDimensions": dimensions,
            "similarity": "cosine"
        }
    ]
}
```
The final operation in this step is to create the vector index using the definition in `vector_search_index_definition` and the `setup_vector_search_index` function.

```python
setup_vector_search_index(col, vector_search_index_definition, index_name="vector_index")
```
Below is an example of the output of the operation above after execution.
```
Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
vector_index
```
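As an optional refinement, instead of relying on a fixed 30-second sleep, you can poll the index status until Atlas reports it as queryable. The sketch below assumes a recent PyMongo version that exposes `list_search_indexes`:

```python
import time

def wait_for_index(collection, index_name="vector_index", timeout=120):
    """Poll until the Atlas Search index reports itself as queryable (illustrative)."""
    start = time.time()
    while time.time() - start < timeout:
        index_info = list(collection.list_search_indexes(index_name))
        if index_info and index_info[0].get("queryable"):
            print(f"Index '{index_name}' is ready to query.")
            return True
        time.sleep(5)
    print(f"Timed out waiting for index '{index_name}'.")
    return False
```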
In this step, we'll define the `vector_search` function that leverages MongoDB's powerful aggregation framework to perform efficient vector search operations. This function is the intermediary layer between user queries and our database of sentence embeddings, enabling semantic search or RAG capabilities in your AI applications.

The `vector_search` function accommodates different types of embeddings and search strategies. It takes a user's input query, typically a sentence from the application layer, and transforms it into a vector representation using our previously defined `get_cohere_embeddings` function. This embedding, stored in the `query_embedding` variable, becomes the basis for our similarity search.

A key feature of this function is its ability to handle both quantized and non-quantized embeddings. By incorporating a check for the embedding type, we can dynamically adjust our search parameters to use either the space-efficient quantized (int8) embeddings or the higher-precision non-quantized (float32) embeddings.
By leveraging MongoDB's aggregation pipeline, we'll construct a series of stages that process our data, perform the vector similarity search, and return the most relevant results.
```python
def vector_search(user_query, collection, quantized=False, use_bson=True):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
        user_query (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.

    Returns:
        list: A list of matching documents.
    """

    path = None
    query_embedding = None

    # Generate embedding for the user query
    query_embedding = get_cohere_embeddings([user_query])

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    if quantized:
        query_embedding = query_embedding[1][0]
        path = "int8_embedding"
    else:
        query_embedding = query_embedding[0][0]
        path = "float32_embedding"

    # Adjust the path and convert the vector to a BSON vector if use_bson is True
    if use_bson:
        path = "bson_" + path
        if quantized:
            query_embedding = generate_bson_vector(query_embedding, BinaryVectorDtype.INT8)
        else:
            query_embedding = generate_bson_vector(query_embedding, BinaryVectorDtype.FLOAT32)

    # Define the vector search pipeline
    vector_search_stage = {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": path,
            "numCandidates": 150,  # Number of candidate matches to consider
            "limit": 5  # Return top 5 matches
        }
    }

    project_stage = {
        "$project": {
            "_id": 0,  # Exclude the _id field
            "sentence": 1,  # Include the sentence field
            "similarityScore": {
                "$meta": "vectorSearchScore"  # Include the search score
            }
        }
    }

    pipeline = [vector_search_stage, project_stage]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
```
The key operations of the code snippet above and the `vector_search` function are as follows:
- Embedding generation: The `get_cohere_embeddings` function converts the user's query into a vector representation.
- Embedding type selection: Based on the `quantized` flag, it chooses between float32 and int8 embeddings, adjusting the search path accordingly.
- BSON conversion: If `use_bson` is True, the embedding is converted to BSON format using the `generate_bson_vector` function, optimizing MongoDB storage and retrieval.
- Search pipeline construction: The function builds a MongoDB aggregation pipeline with two stages:
  - `$vectorSearch`: This utilizes the pre-created vector index for efficient similarity search. The operator takes the query embedding, the path to the embedding field in the collection, the number of candidate matches to consider, and the number of documents to return from the vector search stage.
  - `$project`: This shapes the output, including the original sentence and a similarity score.
In this step, we will use the `vector_search` function across different embedding configurations. This comparison allows us to evaluate the performance and accuracy of various vector representations in semantic search tasks.

```python
results_bson_float = vector_search("what is the speed of light?", col, quantized=False, use_bson=True)
results_bson_int8 = vector_search("what is the speed of light?", col, quantized=True, use_bson=True)
results_float = vector_search("what is the speed of light?", col, quantized=False, use_bson=False)
```
In the code snippet above, we execute three distinct vector searches using the query "what is the speed of light?", each targeting a different embedding format:
- `results_bson_float`: Uses BSON-encoded float32 embeddings, combining the precision of floating-point numbers with the efficiency of BSON storage.
- `results_bson_int8`: Uses BSON-encoded quantized (int8) embeddings, offering a more compact representation that may trade some precision for improved storage and query efficiency.
- `results_float`: Uses standard float32 embeddings without BSON encoding, serving as a baseline for comparison.
These results will provide insights into how different embedding representations affect semantic search outcomes, helping you choose the most suitable approach for your specific use cases; typical factors to consider are storage optimization, search speed, and result precision.
```python
import pprint

print("Float32 Results")
pprint.pprint(results_float)
print("-----------------")
print()
print("BSON Float32 Results")
pprint.pprint(results_bson_float)
print("-----------------")
print()
print("BSON Int8 Results")
pprint.pprint(results_bson_int8)
print("-----------------")
```
The code snippet above will provide an output that displays the results of our vector searches in a more readable format. The output is organized into three sections, each corresponding to a different embedding representation:
- Float32 results: These are the outcomes from using standard float32 embeddings without BSON encoding.
- BSON Float32 results: This shows the results of using BSON-encoded float32 embeddings.
- BSON Int8 results: These are the results from using BSON-encoded quantized (int8) embeddings.
By presenting the results side-by-side, we can easily compare the performance and accuracy of each embedding type. This comparison is crucial for several reasons:
- It allows for a direct assessment of how BSON encoding affects search results compared to standard float32 representations.
- It showcases the impact of quantization (int8) on search accuracy and, potentially, on performance.
- It provides insights into the trade-offs between precision (float32) and efficiency (int8) in vector search operations.
The output of the previous code snippet displays the top matches and their similarity scores for each embedding format.
Based on these results, we can draw several important conclusions about the benefits of quantization and BSON encoding in vector search operations:
We will start with an important consideration: accuracy preservation. The results demonstrate that MongoDB's native support for vectors in BSON, combined with quantization, maintains a high degree of accuracy in semantic search. The top result for all three methods is identical, correctly identifying the sentence about the speed of light, with very similar similarity scores (0.8345 for float32 and BSON float32, 0.8344 for BSON int8).
Another factor to consider is the consistency across formats. The order of results and the similarity scores are remarkably consistent across all three formats. This suggests that the use of BSON encoding and even quantization to int8 does not significantly compromise the quality of search results. Based on this, let’s recap the benefits of BSON and quantized vectors for AI applications.
1. Benefits of native vector BSON type:
- Storage efficiency: While not directly visible in the results from this tutorial, BSON encoding typically offers more compact storage compared to raw float32 values, potentially reducing database size. Follow the advanced notebook to get an overview of BSON's storage efficiency.
- Query performance: BSON's binary format can lead to faster serialization/deserialization, potentially improving query speeds in large-scale applications.
2. Advantages of quantization (int8):
- Space efficiency: Int8 representation requires significantly less storage space compared to float32, which can be crucial for large datasets.
- Potential performance gains: Reduced data size can lead to faster data transfer and potentially quicker computations, which is especially beneficial for resource-constrained environments.
3. Minimal precision loss:
The slight variations in similarity scores for the BSON int8 results (e.g., 0.8344 vs 0.8345 for the top result) indicate that the precision loss from quantization is minimal and does not affect the overall ranking of results.
The results of this tutorial demonstrate that you can leverage the benefits of both BSON encoding and quantization without significant compromise to search accuracy.
In an internal experiment, we compared the storage efficiency of different representations for vector embeddings: arrays of floats and BSON vectors, both at float32 precision. The test dataset consisted of three million documents, each containing embeddings generated by the cohere-embed-multilingual-v3.0 model with 1024 dimensions, as in this tutorial. The results were as follows:
| 3M docs, cohere-embed-multilingual-v3.0 @ 1024d | MongoDB Storage Size (GB) |
|---|---|
| Array of floats | 41 |
| BSON vectors (float32) | 14 |
This comparison demonstrates a significant reduction in storage requirements when using BSON vectors with float32 precision. The BSON representation required only about 34% of the storage space needed for the array of floats representation, resulting in approximately 66% space savings. This substantial difference in storage efficiency has important implications for large-scale vector databases.
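As a back-of-envelope check on these figures, the raw packed float32 data alone comes out close to the BSON number, while a BSON array of doubles carries per-element overhead (each value is stored as an 8-byte double plus type and key bytes), which helps explain the much larger array-of-floats figure:

```python
# Rough, illustrative estimate of raw packed float32 storage for 3M x 1024-dimension vectors.
docs, dims, bytes_per_float32 = 3_000_000, 1024, 4
raw_gb = docs * dims * bytes_per_float32 / 1024**3
print(f"Raw float32 data: ~{raw_gb:.1f} GB")  # ~11.4 GB, in the same ballpark as the 14 GB BSON figure
```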
BSON encoding offers improved storage and potential performance benefits, while further quantization to int8 from float32 embeddings, as in this tutorial, can provide additional space savings with negligible impact on result quality.
The choice between these formats (array of floats, BSON vectors, or quantized representations) should be based on specific application needs, balancing storage constraints, processing speed requirements, and the level of precision required for the particular use case.
A key takeaway is that MongoDB Atlas enhances your AI stack by providing robust indexing support for quantized embedding model outputs from providers such as Cohere. This capability significantly improves the scalability of AI workloads, particularly those involving vector search, allowing them to handle substantially higher volumes. The smaller memory footprint of quantized embeddings results in more efficient storage utilization and cost-effectiveness. These features make MongoDB Atlas an excellent choice for organizations looking to optimize their AI applications, particularly those involving large-scale vector search operations.
This tutorial covers an introductory approach to evaluating the benefits of quantized embeddings. For a more detailed and advanced exploration of the memory optimization benefits of quantized vectors and BSON data format, check out the GitHub notebook.
1. What is vector quantization, and how does it improve performance in AI applications such as RAG pipelines?
Vector quantization is a lossy compression technique that reduces the memory and storage requirements of high-dimensional vector data. It maps the original vectors to a smaller set of representative values, allowing for significant data compression while preserving much of the essential information. In AI applications, this improves performance by:
- Reducing storage requirements, especially for large-scale datasets.
- Decreasing data transfer times due to smaller data sizes.
- Potentially speeding up computations and queries, particularly beneficial in resource-constrained environments.
2. What are the trade-offs between float32 and int8 embeddings in vector search applications?
The main trade-offs between float32 and int8 embeddings are:
Float32:
- Higher precision
- Larger storage requirements
- Potentially slower query performance due to larger data size
Int8 (quantized):
- Significantly reduced storage space
- Faster data transfer and potentially quicker computations
- Slight loss in precision, but often negligible in practice