LAUNCHMongoDB 8.3 is built for the sub-100ms retrieval & zero downtime AI demands. Read blog >
AI DATAStop fighting your data layer. Get the memory & retrieval agents need to scale. Read blog >

Similarity Search

Register Free Today

Wouldn't it be great to have a tool that not only finds what you're looking for but also understands the semantics and meaning behind your query? Enter the realm of similarity search, where conventional search engines bow out, and a new era of discovery begins.

In this post, we will explore the concept of similarity search, its importance, and different techniques used to perform similarity search efficiently.

Table of contents

Similarity search is a fundamental concept in data retrieval and analytics that involves finding similar data items based on their features or representations. 

It’s widely used alongside machine learning methods in applications like recommendation systems, image retrieval, and semantic search.

Similarity can be measured using different approaches, such as distance metrics or similarity scores. However, not all similarity measures are valid mathematical distance metrics—for example, correlation and cosine similarity don’t satisfy the triangle inequality. "Distance" and "similarity scores/measures" refer to the quantitative assessment of how alike or dissimilar two data points or queries are in a feature space. Distance is a metric that quantifies the dissimilarity or separation between two data points. It measures the "distance" in the feature space, indicating how far apart the points are from each other. On the other hand, similarity scores or measures directly quantify the degree of similarity between two data points. Here is an example:

  • If you compare two documents using cosine similarity, the result would be a score between 0 and 1. A score of 1 indicates that the documents are identical, while a score closer to 0 suggests dissimilarity. On the other hand, if you're using Euclidean distance, a smaller distance would again imply a higher degree of similarity.

Similarity search aims to locate items with similar characteristics to the query item, enabling data comparison, classification, clustering, and other analysis tasks.

This type of search is crucial in modern applications like image retrieval, natural language processing, and recommendation systems.

The fundamental idea behind similarity search is to represent data items (e.g., images, documents, user profiles) as vectors in a high-dimensional space. For instance, image classification models aim to answer the question, “What type of object is in this image?” These classifications are less useful to answer prompts such as, “What images are similar to this image?”

Similarity search algorithms enable efficient retrieval by representing unstructured data as vectors in a high-dimensional space to locate visually similar images and enhance search accuracy. Unlike image classification, which typically uses supervised learning algorithms to assign labels to images, similarity search involves creating image embeddings. 

Similarity search involves creating embeddings using feature extraction models like convolutional neural networks (CNNs) or modern transformer-based encoders such as CLIP and ViT. Nearest neighbor search techniques are then used to find images with embeddings similar to a given query image. This approach addresses the computational complexity of processing raw image data by working with their lower-dimensional embeddings, making the retrieval process more efficient.

The similarity between vectors is measured using a distance metric, such as cosine similarity or euclidean distance. The goal is to quickly approximate nearest neighbor search to a given query vector.

How does a similarity search work?

Here's a step-by-step explanation of how similarity search works:

1. Vector embeddings generation

Data items are first converted into vectors using a feature extraction or embedding technique.

For example, in order to extract feature vectors or specify anomaly detection, images are represented as vectors using convolutional neural networks, and text documents can be represented as indexed vectors using word embeddings or sentence embeddings.

2. Indexing and querying

The vectors are then indexed in a vector index within the database. Indexing is the process of organizing the vectors in a way that allows for efficient similarity search.

Traditional structures like k-d trees and ball trees work well for low-dimensional, exact searches but don’t scale to modern high-dimensional embeddings. 

For large datasets, approximate nearest neighbor (ANN) algorithms such as HNSW, IVF-PQ, and ScaNN are typically used for efficient vector search.

A k-d tree is a spatial data structure used for organizing points in a k-dimensional space. It recursively partitions the space along axes, creating a tree structure where each node represents a region in the space. 

Ball trees are another spatial data structure designed for organizing points in a metric space. Unlike k-d trees, ball trees use hyperspheres to enclose data points. The tree structure is built by recursively partitioning the dataset into subsets contained within hyperspheres. 

ANN algorithms are a class of algorithms that aim to find an approximate nearest neighbor rather than the exact nearest neighbor. These algorithms are valuable when dealing with large datasets or high-dimensional spaces, where exact searches may be computationally expensive. 

3. Performing search

Given a query vector, the vector search database retrieves the most similar vector space from the indexed dataset. The vector embedding is typically generated using the same feature extraction or embedding technique used to create the indexed vectors.

The similarity between the query vector and the indexed vectors is measured using a distance metric, and the most relevant embeddings are returned based on the semantic meaning and level of search accuracy.

Several types of distance metrics are commonly used in this context, each suitable for different scenarios. Some of the widely employed distance metrics include:

  • Cosine similarity:

    • Measures the cosine of the angle between two vectors. The range is -1 to 1 mathematically, but in practice, most embedding-based systems normalize vectors, so the range is effectively 0 to 1. It is particularly useful when the magnitude of the vectors is not relevant, and the focus is on the direction.

  • Euclidean distance:

    • Euclidean distance measures straight-line distance between two points, but it’s sensitive to vector magnitude. When you only care about direction (not scale), cosine similarity is usually preferred.

  • Manhattan distance (L1 norm):

    • Calculates the sum of absolute differences between the corresponding elements of two vectors. It is often preferred when movement along gridlines is more meaningful than diagonal movement.

  • Minkowski distance:

    • Generalizes both Euclidean and Manhattan distances. The Minkowski distance with a parameter of 2 is equivalent to the Euclidean distance, while with a parameter of 1, it is equivalent to the Manhattan distance.

  • Jaccard similarity:

    • Applied to sets, measures the size of the intersection of sets divided by the size of their union. It is commonly used in text mining and recommendation systems.

  • Hamming distance:

    • Specifically used for binary vectors. Hamming distance calculates the number of positions at which the corresponding bits are different.

  • Correlation distance:

    • Defined as `1 − correlation_coefficient`. A smaller distance indicates a higher correlation (greater similarity) between vectors.

The choice of distance metric depends on the nature of the data and the specific requirements of the similarity search task.

In the context of vector search with MongoDB, the choice of a suitable distance metric is crucial for optimizing search accuracy and efficiency. 

To perform vector searches, MongoDB provides a variety of querying mechanisms, and the choice of distance metrics aligns with the application's needs. 

For instance, cosine similarity can be leveraged using MongoDB's aggregation framework, which enables the computation of cosine similarity scores for efficient retrieval of similar vectors. 

Additionally, MongoDB's geospatial indexing features can be adapted for spatial data, providing a framework for utilizing distance metrics like Euclidean or Manhattan distance in geographic contexts. The versatility of MongoDB, combined with appropriate distance metrics, offers a robust solution for implementing vector search applications, ensuring effective retrieval of semantically relevant information from the indexed dataset.

Similarity search plays a vital role in many real-world applications, such as the following:

Recommendation systems

Similarity search is widely used in recommendation systems and search engines to provide personalized recommendations to users. By identifying items similar to the ones a user has shown interest in, recommendation systems can suggest relevant products, movies, music, or articles based on the level of semantic similarity.

Database search

Similarity search allows for efficient search in large databases, enabling retrieval of relevant information quickly. This is particularly useful when dealing with multimedia databases, where searching based on textual information alone may not be sufficient.

Data mining and pattern recognition

Similarity search is an essential component in data mining and pattern recognition tasks. It helps identify patterns, clusters, and anomalies in large datasets by locating items with similar properties or behavior.

Image and text analysis

In computer vision and natural language processing, similarity search is crucial for image and text analysis tasks such as object recognition, image retrieval including reverse image search, and document clustering based on semantic search.

There are several commonly used distance metrics used for vector similarity search to approximate nearest neighbor algorithms, including:

1.Euclidean distance: This metric of vector similarity search calculates the straight-line distance between two points in n-dimensional space. Being one of the popular distance metrics, it is often used for data points which are continuous and numerical.

Here's an example of how to calculate the Euclidean distance between vector representations in Python for similarity computations in machine learning:

 

2.Manhattan distance: This metric of vector similarity search calculates the distance between two points by considering the absolute differences of their coordinates in each dimension and summing them to efficiently search within a machine learning model. Being one of the popular distance metrics, it's less sensitive to outliers in vector embeddings than the Euclidean distance.

Here's an example of how to calculate the Manhattan distance for vector representations in Python enabling efficient search and exact matching by locating relevant features:

 

3.Cosine similarity: This metric of vector similarity search calculates the similarity between two vectors by considering their angle. It's often used for text data and is resistant to changes in the magnitude of the vectors.
Here's an example of how to calculate the cosine similarity between two vector embeddings in Python:

 

4.Pearson correlation coefficient: This metric calculates the linear correlation between two variables. It considers the relative importance of different features for vector embeddings. 

Here's an example of how to calculate the Pearson correlation coefficient between two variables in Python:

 

5.Jaccard similarity: This metric measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets. It's often used in set operations for vector embeddings.

 

 

Each of these distance metrics has its own strengths and weaknesses, and the choice of which one to use depends on the specific requirements of your task.

Query vector

A query vector is a representation of a query in the form of a vector. It is used in similarity search to find the most similar data point to the query in a vector database. The vector is generated using the same method used to generate the vectors in the vector database, such as a feature extraction or embedding technique.

In the context of information retrieval, a query might be a text string, such as "jealous gossip." This query is transformed into a vector using a method like bag-of-words or Term Frequency-Inverse Document Frequency (TF-IDF), where each dimension corresponds to a unique word in the corpus, and the value in each dimension reflects the importance of that word in the query. TF-IDF is a numerical statistic used to evaluate the importance of a word in a document within a collection by considering both the frequency of the term in the document and its rarity across the entire corpus, assigning higher values to more relevant words.

Once the vector representation is generated, it is used to find the most similar vectors in the database. The similarity between the query and each document in the vector database is calculated using a distance metric, such as cosine similarity or Euclidean distance. The documents corresponding to the most similar vectors are then returned as the search results.

Efficient similarity search refers to the process of finding the most similar vectors to the input data in a quick and resource-efficient manner.

Traditional methods for similarity search, such as brute-force search, can be computationally expensive and slow, especially when dealing with large datasets. Efficient similarity search algorithms aim to balance the trade-off between speed and accuracy for a fixed memory usage.

Here's how a similarity search algorithm would operate:

  1. Pre-processing: Before performing the similarity search, vectors can optionally be compressed or partitioned. Some systems use techniques like Product Quantization (PQ) or Optimized PQ (OPQ) to reduce memory usage and speed up ANN search.
  2. Indexing: Once the data preprocessing is done, the vectors are added to an index. The index can be stored on disk or used immediately, and searches and additions/removals to the index can be interleaved.
  3. Searching: When a vector representation is provided, the algorithm finds the most similar vectors in the index to provide search results. This process involves calculating the distance between the query vector and each vector in the index, and returning the vectors with the smallest distances.

The similarity search is performed on a randomly generated query vector. The most similar vectors to the vector representations are retrieved and printed.

Vector similarity search, also known as vector search or nearest neighbor search, is a technique used to find the most similar vectors to the input data in a high-dimensional space. Vector similarity search is widely used in various fields such as natural language processing, image recognition, recommendation systems, and multimedia data analysis.

The fundamental idea behind vector similarity search is to represent data items (like images, documents, or user profiles) as vectors in a high-dimensional space. Each dimension in this space represents a specific feature or attribute of the data item.

The similarity between vectors is then measured using a distance metric, such as cosine similarity or Euclidean distance. The goal of a vector similarity search is to quickly find the most similar vectors to a given query vector.

One of the popular vector similarity search algorithms is used in recommendation systems, where it is used to find similar items or products in deep learning models based on the user’s preferences or behavior.

For instance, in a content-based recommendation system, vectors representing user preferences are compared with vectors representing items in a database to recommend items that are similar to the user’s interests.

k-NN algorithm

One commonly used variant of vector similarity search is the k-nearest neighbors (k-NN) algorithm.

In the k-NN algorithm, instead of finding just the nearest neighbor, we find the “k” nearest neighbors to the query point. In the k-nearest neighbors algorithm, k nearest neighbors to the query point refer to the k data points in the dataset that are most similar or closest to the given query point based on a specified similarity metric. The output depends on whether k-NN is used for classification or regression:

  • In classification, an object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors.

  • In regression, the output is the property value for the object, which is the average of the values of k nearest neighbors.

A major drawback of k-NN is that to find the nearest vectors for a query, it has to calculate its distance with every vector in the database, which can be very inefficient if the database is large.

To overcome this, approximate nearest neighbor (ANN) algorithms are used. ANN algorithms trade off a small amount of accuracy for a significant speedup compared to exact vector similarity search algorithms. Examples of approximate nearest neighbor search algorithms include Locality-Sensitive Hashing (LSH), and others among ANN algorithms for vector search.

Here’s an example of how to perform an **exact** vector similarity search using the k-NN algorithm in Python with scikit-learn:

 

 

In this example, the NearestNeighbors model is initialized with n_neighbors set to 2, so it will find the two nearest neighbors for each query point. This is a commonly used distance metric for vector search among ANN algorithms.

The model is then fitted to the dataset for vector search, and a vector similarity search is performed on a query point in the search space. The distances and indices of the nearest neighbors are then printed for vector search in deep learning models.

Several techniques have been developed to efficiently perform vector similarity search. The choice of technique for vector search depends on factors such as the dataset size, dimensionality, desired query time, and available computational resources. Here are some commonly used techniques for vector similarity search:

  1. Hierarchical navigable small worlds (HNSW): HNSW is a graph-based algorithm commonly used for scalable, high-dimensional vector search. It organizes vectors in a layered graph structure that allows fast approximate nearest neighbor (ANN) queries with high recall.

  2. k-d tree: The k-d tree is a binary tree structure that partitions the data space into regions containing a fixed number of items. It is suitable for low-dimensional data rather than high-dimensional ones, yet it can suffer from imbalanced splits in high-dimensional spaces in vector embeddings.

  3. Locality sensitive hashing (LSH): LSH is a probabilistic technique that maps similar items into the same bucket with high probability. It is suitable for high-dimensional data and can efficiently handle data analysis tasks using vector similarity search.

  4. Graph-based methods: Graph-based methods represent the dataset as a graph, where each item is a node, and the edges represent the similarity between items. Algorithms such as nearest neighbor search and graph traversal techniques for vector similarity search can then be applied for similarity search in machine learning tasks.

  5. Metric space indexing: Metric space indexing involves building index structures that exploit the properties of a given similarity metric for vector similarity search. Techniques like M-tree, VP-tree, and R-tree can efficiently perform vector similarity search based on various distance metrics for vector embeddings.

  6. Dimensionality reduction: In high-dimensional scenarios, techniques like PCA, UMAP, and autoencoders can reduce vector dimensionality before indexing. This often improves search speed and reduces storage requirements while preserving relative similarity.

Key takeaways

  • Effective vector similarity search stands as a cornerstone in the realm of machine learning applications, playing a pivotal role in ensuring precise and swift data analysis.

  • Tackling obstacles inherent in high-dimensional data and scalability demands the deployment of strategic techniques like dimensionality reduction and indexing structures for vector embeddings.

  • The incorporation of adaptive distance metrics and approximate nearest-neighbor algorithms serves as a potent strategy to elevate both the accuracy and efficiency of the search process.

  • Particularly in the domain of computer vision, the significance of vector similarity search becomes apparent, offering substantial support in tasks such as object detection, image retrieval, recognition, and segmentation.

  • Through these applications, vector similarity search emerges as a catalyst for fostering advanced visual understanding and interpretation within the field.

Conclusion

Using similarity search, you can say goodbye to the mundane and embrace a search experience that's not just about finding but discovering the extraordinary in the ordinary.

Welcome to the future of exploration—where similarity isn't just a concept; it's the key to unlocking a universe of possibilities.

Get started with Atlas today

Get started in seconds. Our free clusters come with 512 MB of storage so you can play around with sample data and get oriented with our platform.
Try FreeContact sales
GET STARTED WITH:
  • 125+ regions worldwide
  • Sample data sets
  • Always-on authentication
  • End-to-end encryption
  • Command line tools