Unlock the Power of Semantic Search With MongoDB Atlas Vector Search

MongoDB Atlas Vector Search empowers developers to build powerful semantic search engines. Semantic search transforms content discovery by understanding the true meaning of a query, rather than just matching keywords. By intelligently matching search intent with relevant content, this AI-driven technology changes how users explore and interact with information.

What is semantic search?

Semantic search is an advanced search methodology that uses the context and intent of user queries to retrieve relevant content. Traditional search techniques compare the keywords in a search query with the keywords that appear in the searchable content. By contrast, semantic search compares the underlying semantic meaning of the query and content to deliver more intelligent results. This ensures a more natural and effective search experience for users.

How does semantic search work?

Semantic search is a two-phase process of data indexing and retrieval, transforming raw content into meaningful, searchable numeric vectors.

Data indexing: Unstructured data is converted into vectors by an embedding model. The vectors are stored in a vector-capable database like MongoDB Atlas, creating a searchable index of content.

Retrieval: A search query is converted into a vector by the same embedding model. The system performs a nearest neighbor search (k-NN) to find the stored content vectors that are most semantically similar to the query vector, returning the most relevant results to the user.
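The two phases can be sketched in plain Python. The hand-made three-dimensional vectors below stand in for real embeddings, and `knn_search` is an illustrative helper; a production system would call an embedding model and store vectors in a database like MongoDB Atlas.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Phase 1 - data indexing: each document is stored with its embedding vector.
index = [
    {"text": "Cozy winter parkas on sale", "vector": [0.9, 0.1, 0.0]},
    {"text": "Fresh summer salads", "vector": [0.0, 0.2, 0.9]},
    {"text": "Insulated gloves for snow", "vector": [0.8, 0.3, 0.1]},
]

# Phase 2 - retrieval: embed the query, then rank stored vectors by similarity.
def knn_search(query_vector, index, k=2):
    scored = [(cosine_similarity(query_vector, doc["vector"]), doc["text"])
              for doc in index]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Pretend embedding of the query "warm winter clothing".
query_vector = [0.85, 0.2, 0.05]
print(knn_search(query_vector, index))  # winter items rank first
```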

What is an embedding model?

An embedding model is a natural language processing (NLP) model that uses machine learning (ML) to convert content into numerical vectors representing the meaning and characteristics of that content. Closely related content will produce similar vectors. For example, a text embedding model recognizes that “eco-friendly” and “sustainable” convey similar meanings and generates closely aligned vectors to represent them.

Embedding models are trained on massive datasets, allowing them to understand complex patterns and relationships in language and content. While many embedding models focus on text, specialized models can also process other input types, including images, audio, and video, making them versatile tools for representing different forms of data.

Domain-specific embedding models can be trained on specialized datasets to better understand industry language and concepts. For example, a medical embedding model would understand clinical terminology, medical procedures, and pharmaceuticals. These domain models can understand nuanced relationships that general-purpose models may miss and will often outperform generic models when applied to relevant use cases.

Common text search techniques

Common text search techniques can be combined to improve search performance and handle variations in queries. However, while using these methods together can create a performant search engine, it still falls short of AI-driven techniques like semantic search.

Keyword matching

Sometimes called “lexical search,” this technique looks for exact word matches between search queries and content. The top results for “New York City” will be the documents with the highest number of total mentions of “New,” “York,” or “City.”

N-gram analysis

N-gram analysis is a technique that also considers sequences of words in the search text. In addition to searching for individual words, the search results will consider combinations of words as well. The top results for “New York City” will provide additional weight to documents that contain “New York,” “York City,” and even more weight to the full text “New York City.”
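Word n-grams are simple to generate; a minimal sketch (the `ngrams` helper name is illustrative):

```python
def ngrams(text, n):
    """Return all contiguous n-word sequences from the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

query = "New York City"
print(ngrams(query, 2))  # bigrams: ['new york', 'york city']
print(ngrams(query, 3))  # trigram: ['new york city']
```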

Term frequency/inverse document frequency (TF/IDF)

Term frequency/inverse document frequency (TF/IDF) is a statistical method that provides additional weight to terms (or sequences of terms) that appear less frequently in the searchable content. The results for “The best restaurants in New York City” would focus more on “restaurants” if the document set is all about New York City tourism. Conversely, if the document set is centered around the top restaurants in the United States, then “New York City” would be heavily weighted in the results. Common words like “the” would receive very low weight because they appear frequently in every document.
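A simplified TF/IDF score over a toy document set illustrates the weighting: a term that appears in every document (such as “new” here) scores zero, while a rarer term like “restaurants” carries real weight.

```python
import math

docs = [
    "the best restaurants in new york city",
    "new york city tourism and the top sights",
    "new york city hotels near the park",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)          # term frequency in this doc
    df = sum(1 for d in docs if term in d.split())  # docs containing the term
    idf = math.log(len(docs) / df)               # rarer terms get higher idf
    return tf * idf

print(tf_idf("restaurants", docs[0], docs))  # positive weight
print(tf_idf("new", docs[0], docs))          # log(3/3) = 0, no weight
```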

Knowledge graph

A knowledge graph is a method that maps relationships between entities to provide contextual search results. Unlike traditional text search techniques that focus on word matches, a knowledge graph creates a network of interconnected information. These graphs can be complex and challenging to build and maintain, requiring significant computational resources and expert knowledge to develop, update, and use effectively.

Stop words

Stop words are common words, such as "the," "and," "in," and "of," that add little relevance to the results and are filtered out entirely.

Stemming

Stemming is a process that reduces words to their root form. For example, "running," "runs," and "ran" would all match documents containing any form of "run."
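A minimal suffix-stripping sketch shows the idea; real engines use full algorithms such as Porter stemming, and irregular forms like "ran" need an exception list rather than a suffix rule.

```python
# Irregular forms cannot be handled by suffix rules alone.
IRREGULAR = {"ran": "run"}

def stem(word):
    """Naive stemmer sketch: exception lookup, then suffix stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ning", "ing", "ed", "s"):
        # Only strip when enough of the word remains to be a plausible root.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("running"), stem("runs"), stem("ran"))  # all reduce to "run"
```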

Normalization

Normalization is a preprocessing step that standardizes text to improve search consistency. Normalization would convert “NYC” into “New York City” to ensure that it is understood correctly.

Fuzzy matching

Fuzzy matching is a method for finding approximate matches to account for misspellings or variations. A search for "New Yrok" would score highly with "New York," while a search for “colour” would recognize “color.”
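Python's standard-library `difflib` can approximate fuzzy matching; the similarity ratio stays high for transpositions and spelling variants, and low for unrelated strings.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1]; higher means a closer character-level match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("New Yrok", "New York"))  # high despite the transposition
print(similarity("colour", "color"))       # high despite the spelling variant
print(similarity("New York", "Chicago"))   # low for unrelated strings
```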

Text search can use various techniques to improve retrieval results, but it ultimately relies on finding content that contains exact matches or slight variations of the search terms. While advanced techniques like TF/IDF can create a more sophisticated search engine, these systems still do not understand meaning.

Semantic search seeks to understand the meaning of the content rather than matching words. It understands that “New York” and “New York City” are highly related but also that they are distinct entities from one another. Additionally, terms like “NYC,” “Manhattan,” or “The Big Apple” are inherently understood as a result of the contextual understanding that it learns from massive training datasets. This nuanced approach delivers more relevant results that better match user intent.

Let’s explore additional scenarios that compare text search to semantic search.

| | Text search | Semantic search |
|---|---|---|
| Word variations | Must explicitly handle "color" vs. "colour" and other variations | Understands that “color” and “colour” are identical concepts |
| Synonyms and related terms | Requires a predefined list to know that “NYC” is the same as “New York City” | Understands related terms from its training data |
| Conceptual relationships | Will not understand that “Brooklyn” is a borough in “New York City” | Understands geographical and conceptual relationships |
| Contextual understanding | "New York minute" and "New York pizza" are treated as similar because they share words | Understands one is an expression about speed while the other is about food |
| Intent matching | "Where to stay in NYC" only matches content with those exact terms | Finds content about hotels, accommodations, rentals, Airbnb, etc. in New York City |

Hybrid search

Hybrid search uses a combination of different search methods to produce results. Text search excels at finding exact matches of query terms, delivering precise results when the specific words are present in a document. Semantic search, in contrast, uses machine learning models to understand the contextual meaning of both the query and the documents, allowing it to retrieve results that are conceptually similar even when the exact query terms are absent. A hybrid engine runs multiple search methods and merges their results into a more comprehensive and accurate ranking.
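One common way to merge the ranked lists from text search and semantic search is reciprocal rank fusion (RRF); a minimal sketch, with hypothetical document IDs:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: each list contributes 1/(k + rank) per document."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents ranked well by both methods float to the top.
    return sorted(scores, key=scores.get, reverse=True)

text_results = ["doc_a", "doc_b", "doc_c"]      # keyword ranking
semantic_results = ["doc_b", "doc_d", "doc_a"]  # vector ranking
print(reciprocal_rank_fusion([text_results, semantic_results]))
```

Here `doc_b` wins the fused ranking because it scores well in both lists, even though neither method ranked it first and second at the same time.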

Real-world use cases and examples where semantic search is making a difference

Let's explore how semantic search can be applied in real-world scenarios. Semantic search isn't just a theoretical concept; it's a practical solution that is powering search engines across various industries to optimize user experiences.

  • E-commerce: A semantic search engine for a clothing company will understand that “gloves” and “parkas” are both examples of “winter clothing.”
  • Healthcare: A search for “heart problems” will return data about cardiac conditions, risk factors, and related problems.
  • Financial services: A search for “saving for college” will surface information about educational savings accounts, investment options, and financial planning tools.
  • Customer support: Questions about “login problems” will return content about password resets, account access, and authentication issues.
Semantic search for retrieval-augmented generation

Retrieval-augmented generation (RAG) is a framework that allows a large language model (LLM) to incorporate external knowledge in its answers. RAG has become a popular technique for building and deploying AI applications because it enables domain customization, cost optimization, and transparency. RAG is a multi-step process that sends related content to an LLM alongside a user’s query in order to provide context-aware responses.

The first step in a RAG solution is information retrieval. The system will search through a knowledge base of content to find the documents that are most related to the query. Commonly, this will use semantic search techniques to ensure the contextual meaning of the query is properly understood.

The next step will send the retrieved content to the LLM alongside the user’s query, enabling the model to consider the additional context while generating its response. This approach allows language models to reference content that may not have been part of their original training data.
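The two steps can be sketched as a toy pipeline. Here `retrieve()` is a naive keyword-overlap stand-in for a real semantic search call, and the knowledge base and prompt template are illustrative only.

```python
# Toy knowledge base; a real RAG system would store embedded documents.
KNOWLEDGE_BASE = {
    "returns": "Items can be returned within 30 days with a receipt.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(query):
    """Step 1 - retrieval: pick the document most related to the query."""
    words = set(query.lower().split())
    scored = sorted(KNOWLEDGE_BASE.items(),
                    key=lambda kv: len(words & set(kv[1].lower().split())),
                    reverse=True)
    return scored[0][1]

def build_prompt(query):
    """Step 2 - augmentation: send retrieved context to the LLM with the query."""
    context = retrieve(query)
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long does shipping take?"))
```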

Semantic search for different content types

Text: Text-based search helps find relevant documents, articles, or customer support answers even when different words are used to describe the same idea. This is useful for chatbots, knowledge bases, and legal or medical document search.

Images: Image search allows users to find similar images. This is used in e-commerce for product recommendations, in social media for content discovery, and in medical imaging to find similar cases.

Audio: Audio search makes it possible to find spoken content in podcasts, music, or recorded conversations. It is used in media archives, music streaming services, and customer service call analysis.

Video: Video search enables finding specific moments within videos based on sounds or visual elements. This is useful for video streaming platforms, surveillance analysis, and educational content retrieval.

Cross-modal search: Certain embedding models can map different modalities like text and images to a shared vector space. This makes it possible to find images using text, discover related videos from an image, or match sounds to relevant content.

Improve your search with MongoDB Atlas Vector Search and Voyage AI

Ready to supercharge your search experience? MongoDB Atlas Vector Search combines with Voyage AI to transform how search engines find information, delivering smarter, more intuitive results that go beyond simple keyword matching. Unlock the power of semantic search and help your users discover exactly what they're looking for, faster and more accurately than ever before.

Leverage your existing expertise in MongoDB Atlas to build a next-generation, relevance-based search engine. Learn more about how to do semantic search in MongoDB using Atlas Vector Search.
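As a starting point, a vector search in Atlas is expressed as a `$vectorSearch` aggregation stage. The sketch below shows the shape of such a pipeline; the index name, field path, and query vector are placeholders for your own deployment.

```python
# Sketch of an Atlas Vector Search aggregation pipeline. "vector_index" and
# "embedding" are assumed names; queryVector would be a real query embedding.
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",         # Atlas Vector Search index name
            "path": "embedding",             # field holding document vectors
            "queryVector": [0.1, 0.2, 0.3],  # embedding of the user's query
            "numCandidates": 100,            # candidates considered by ANN
            "limit": 5,                      # results returned
        }
    },
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
]

# With pymongo, this would run against a collection roughly as:
#   results = db.articles.aggregate(pipeline)
```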

Get Started With MongoDB Atlas