Vittal Pai


Boost the Accuracy of E-commerce Search Results with Atlas Vector Search

Artificial Intelligence's (AI) growth has led to transformative advancements in the retail industry, including natural language processing, image recognition, and data analysis. These capabilities are pivotal to enhancing the efficiency and accuracy of e-commerce search results. E-commerce, characterized by its vast product catalogs and diverse customer base, generates enormous amounts of data every day. From user preferences and search histories to product reviews and purchase patterns — and add to that images, video, and audio associated with product campaigns and user search — the data is both a goldmine and a challenge. Traditional search mechanisms, which rely on exact keyword matches, are inadequate for handling such nuanced and voluminous data.

This is where vector search comes into play as the perfect data mining tool. As a sophisticated search mechanism, it leverages AI-driven algorithms to understand the intrinsic relationships between data points. This enables it to discern complex patterns, similarities, and contexts that conventional keyword-based searches might overlook. Let's dig deeper into the differences between traditional keyword matching and vector search, and answer questions like: What types of queries does vector search improve in the retail search landscape? What are the challenges associated with it? And how can your business tap into the competitive advantage it represents?

Check out our AI resource page to learn more about building AI-powered apps with MongoDB.

Traditional Keyword Matching vs. Vector Search

Traditional search functionalities for e-commerce platforms — keyword matching, typo tolerance, autocomplete, highlighting, facets, and scoring — are often built in-house or implemented on top of typical search engines like Apache Lucene, Atlas Search, or Elasticsearch, relying heavily on textual metadata descriptions. While this has served the industry well for years, it often falls short of understanding the nuanced needs of modern consumers. For instance, a customer might be looking for a "blue floral summer dress," but if the product description lacks these terms, it might not appear in the search results, even if it perfectly matches the visual description.

Figure 1: As embeddings numerically encode the meaning of documents, semantically close documents will also be geometrically close.

Vector search is a method that finds similar items in a dataset based on their vector representations, and offers a more efficient and accurate way to sift through large datasets. Instead of relying on exact matches, it uses mathematical techniques to measure the similarity between vectors, allowing it to retrieve items that are semantically similar to the user's query, even if the query and the item descriptions don't contain exact keyword matches.

Figure 2: Data flow diagram showcasing how applications, vector embedding algorithms, and search engines work together at a high level.

A great strength of vector search is that any type of data (text, images, or sound) can be encoded and then queried against, creating a much more comprehensive way of improving the relevance of your search results.
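To make this concrete, here is a minimal sketch of what a semantic query like the earlier "blue floral summer dress" example could look like with Atlas Vector Search through PyMongo, assuming the $vectorSearch aggregation stage is available on your cluster. The connection string, namespace, index name, field names, and the get_embedding() helper are hypothetical placeholders; the query vector should come from the same embedding model used to encode the catalog.

    import pymongo

    def get_embedding(text: str) -> list[float]:
        # Placeholder: call the embedding model/API of your choice and return its vector.
        raise NotImplementedError

    client = pymongo.MongoClient("<your Atlas connection string>")  # placeholder URI
    products = client["ecommerce"]["products"]                      # hypothetical namespace

    query_vector = get_embedding("blue floral summer dress")

    results = products.aggregate([
        {
            "$vectorSearch": {
                "index": "product_embedding_index",   # hypothetical vector search index name
                "path": "description_embedding",      # field holding the stored embedding
                "queryVector": query_vector,
                "numCandidates": 200,                 # candidates considered for approximate search
                "limit": 10                           # top results returned
            }
        },
        {"$project": {"name": 1, "price": 1, "score": {"$meta": "vectorSearchScore"}}}
    ])

    for doc in results:
        print(doc)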
Let's explore examples of queries that involve context, intent, and similarity.

Visual similarity queries
Query: "Find lipsticks in shades similar to this coral lipstick."
Vector search benefit: Vector search can recognize the color tone and undertones of the specified lipstick and suggest similar shades from the same or different brands.
Data type: image or text

Contextual queries
Query: "Affordable running shoes for beginners."
Vector search benefit: Vector search can consider both the price range and the context of "beginners," leading to relevant shoe suggestions tailored to the user's experience level and budget.
Data type: text, audio (voice)

Natural language queries
Query: "Show me wireless noise-canceling headphones under $100."
Vector search benefit: Capture intent. Vector search can parse the query's intent to filter headphones with specific features (wireless, noise-canceling) and a price constraint, offering products that precisely match the request.
Data type: text, audio (voice)

Complementary product queries
Query: "Match this dress with elegant heels and a clutch."
Vector search benefit: Vector search can comprehend the user's request to create a coordinated outfit by suggesting shoes and accessories that complement the selected dress.
Data type: text, audio (voice), image

Challenging landscape, flexible stack

Now that we've explored different queries and the data types that can be used in vector embeddings for search, we can see how much more information can be used to deliver accurate results and fuel growth. Let's consider some of the challenges associated with a vector search data workflow and how MongoDB Atlas Vector Search helps bridge the gap between challenges and opportunities.

Data overload

The sheer volume of products and user-generated data can be overwhelming, making it challenging to offer relevant search results. By embedding different types of data inputs such as images, audio (voice), and text queries for later use with vector search, we can simplify this workload. Storing your vector encodings in the same shared operational data layer your applications are built on, and generating search indexes based on those vectors, makes it simple to add context to your application's search functionality.

Using Atlas Vector Search combined with MongoDB App Services, you can reduce operational overhead by creating a trigger that watches for new documents in your collections, automatically calls the embedding API of your preference with each document, and stores the returned embedding in that same document in your collection.

Figure 3: Storing vectors with the data simplifies the overall architecture of your application.
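App Services trigger functions themselves are written in JavaScript; as a rough illustration of the same pattern, the Python sketch below watches a collection with a change stream and writes the embedding back into each newly inserted document. The namespace, field names, and the get_embedding() helper are hypothetical placeholders for whichever embedding API you choose.

    import pymongo

    def get_embedding(text: str) -> list[float]:
        # Placeholder: call the embedding API of your preference and return the vector it produces.
        raise NotImplementedError

    client = pymongo.MongoClient("<your Atlas connection string>")  # placeholder URI
    products = client["ecommerce"]["products"]                      # hypothetical namespace

    # React to newly inserted product documents, mirroring what an App Services
    # database trigger would do on the insert event.
    pipeline = [{"$match": {"operationType": "insert"}}]
    with products.watch(pipeline) as stream:
        for change in stream:
            doc = change["fullDocument"]
            vector = get_embedding(doc.get("description", ""))
            # Store the embedding in the same document so it can be indexed for vector search.
            products.update_one({"_id": doc["_id"]}, {"$set": {"description_embedding": vector}})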
As the number of documents and vectors grows, efficient indexing structures keep search performance reasonable. By simply creating an index on the embedded data field, you can take advantage of optimized retrieval, reduce the computational load, and accelerate performance, especially for nearest-neighbor search tasks, where the goal is to find the items most similar to a given query. Altogether, the combination of MongoDB Atlas Vector Search with App Services and indexing provides a robust and scalable solution for real-time responsiveness. An indexed vector search database can return query results rapidly, making it suitable for applications like recommendation engines or live search interfaces.

Changing consumer behavior

Developing an effective vector search solution involves understanding the nuances of the retail domain. Retailers must consider factors like seasonality, trends, and user behavior to improve the accuracy of search results. To address this, retailers need to be able to adjust their business model by categorizing their product catalogs and user data according to different criteria.

All of this information can be embedded to build more comprehensive relevance criteria, but first it needs to be properly captured and organized. This is where the value of the flexible document model comes into play. The document model allows you to define different fields and attributes for each category of data, which can be used to capture the various categorization criteria. Retailers can also use embedded subdocuments to associate relevant information with products or customers. For instance, you can embed a subdocument containing marketing campaign data, engagement channels, and geographic location within a product to track its performance. As categorization criteria evolve, dynamic schema evolution allows you to add or modify fields without disrupting existing data, easily accommodating changing business needs. Retailers may also use embedded arrays to record purchase history for customers: each array element can represent a transaction, including product details and purchase date, facilitating segmentation based on recency and frequency.

By embedding all these different data types and leveraging the flexible capabilities of the document model, retailers can create a comprehensive and dynamic system that effectively categorizes data according to diverse criteria in a fast and resilient way. This enables personalized search experiences and enhanced customer engagement in the e-commerce space.

Sitting on a goldmine

Every retailer worldwide now realizes that with their customer data, they are sitting on a goldmine. The right enabling technologies allow them to build better experiences for their customers while infusing their applications with automated, data-driven decision-making. Retailers offering more intuitive and contextual search results can ensure their customers find what they're looking for by personalizing the relevance of their search results, enhancing satisfaction, and increasing the likelihood of successful transactions.

The future of e-commerce search lies in harnessing the power of technologies like Atlas Vector Search. It is not just another vector search database but part of the developer data platform, providing an integrated set of data and application services. For retailers, the message is clear: to offer unparalleled shopping experiences, embracing and integrating vector search functionality on a performant, reliable platform that simplifies data organization and storage is not just beneficial, it's essential.

Learn more and discover How to Implement Databricks Workflows and Atlas Vector Search for Enhanced E-commerce Search Accuracy with our developer guide, and check out our GitHub repository explaining the full code for deploying an AI-enhanced e-commerce search solution. Head over to our quick-start guide to get started with Atlas Vector Search today.

October 11, 2023

Fusing MongoDB and Databricks to Deliver AI-Augmented Search

With customers' attention more and more dispersed across channels, platforms, and devices, competition in the retail industry is relentless. The customer's search experience on your storefront is the cornerstone of capitalizing on your Zero Moment of Truth, the point in the buying cycle where the consumer's impression of a brand or product is formed.

Imagine a customer, Sarah, eager to buy a new pair of hiking boots. Instead of wandering aimlessly through pages and pages of search results, she expects to find her ideal pair easily. The smoother her search, the more likely she is to buy. Yet achieving this seamless experience isn't a walk in the park for retailers.

Enter the dynamic duo of MongoDB and Databricks. By equipping their teams with this powerful tech stack, retailers can harness the might of real-time in-app analytics. This not only streamlines the search process but also infuses AI and advanced search functionality into e-commerce applications. The result? An app that not only meets Sarah's current expectations but anticipates her future needs. In this blog, we'll walk through the main reasons to implement an AI-augmented search solution by integrating both platforms. Let's dive in!

Check out our AI resource page to learn more about building AI-powered apps with MongoDB.

A solid foundation for your data model

For an e-commerce site built around the principles of an event-driven and MACH architecture, the data layer needs to ingest and transform data from a number of different sources. Heterogeneous data, such as the product catalog, user behavior on the e-commerce front end, comments and ratings, search keywords, and customer lifecycle segmentation, is all necessary to personalize search results in real time. This increases the need for a flexible model such as MongoDB's documents, and for a platform that can easily take in data from many different sources, from APIs and CSV files to Kafka topics through the MongoDB Kafka Connector.

MongoDB's translytical capabilities, combining transactional (OLTP) and analytical (OLAP) workloads, offer real-time data processing and analysis, enabling you to simplify your workloads while ensuring timely responsiveness and cost-effectiveness.

Now that the data platform is servicing the operational needs of the application, what about adding in AI? Combining MongoDB with Databricks through the MongoDB Spark Connector lets you easily train your models on your operational data from MongoDB and trigger them to run in real time to augment your application while the customer is using it.

Centralization of heterogeneous data in a robust yet flexible Operational Data Layer

The foundation of an effective e-commerce data layer is a solid yet flexible operational data platform. With one in place, orchestrating ML models to run at specific timeframes or in response to different events (enabling crucial data transformation, metadata enrichment, and data featurization) becomes a simple, automated task for optimizing search result pages and delivering a frictionless purchasing process. Check out this blog for a tutorial on achieving near real-time ingestion using the Kafka Connector with MongoDB Atlas, and data processing with Databricks Spark User Defined Functions.
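To give a concrete picture of the MongoDB-to-Databricks handoff described above, here is a minimal PySpark sketch of reading operational data into a Databricks DataFrame with the MongoDB Spark Connector (v10+), ready to feed a training job. The connection string, database, collection, and field names are placeholders.

    from pyspark.sql import SparkSession

    # On Databricks, `spark` already exists; getOrCreate() reuses that session.
    # The MongoDB Spark Connector must be installed as a cluster library.
    spark = SparkSession.builder.getOrCreate()

    user_events = (
        spark.read.format("mongodb")
        .option("connection.uri", "<your Atlas connection string>")  # placeholder
        .option("database", "ecommerce")        # hypothetical database
        .option("collection", "user_events")    # hypothetical collection
        .load()
    )

    user_events.printSchema()
    # Hypothetical fields used as model features downstream.
    training_df = user_events.select("user_id", "product_id", "event_type", "timestamp")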
Adding relevance to your search engine results pages

To achieve optimal product positioning on the search engine results page (SERP) after a user performs a query, retailers are challenged with creating a business score for their products' relevance. This score incorporates various factors such as stock levels, competitor prices, and price elasticity of demand. These business scores are complex real-time analyses calibrated against many factors: a perfect use case for AI. Adding AI-generated relevance to your SERPs lets you accurately predict and display the search results most relevant to users' queries, leading to higher engagement and increased click-through rates, while also helping businesses optimize their content based on the operational context of their markets.

Ingestion into MongoDB Atlas' document-based model lays the groundwork for this challenge. By leveraging the MongoDB Spark Connector's streaming capabilities, companies can persist their data into Databricks and take advantage of its data cleansing and complex data transformation capabilities, making it an ideal framework for delivering batch training and inference models.

Diagram of the full architecture integrating MongoDB Atlas and Databricks for an e-commerce store, real-time analytics, and search.

MongoDB App Services acts as the mortar of the solution, layering in intelligence in an event-driven way, making the system not only real-time but also cost-effective, and keeping both your applications and business processes nimble. Make sure to check out this GitHub repository to understand in depth how this is achieved.

Data freshness

Once the business score can be calculated, the next challenge is delivering it through the search feature of your application. With MongoDB Atlas' native workload isolation, operational data is continuously available on dedicated analytics nodes deployed in the same distributed cluster and exposed to analysts within milliseconds of being stored in the database. Data freshness is not only important for analytics use cases: by combining operational data with the analytics layer, retailers power in-app analytics and build great user experiences across customer touchpoints. With MongoDB Atlas Search's advanced features such as faceted search, autocomplete, and spell correction, retailers can count on a more intuitive and user-friendly search experience, not only for their customers but also for their developers, since bundling all of these capabilities in one platform minimizes operational complexity.

App-driven analytics is a competitive advantage over traditional warehouse analytics.

Additionally, the search functionality is optimized for performance, enabling businesses to handle high search query volumes without compromising the user experience. The business score generated by the AI models trained and deployed with Databricks acts as the discriminator for where each product appears in the SERPs, keeping your search engine fueled with relevance and securing a high-quality user experience.
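As an illustration of that last step, here is a hedged PySpark sketch of writing the computed business score back from Databricks to an Atlas collection with the Spark Connector, where the application's search layer can pick it up. scored_df, the namespace, and the field names are assumptions.

    # scored_df is assumed to be a DataFrame holding a product identifier
    # and the businessScore produced by the model.
    (
        scored_df.select("product_id", "businessScore")
        .write.format("mongodb")
        .option("connection.uri", "<your Atlas connection string>")  # placeholder
        .option("database", "ecommerce")          # hypothetical database
        .option("collection", "product_scores")   # hypothetical collection
        .mode("append")
        .save()
    )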
Conclusion

Search is a key part of the buying process for any customer. Showing customers exactly what they are looking for without making them invest too much time in the browsing stage reduces friction in the buying process, but as we've seen, that is not always easy technically. Empower your teams with the right tech stack to take advantage of real-time in-app analytics with MongoDB and Databricks. It's the simplest way to build AI and search capabilities into your e-commerce app and to respond to current and future market expectations.

Check out the video below and this GitHub repository for all the code needed to integrate MongoDB and Databricks and deliver a real-time machine learning solution for AI-augmented search.

September 19, 2023

How to Seamlessly Use MongoDB Atlas and Databricks Lakehouse Together

In a previous post, we talked briefly about using MongoDB and Databricks together. In this post, we'll cover the different ways to integrate these systems, and why.

Modern business demands expedited decision-making, highly personalized customer experiences, and increased productivity. Analytical solutions need to evolve constantly to meet these changing needs, but legacy systems struggle to consolidate the data necessary to serve them. They silo data across multiple databases and data warehouses, and they slow turnaround due to high maintenance overhead and scaling issues. This performance hit becomes a significant bottleneck as data grows into terabytes and petabytes.

To overcome these challenges, enterprises need a solution that can easily handle high transaction volumes, paired with a scalable data warehouse (increasingly known as a "lakehouse") that serves both traditional business intelligence (BI) and advanced analytics such as serving machine learning (ML) models. In our previous blog post, "Start your journey-operationalize AI enhanced real-time applications: mongodb-databricks," we discussed how MongoDB Atlas and the Databricks Lakehouse Platform complement each other in this context. In this blog post, we will dive deep into the various ways to integrate MongoDB Atlas and Databricks for a complete solution to manage and analyze data for the needs of modern business.

Integration architecture

Databricks Delta Lake is a reliable and secure storage layer for structured and unstructured data that enables efficient batch and streaming operations in the Databricks Lakehouse. It is the foundation of a scalable lakehouse solution for complex analysis. Data from MongoDB Atlas can be moved to Delta Lake in batch or in real time and aggregated with historical data and other data sources to run long-running analytics and complex machine learning pipelines, yielding valuable insights. These insights can then be moved back to MongoDB Atlas so they reach the right audience at the right time to be actioned.

Data from MongoDB Atlas can be moved to Delta Lake in the following ways:
- One-time data load
- Real-time data synchronization

One-time data load

1. Using the Spark Connector

The MongoDB Connector for Apache Spark allows you to use MongoDB as a data source for Apache Spark. You can use the connector to read data from MongoDB and write it to Databricks using the Spark API. To make it even easier, MongoDB and Databricks recently announced the Databricks Notebooks integration, which gives you an even easier and more intuitive interface for writing complex transformation jobs.

1. Log in to your Databricks cluster and click New > Data.
2. Click MongoDB, available under the Native Integrations tab. This loads a PySpark notebook that provides a top-level introduction to using Spark with MongoDB.
3. Follow the instructions in the notebook to learn how to load data from MongoDB into Databricks Delta Lake using Spark.
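The notebook walks through this in detail; as a rough sketch of the same one-time load, the read-and-persist step looks roughly like the following. The connection string, database, collection, and Delta table names are placeholders.

    from pyspark.sql import SparkSession

    # On Databricks this returns the notebook's existing session;
    # the MongoDB Spark Connector must be installed as a cluster library.
    spark = SparkSession.builder.getOrCreate()

    # Snapshot a MongoDB collection into a DataFrame.
    products = (
        spark.read.format("mongodb")
        .option("connection.uri", "<your Atlas connection string>")  # placeholder
        .option("database", "db_name")            # replace with your database
        .option("collection", "collection_name")  # replace with your collection
        .load()
    )

    # Persist the snapshot as a Delta table in the Lakehouse.
    (
        products.write.format("delta")
        .mode("overwrite")
        .saveAsTable("bronze_products")  # hypothetical Delta table name
    )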
2. Using the $out operator and object storage

This approach uses the $out stage in the MongoDB aggregation pipeline to perform a one-time data load into object storage. Once the data is in object storage, it can be configured as the underlying storage for a Delta Lake. To make this work, you need to set up a Federated Database Instance and use MongoDB Atlas Data Federation's $out to S3 to copy MongoDB data and land it in an S3 bucket.

1. Navigate to "Data Federation" on the left-hand side of your Atlas dashboard, then click "Create Federated Database Instance" or "Configure a New Federated Database Instance."
2. Connect your S3 bucket to your Federated Database Instance. This is where the MongoDB data will be written. The setup wizard guides you through this quickly, but you will need access to your AWS credentials.
3. Select an AWS IAM role for Atlas. If you created a role that Atlas is already authorized to read and write to your S3 bucket, select it. If you are authorizing Atlas for an existing role or creating a new role, refer to the documentation for how to do this.
4. Enter the S3 bucket information: enter the name of your S3 bucket and choose Read and write so documents can be written to the bucket.
5. Assign an access policy to your AWS IAM role. Follow the steps in the Atlas user interface. Your role policy for read-only or read and write access should look similar to the following:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:GetObjectVersion",
            "s3:GetBucketLocation"
          ],
          "Resource": [
            <role arn>
          ]
        }
      ]
    }

6. Define the path structure for your files in the S3 bucket and click Next. You have now configured the S3 bucket with Atlas Data Federation.
7. Connect to your MongoDB instance using the MongoDB shell. This command prompts you to enter the password.

    mongosh "mongodb+srv://server.example.mongodb.net" --username username

8. Specify the database and collection that you want to export data from using the following commands. Replace db_name and collection_name with actual values and verify the data exists by running the find command.

    use db_name;
    db.collection_name.find();

9. Use the $out operator to export the data to an S3 bucket. Replace [collection_name], [bucket_name], and [folder_name] with the appropriate values for your S3 bucket and desired destination folder.

    db.[collection_name].aggregate([{ $out: "s3://[bucket_name]/[folder_name]" }])

Note: The $out operator will overwrite any existing data in the specified S3 location, so use a unique destination folder or bucket to avoid unintended data loss.

Real-time data synchronization

Real-time data synchronization needs to happen immediately following the one-time load process. This can be achieved in multiple ways, as shown below.

1. Using Apache Kafka and Delta Live Tables

Streaming data from MongoDB to Databricks using Kafka and a Delta Live Tables pipeline is a powerful way to process large amounts of data in real time. This approach leverages Apache Kafka, a distributed event streaming platform, to receive data from MongoDB and forward it to Databricks in real time. The data can then be processed using Delta Live Tables (DLT), which makes it easy to build and manage reliable batch and streaming data pipelines that deliver high-quality data on the Databricks Lakehouse Platform.

First, set up the MongoDB source connector in your Kafka cluster:

1. Download and install the MongoDB source connector plugin in your Kafka cluster from here.
2. Update the following in the mongodb-source-connector.properties connector configuration file:
   - CONNECTION-STRING - MongoDB cluster connection string
   - DB-NAME - database name
   - COLLECTION-NAME - collection name
   Note: These configurations can be modified based on the use case. Refer to this documentation for more details.
3. Deploy the connector configuration file in your Kafka cluster. This enables real-time data synchronization from MongoDB to the Kafka topic.
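The DLT notebook you import in the next set of steps consumes that Kafka topic and materializes it as a live table. A minimal sketch of that pattern, assuming a standard Kafka source and placeholder broker, credentials, and topic names, could look like this (the `spark` session is provided inside a DLT pipeline):

    import dlt
    from pyspark.sql.functions import col

    KAFKA_BROKER = "<KAFKA_BROKER>"       # placeholder bootstrap server
    TOPIC = "<DB.COLLECTION topic name>"  # placeholder topic

    @dlt.table(name="mongodb_raw_events", comment="Raw MongoDB change events from Kafka")
    def mongodb_raw_events():
        return (
            spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", KAFKA_BROKER)
            .option("subscribe", TOPIC)
            .option("startingOffsets", "earliest")
            # Add the SASL/API-key options required by your Kafka cluster here.
            .load()
            # Kafka delivers the document as bytes in `value`; keep it as a JSON string column.
            .select(col("value").cast("string").alias("document_json"))
        )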
Next, create and run the DLT pipeline in Databricks:

1. Log in to your Databricks cluster and click New > Notebook. In the create notebook dialog, enter a name, select Python as the default language, and choose the Databricks cluster. Then click Create.
2. Obtain the IPython notebook for the DLT pipeline from here.
3. Go to File > Import, navigate to the notebook you downloaded in the previous step, and click Import to add the data streaming notebook to your workspace.
4. Update the following variables in the notebook and save:
   - TOPIC - Kafka topic name (i.e., the DB.COLLECTION name)
   - KAFKA_BROKER - Kafka bootstrap server details
   - API_KEY - Kafka server API key
   - SECRET - Kafka server secret
5. Navigate to the sidebar and select the Workflows option. Within Workflows, choose the Delta Live Tables tab and select Create Pipeline.
6. Give your pipeline a name and select Advanced for the product edition. Choose Continuous for the pipeline mode.
7. Set the cluster policy to None and select the notebook you created under Notebook Libraries. Optionally, enter a storage location for the output data from the pipeline; if you leave the Storage location field blank, the system uses the default location. You can leave the settings in the Compute section at their default values.
8. Click the Create button to create the pipeline, then run it to stream the data from Kafka to a Delta Live Table.

Refer to this documentation to learn more about Delta Live Tables.

2. Using Spark streaming

MongoDB has released a version of the MongoDB Connector for Apache Spark that leverages the new Spark Data Sources API V2 with support for Spark Structured Streaming. The connector enables real-time micro-batch processing of data, letting you synchronize data from MongoDB to Databricks using Spark streaming. This allows you to process data as it is generated, with the help of MongoDB's change data capture (CDC) feature to track all changes. By utilizing Spark streaming, you can make timely and informed decisions based on the most up-to-date information available in Delta Lake. More details about the streaming functionality can be found here.

1. Log in to your Databricks cluster and click New > Notebook. In the create notebook dialog, enter a name, select Python as the default language, and choose the Databricks cluster. Then click Create.
2. Obtain the Spark streaming IPython notebook from here.
3. Go to File > Import, navigate to the notebook you downloaded in the previous step, and click Import to add the data streaming notebook to your workspace.
4. Follow the instructions in the notebook to learn how to stream data from MongoDB to Databricks Delta Lake using the Spark connector for MongoDB.
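The core of that notebook is a structured streaming read from MongoDB written out to Delta. A hedged sketch of the pattern, with placeholder connection details, a simplified schema, and option names that should be checked against the connector documentation for your version, could look like this:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.getOrCreate()  # the notebook's session on Databricks

    # Streaming reads generally need an explicit schema; this one is hypothetical.
    order_schema = StructType([
        StructField("_id", StringType()),
        StructField("status", StringType()),
        StructField("total", DoubleType()),
    ])

    orders_stream = (
        spark.readStream.format("mongodb")
        .schema(order_schema)
        .option("connection.uri", "<your Atlas connection string>")  # placeholder
        .option("database", "ecommerce")   # hypothetical database
        .option("collection", "orders")    # hypothetical collection
        .option("change.stream.publish.full.document.only", "true")  # emit full documents on change
        .load()
    )

    (
        orders_stream.writeStream.format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/orders")  # hypothetical path
        .outputMode("append")
        .start("/delta/orders")                                   # hypothetical Delta path
    )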
3. Using Apache Kafka and object storage

Apache Kafka can also be utilized as a buffer between MongoDB and Databricks. When new data is added to the MongoDB database, it is sent to the message queue using the MongoDB Source Connector for Apache Kafka. This data is then pushed to object storage using sink connectors, such as the Amazon S3 Sink Connector. From there, the data can be transferred to Databricks Delta Lake using the Auto Loader option, which allows for incremental data ingestion. This approach is highly scalable and fault tolerant, as Kafka can process large volumes of data and recover from failures.

1. Download and install the MongoDB source and AWS S3 sink connector plugins in your Kafka cluster:
   https://www.confluent.io/hub/mongodb/kafka-connect-mongodb
   https://www.confluent.io/hub/confluentinc/kafka-connect-s3
2. Update the following in the mongodb-source-connector.properties connector configuration file:
   - CONNECTION-STRING - MongoDB cluster connection string
   - DB-NAME - database name
   - COLLECTION-NAME - collection name
3. Update the following in the s3-sink-connector.properties connector configuration file:
   - TOPIC-NAME - Kafka topic name (i.e., the DB.COLLECTION name)
   - S3-REGION - AWS S3 region name
   - S3-BUCKET-NAME - the AWS S3 bucket where you want to push the data
4. Deploy the connector configuration files in your Kafka cluster. This enables real-time data synchronization from MongoDB to the AWS S3 bucket.
   Note: The connector pushes data to the S3 bucket at a regular interval. These configurations can be modified based on the use case; refer to the MongoDB Source Configuration and AWS S3 Sink Configuration documentation for more details.
5. Load the data from the S3 bucket into Databricks Delta Lake using the Databricks Auto Loader feature. Refer to this documentation for more details.

In conclusion, the integration between MongoDB Atlas and the Databricks Lakehouse Platform offers businesses a complete solution for data management and analysis. The integration architecture between the two platforms is flexible and scalable, ensuring data accuracy and consistency, and all the data you need for analytics lives in one place in the Lakehouse. Whether through a one-time data load or real-time data synchronization, the combination of MongoDB Atlas as an operational data store (ODS) and the Databricks Lakehouse as an enterprise data warehouse/lake (EDL) provides an ideal solution for modern enterprises looking to harness the value of their data. So if you're struggling with the challenges of siloed data, slow decision-making, and outdated development processes, integrating MongoDB Atlas and the Databricks Lakehouse may be the solution you need to take your business to the next level. Please reach out to partners@mongodb.com with any questions.

February 27, 2023