MongoDB


MongoDB AI Applications Program Partner Spotlight: Cohere Brings Leading AI Foundation Models to the Enterprise

Today, Cohere, a leading enterprise AI platform, will join MongoDB’s new AI Applications Program (MAAP) as part of its first cohort of partners. The MAAP program is designed to help organizations rapidly build and deploy modern generative AI applications at enterprise scale. Enterprises will be able to use MAAP to bring Cohere’s industry-leading AI technology, such as its highly performant and scalable Command R series of generative models, into their businesses more easily and quickly.
Cohere's enterprise AI suite supports end-to-end retrieval augmented generation (RAG), which has become a foundational building block for enterprises adopting large language models (LLMs) and customizing them with their own proprietary data. Cohere’s Command R model series is optimized for business-critical capabilities like advanced RAG with citations to mitigate hallucinations, along with tools used to automate complex business processes. It also offers multilingual coverage in 10 key languages to support global business operations. These models are highly scalable, balancing high efficiency with strong accuracy for customers. Cohere’s best-in-class embed models complement its R Series generative models, offering enhanced enterprise search capabilities in 100+ languages to support powerful RAG applications.
Using Cohere’s technology with MAAP will help companies overcome many of the obstacles that they face when bringing generative AI into their everyday operations. Enterprises can now integrate Cohere’s state-of-the-art LLMs to move into large-scale production with AI and address real-world business challenges. MAAP provides a strategic framework that draws on MongoDB’s industry expertise, strategic roadmaps, and technology to design AI solutions that can meaningfully improve workforce productivity and deliver new types of application experiences to end users.
“Organizations of all sizes across industries are eager to get started with applications enriched with generative AI capabilities but many are unsure how to get started effectively,” said Alan Chhabra, EVP of Worldwide Partners at MongoDB. “The MongoDB AI Applications Program helps address this challenge, and we’re excited to have Cohere as a launch partner for the program. With Cohere’s leading embedding models, support for more than 100 languages, and its Command R foundation models optimized for retrieval augmented generation using an organization’s proprietary data, customers can more easily help improve the accuracy and trustworthiness of outputs from AI-powered applications.”
“MongoDB’s unique position in the market allows them to work with companies as they evaluate their current technology stack, and identify the best opportunities to use Cohere’s industry-leading Command and Embed LLMs to drive efficiency at scale,” said Vinod Devan, Cohere’s Global Head of Partnerships. “MAAP is an incredible opportunity for companies to work with a trusted partner as they look to meaningfully ramp up their use of Cohere’s enterprise-grade AI solutions to deliver real business value.”
We look forward to building on this partnership to deliver impactful AI solutions for businesses globally. Cohere works with all major cloud providers as well as on-prem for regulated industries and privacy-sensitive use cases, to make its models universally available for customers wherever their data resides. MongoDB and Cohere will work together to be a trusted AI partner for enterprises and build cutting-edge applications with data privacy and security in mind for companies that need highly secure solutions for sensitive proprietary data.
Learn more about the MongoDB AI Applications Program on the program website.

May 1, 2024

MongoDB at AWS re:Invent 2020

While 2020 has been a challenging year, it has also given rise to new levels of innovative collaboration and agile thinking. Where better to experience both than at AWS re:Invent 2020? At MongoDB, we’re excited to partner with AWS on this free, 3-week virtual event, providing unlimited access to hundreds of sessions led by cloud experts.
Although we’ll miss the grand, buzzing halls of the Venetian Hotel and the celebratory sounds of slot machines this year, it’s still important to approach AWS re:Invent with a focused plan. Think of this year’s event as an opportunity to curate your own perfectly tailored experience. Check out this page for details of our fresh new lineup of deep-dives, targeted jam sessions and, of course, the annual MongoDB late-night party. Here are some of the highlights.
AWS Jam: "Excel isn't a database!"
Imagine this: It's your first week in a new job, and the VP of sales has already given you an important data task. The good news? From the start of the year, all your current sales data has been stored in MongoDB Atlas, allowing operational and analytical workloads to run on the live data set. The not-so-good news? That wasn't always the case. For years before they switched, their database (well, “database”) of choice was… Excel. Fortunately someone took the initiative to export that data in CSV format and store it in S3, but now the sales team needs your help to analyze that data, and they need it fast.
In our “Excel isn’t a database!” Jam Session, you’ll test and upgrade your skills by connecting MongoDB Atlas Data Lake to CSV data that’s been languishing in an S3 bucket. Then you’ll run an aggregation to complete the challenge and claim points. Game on! This jam session will be available on-demand for the duration of AWS re:Invent.
Databases & S3: Auto-archiving Breakout Session
Databases are built for fast access, but this can also make them resource-intensive.
As data grows, you may want to optimize performance (or cost) by migrating old or infrequently used data into cheap object storage. But this presents its own problems: automating the archival process, ensuring data consistency during failures, and either querying two data stores separately or building a query federation system. In this talk, you’ll learn how we approached these problems while building the Online Archive and Federated Query features into MongoDB Atlas, the lessons we learned from the experience, and how you can do the same.
MongoDB Late Nite
That’s right: it’s a party! In the spirit of Vegas, MongoDB will be hosting an interactive late-night bash complete with throw-back entertainment at our virtual after-hours event. Like Vegas, there’s something for everyone. Unlike Vegas, the odds are actually on your side.
Get your adrenaline going and dial in for exclusive swag at our Home Shopping Network. Just sign on and dial into our custom QVC-reboot every hour for a chance to snag some really cool limited-release items. Stay tuned to the event website to find out what you can win, and when!
Are you a Jeopardy lover? MongoDB Late Nite is your time to shine. Exercise your mental reflexes and get those synapses firing with hundreds of other party people inside episodes of dev-focused live trivia.
And what kind of revelry is complete without a resident psychic on board? Join us at the Future of Coding for an interactive reading by a VERY accurate psychic. So kick back, grab a beverage and join us at the party from home. Let’s get in the spirit together!
Sponsor Page/Online Booth
Pop into our virtual sponsor booth at your convenience. Our product experts will be there to answer your questions one-on-one. Alternatively, if casually exploring resources is more your style, check out our self-serve content playlists. View these to dig deeper into MongoDB education, glean customer success stories and get up to speed on the latest product features.

November 25, 2020

Idea to Production In Less Than 12 Months: Building a Flexible Smart Home Platform Using MongoDB at Vivint

America’s Largest Smart Home Solutions Company Scales to 500m Data Points per Day on Geo-Distributed MongoDB Cluster
“We’re really interested in delivering a lifestyle experience. It’s not just about cool flashy features in your house. We want really to make your life simpler.”
That’s how Nick Brown, Principal Software Architect at Vivint, explained his company’s vision to tech publication Datanami at MongoDB World 2016. Nick was at the event to present on the challenges and triumphs of a high-risk, ultra-urgent, ground-breaking Internet of Things implementation his team had created.
Vivint needed a database that would scale to millions of devices, deliver ultra-low latency response times and be flexible enough to let developers build quickly and make changes on the fly. The entire platform had to be created in less than a year to fit in with Vivint’s hardware and sales cycle. So Nick’s team designed and built a brand new integration platform based around MongoDB for this urgent, but complex, project.
The Vivint Sky platform, which has been live for two years, is the interface that gives millions of customers access to Vivint’s suite of smart home devices as well as integrating with other hardware providers. These could include anything from NEST thermostats, connected fridges, smart locks or security cameras, all of which can be accessed and controlled from any device via the Sky User Interface.
The platform is now running so smoothly that Vivint has been able to free up its database administrator resources to focus on more strategic responsibilities. They’re no longer burdened with low-level operational tasks. As Nick puts it, the traditional DBA activity takes about 5% of one person’s time.
It wasn’t all smooth sailing, though, and Nick’s presentation also covered what happened on the first Halloween evening his team faced.
It’s not just trick-or-treaters who are working hard on that night: the massive increase in residential activity results in some interesting problems for the smart home industry too.
Nick also covers:
- Schema design best practices for time-series sensor data
- Shard key selection
- The dramatic performance improvements achieved by upgrading to MongoDB with the WiredTiger storage engine
- How the MongoDB deployment has grown to 10 shards on 50 nodes, distributed across multiple geographies
To get more detail on how Vivint built its smart home platform, and to find out exactly what happened on Halloween, check out Nick’s slides from MongoDB World.
To learn more about the Internet of Things and MongoDB, download this free white paper: Bosch Software Innovations and MongoDB: Internet of Things and Big Data

July 27, 2016

Retail Reference Architecture Part 4: Recommendations and Personalizations

Series:
- Building a Flexible, Searchable, Low-Latency Product Catalog
- Approaches to Inventory Optimization
- Query Optimization and Scaling
- Recommendations and Personalizations
In the first three parts of our series on retail reference architecture, we focused on two practical applications of MongoDB in the retail space: product catalogs and inventory systems. Both of these are fairly conventional use cases where MongoDB acts as a system of record for a relatively static and straightforward collection of data. For example, in part one of our series where we focused on the product catalog, we used MongoDB to store and retrieve an inventory of items and their variants.
Today we’ll be looking at a very different application of MongoDB in the retail space, one that even those familiar with MongoDB might not think it is well suited for: logging a high volume of user activity data and performing data analytics. This final use case demonstrates how MongoDB can enable scalable insights, including recommendations and personalization for your customers.
Activity Logging
In retail, maintaining a record of each user’s activities gives a company the means to gain valuable predictive insight into user behavior, but it comes at a cost. For a retailer with hundreds of thousands or millions of customers, logging all of the activities generated by the customer base creates a huge amount of data, and storing that data in a useful and accessible way becomes a challenging task. The reason for this is that just about every activity performed by a user can be of interest to us, such as:
- Search
- Product views, likes or wishes
- Shopping cart add/remove
- Social network sharing
- Ad impressions
Even from this short list, it’s easy to see how the amount of data generated can quickly become problematic, both in terms of the cost/volume of storage needed, and a company’s ability to utilize the data in a meaningful way.
After all, we’re talking about potentially hundreds of thousands of writes per second, which means to gain any insights from our data set we are effectively trying to drink from the fire hose. The potential benefits, however, are huge. With this type of data a retailer can gain a wealth of knowledge that will help predict user actions and preferences for the purposes of upselling and cross-selling. In short, the better any retailer can predict what their users want, the more effectively they can drive a consumer to additional products they may want to purchase.
Requirements
For MongoDB to meet the needs of our use case, we need it to handle the following requirements:
- Ingestion of hundreds of thousands of writes per second: Normally MongoDB performs random access writes. In our use case this could lead to an unacceptable amount of disk fragmentation, so we used HVDF (more on this later) to store the data sequentially in an append-only fashion.
- Flexible schema: To minimize the amount of storage space required, each activity being logged is stored in the same format and size as it is received.
- Fast querying and sorting on varied fields: Secondary B-tree indexes ensure that our most common lookups and sorts will be performed in milliseconds.
- Easy deletes of old data: Typically, deleting large numbers of documents is a relatively expensive operation in MongoDB. By time partitioning our data into collections using HVDF, we are able to drop entire collections, which is a nearly free operation.
Data Model
As mentioned earlier, one of the requirements of our solution is the use of flexible schemas so that data is stored in the same format it is received; however, we do still need to put some thought into a general data model for each activity being recorded.
The following is an example that outlines some of the attributes we may want to capture across all samples:
    {
      _id: ObjectId(),
      geoCode: 1, // used to localize write operations
      sessionId: "2373BB…", // tracks activities across sessions
      device: {
        id: "1234", // tracks activities across different user devices
        type: "mobile/iphone",
        userAgent: "Chrome/34.0.1847.131"
      },
      type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…", // type of activity
      itemId: "301671", // item that was viewed, added to cart, etc.
      sku: "730223104376", // item sku
      order: {
        id: "12520185", // info about orders associated with the activity
        …
      },
      location: [ -86.95444, 33.40178 ], // user’s location when the activity was performed
      tags: [ "smartphone", "iphone", … ], // associated tags
      timeStamp: Date("2014/04/01 …") // time the activity was performed
    }
This is just one possibility of what an activity might look like. A major concern here is to persist only as much information as is necessary for each activity type to minimize required disk space. As a result, each document will vary depending on the type of activity data being captured.
High Volume Data Feed (HVDF)
HVDF is an open-source framework created by the team at MongoDB, which makes it easy to efficiently validate, store, index, query and purge time series data in MongoDB via a simple REST API. In HVDF, incoming data consists of three major components:
- Feed: Maps to a database. In our use case, we will have one feed per user to capture all of their activities.
- Channel: Maps to a collection. Each channel represents an activity type or source being logged.
- Sample: Maps to a document. A separate document is written for each user activity being logged.
HVDF allows us to easily use MongoDB in an append-only fashion, meaning that our documents will be stored sequentially, minimizing any wasted space when we write to disk.
In addition, HVDF handles a number of configuration details that make MongoDB more efficient for this type of high-volume data storage, including time-based partitioning and disabling the default power-of-2-sizes disk allocation strategy. More information and the source code for HVDF are available on GitHub.
Time Partitioning
In our usage, we take advantage of HVDF’s time slicing feature, which automatically time partitions samples by creating a new collection at a specified time interval. This has five advantages:
- Sequential writes: Since a new collection is created for each time interval, samples are always written to disk sequentially.
- Fast deletes: By treating the data as sets, deletes are essentially free. All we need to do is drop the collection.
- Index size: User activity logging creates a huge amount of data. Were we to create one collection for all samples per channel, our index of those samples could quickly become huge. By slicing our channels into time intervals, we keep our indexes small enough to fit in RAM, meaning queries using those indexes remain very performant.
- Collections optimized for reads: The time interval for each collection can be configured to match the interval we are most likely to want to retrieve. In general, a best practice is to keep time partitions small enough that their indexes will fit in RAM, but large enough that you will only need to query across two collections for any given query.
- Automatic sharding: HVDF automatically creates a shard key for each time interval as the collection is created. This ensures any lookups for a given time interval are performant, since samples for the same interval are written to the same shard.
To specify how our data will be time partitioned, we simply pass the following to our channel configuration:
    {
      "time_slicing": {
        "type": "periodic",
        "config": { "period": { "weeks": 4 } }
      }
    }
In this example, we are configuring HVDF to create a new collection per channel every 4 weeks.
The ‘period’ may be specified in years, weeks, hours, minutes, seconds, milliseconds or any combination of these.
Setting _id
To keep our queries highly performant, we also need to put some thought into the construction of the ‘_id’ for our documents. Since we know this data is going to be used primarily for performing time-bounded user analytics, it’s logical to choose an ‘_id’ that embeds both the creation timestamp and the ‘userID’. HVDF makes this very simple. All we need to do is specify that HVDF should use the ‘source_time_document’ id_type channel plugin in the channel configuration:
    "storage": {
      "type": "raw",
      "config": {
        "id_factory": {
          "type": "source_time_document",
          "config": { }
        }
      }
    }
This will give us an ‘_id’ that looks like this:
<pre><code>"_id" : { "source" : "user457", "ts" : NumberLong(324000000) } </code></pre> <p>With this format for &#x2018;_id&#x2019;, many of the most common queries we will need in order to perform analytics on our data can be answered almost instantaneously.</p> <h2 id="user-insights">User Insights</h2> <p>Now that we&#x2019;ve figured out how our users&#x2019; activities will be logged in MongoDB, it&#x2019;s time to actually look at what we can do with that data. The first important thing to keep in mind is that, as mentioned earlier, we are trying to drink from the fire hose. Capturing this type of user data very often leads to huge data sets, so all of our queries must be time bound. Not only does this limit the number of collections that are hit by each query thanks to our time partitioning, it also keeps the amount of data being searched sane. When dealing with terabytes of data, a full collection scan isn&#x2019;t even an option.</p> <h3 id="queries">Queries</h3> <p>Here are a couple of simple queries we might commonly use to gain some insight into a user&#x2019;s behavior. 
These would require secondary indexes on &#x2018;userId&#x2019; and &#x2018;itemId&#x2019;:</p> <ul> <li>Recent activity for a user:</li> </ul> <pre><code>db.activities.find({
  userId: "u123",
  ts: { "$gt": 1301284969946, "$lt": 1425657300000 } // time bound the query (epoch milliseconds)
}).sort({ ts: -1 }).limit(100) // sort in desc order
</code></pre> <ul> <li>Recent activity for a product:</li> </ul> <pre><code>db.activities.find({
  itemId: "301671", // requires a secondary index on timestamp + itemId
  ts: { "$gt": 1301284969946, "$lt": 1425657300000 }
}).sort({ ts: -1 }).limit(100)</code></pre> <p>To gain even more insight into what our users are doing, we can also make use of the <a href="http://docs.mongodb.org/manual/aggregation/">aggregation framework</a> in MongoDB:</p> <ul> <li>Recent number of views, purchases, etc for user</li> </ul> <pre><code>// DATE is a placeholder for the lower time bound
db.activities.aggregate([
  { $match: { userId: "u123", ts: { $gt: DATE } } }, // get a time bound set of activities for a user
  { $group: { _id: "$type", count: { $sum: 1 } } }   // group and sum the results by activity type
])
</code></pre> <ul> <li>Recent total sales for a user</li> </ul> <pre><code>db.activities.aggregate([
  { $match: { userId: "u123", ts: { $gt: DATE }, type: "ORDER" } },
  { $group: { _id: "result", count: { $sum: "$total" } } } // sum the total of all orders
])</code></pre> <ul> <li>Recent number of views, purchases, etc for item</li> </ul> <pre><code>db.activities.aggregate([
  { $match: { itemId: "301671", ts: { $gt: DATE } } },
  { $group: { _id: "$type", count: { $sum: 1 } } }
])</code></pre> <p>There are, of course, many different things the results of these queries might tell us about our users&#x2019; behavior. For example, over time, we might see that the same user has viewed the same item multiple times, meaning we might want to suggest that item to them the next time they visit our site. 
Similarly, we may find that a particular item has had a surge in purchases due to a sale or ad campaign, meaning we might want to suggest it to users more frequently to further drive up sales. </p> <p>Up-selling is a big possibility here as well. For example, let&#x2019;s say we know a user has looked at several different low-end cell phones, but that in general their total purchases are relatively high. In this case, we might suggest that they look at a higher-end device that is popular amongst users with similar stats.</p> <h3 id="map-reduce">Map-reduce</h3> <p>Another option for performing analytics with MongoDB is to use <a href="http://docs.mongodb.org/manual/core/map-reduce/">map-reduce</a>. This is often a good choice for very large data sets when compared with the aggregation framework for a few reasons. For example, take the following aggregation that calculates the total number of times an activity has been logged in the past hour for all unique visitors:</p> <pre><code>var oneHourAgo = Date.now() - 60 * 60 * 1000; // assumes ts holds epoch milliseconds
db.activities.aggregate([
  { $match: { ts: { $gt: oneHourAgo } } },
  { $group: { _id: "$userId", count: { $sum: 1 } } }
], { allowDiskUse: true })</code></pre> <p>Here we have several problems:</p> <ul> <li><p>allowDiskUse: In MongoDB, any aggregation stage that needs more than 100MB of RAM will throw an error, since the aggregation framework is designed to be executed in RAM. Here, the <code>allowDiskUse</code> property mitigates this issue by letting the aggregation spill to temporary files on disk, which as you might imagine gets extremely slow for a very large data set.</p> </li> <li><p>Single shard final grouping: Calculating uniques is hard for any system, but there is a particular aspect of the aggregation framework that can make performance suffer heavily. In a sharded MongoDB instance, our aggregation would first be performed per shard. Next, the results would be recombined on the primary shard and then the aggregation would be run again to remove any duplication across shards. 
This can become prohibitively expensive for a very large result set, particularly if the data set is large enough to require disk to be used for the aggregation.</p> </li> </ul> <p>For these reasons, while map-reduce is <a href="http://blog.mongodb.org/post/62900213496/qaing-new-code-with-mms-map-reduce-vs">not always a more performant option</a> than the aggregation framework, it is better for this particular use case. Plus, the implementation is quite simple. Here&#x2019;s the same query performed with map-reduce:</p> <pre><code>var oneHourAgo = Date.now() - 60 * 60 * 1000; // assumes ts holds epoch milliseconds
var map = function() { emit(this.userId, 1); }; // emit samples by userId with value of 1
var reduce = function(key, values) { return Array.sum(values); }; // sum the total samples per user
db.activities.mapReduce(map, reduce, {
  query: { ts: { $gt: oneHourAgo } }, // time bound the query to the last hour
  out: { replace: "lastHourUniques", sharded: true } // persist the output to a collection
})</code></pre> <p>From here we can execute queries against the resulting collection, in which each document&#x2019;s &#x2018;_id&#x2019; is the emitted &#x2018;userId&#x2019;:</p> <ul> <li>Number of activities for a user in the last hour</li> </ul> <pre><code>db.lastHourUniques.find({ _id: "u123" })</code></pre> <ul> <li>Total uniques in the last hour</li> </ul> <pre><code>db.lastHourUniques.count()</code></pre> <h3 id="cross-selling">Cross-selling</h3> <p>Now to put it all together. One of the most valuable features our analytics can enable is the ability to cross-sell products to our users. For example, let&#x2019;s say a user buys a new iPhone. There&#x2019;s a vast number of related products they might also want to buy, such as cases, screen protectors, headphones, etc. By analyzing our user activity data, we can actually figure out what related items are commonly bought together and present them to the user to drive sales of additional items. 
Here&#x2019;s how we do it.</p> <p>First, we calculate the items bought by a user:</p> <pre><code>var oneDayAgo = Date.now() - 24 * 60 * 60 * 1000; // assumes ts holds epoch milliseconds
var map = function() { emit(this.userId, { items: [this.itemId] }); }; // emit samples as userId: { items: [itemId] }
var reduce = function(key, values) { // merge the arrays of itemIds the user has bought
  var items = [];
  values.forEach(function(v) { items = items.concat(v.items); });
  return { items: items };
};
db.activities.mapReduce(map, reduce, {
  query: { type: "ORDER", ts: { $gt: oneDayAgo } }, // include only activities of type order in the last day
  out: { replace: "lastDayOrders", sharded: true } // persist the output to a collection
})</code></pre> <p>This will give an output collection that contains documents that look like this:</p> <pre><code>{ _id: "u123", value: { items: [ 2, 3, 8 ] } }</code></pre> <p>Next, we run a second map-reduce job on the output &#x2018;lastDayOrders&#x2019; collection to compute the number of occurrences of each item pair:</p> <pre><code>var map = function() {
  var items = this.value.items;
  for (var i = 0; i &lt; items.length; i++) {
    for (var j = i + 1; j &lt; items.length; j++) {
      emit({ a: items[i], b: items[j] }, 1); // emit each item pair
    }
  }
};
var reduce = function(key, values) { return Array.sum(values); }; // sum all occurrences of each item pair
db.lastDayOrders.mapReduce(map, reduce, {
  out: { replace: "pairs", sharded: true } // persist the output to a collection
})</code></pre> <p>This will give us an output collection that contains a count of how many times each set of items has been ordered together. The collection will contain documents that look like this:</p> <pre><code>{ _id: { a: 2, b: 3 }, value: 36 }</code></pre> <p>All we need to do from here is to create a secondary index on each item and its count, e.g. <code>{ "_id.a": 1, value: 1 }</code>. This makes determining which items to cross-sell a matter of a simple query to retrieve the items most commonly ordered together. 
In the following example, we retrieve only those items that have been ordered with itemId u123 more than 10 times, and sort in descending order: </p> <pre><code>db.pairs.find({ "_id.a": "u123", value: { $gt: 10 } })
  .sort({ value: -1 })</code></pre> <p>We can also optionally make the data more compact by grouping the most popular pairs in a separate collection, like this:</p> <pre><code>{ itemId: 2, recom: [ { itemId: 32, weight: 36}, { itemId: 158, weight: 23}, … ] }</code></pre> <h2 id="hadoop-integration">Hadoop Integration</h2> <p>For many applications, the aggregation framework (for real-time data analytics) and the map-reduce function (for less frequent handling of large data sets in MongoDB) are great options, but for complex number crunching of truly massive datasets, <a href="http://hadoop.apache.org">Hadoop</a> is going to be the tool of choice. For this reason, we have created a <a href="https://github.com/mongodb/mongo-hadoop">Hadoop connector</a> for MongoDB that makes processing data from a MongoDB cluster in Hadoop a much simpler task. Here are just a few of the features:</p> <ul> <li>Read/write from MongoDB or exported BSON</li> <li>Integration with Pig, Spark, Hive, MapReduce and others</li> <li>Works with Apache Hadoop, Cloudera CDH, Hortonworks HDP</li> <li>Open source</li> <li>Support for filtering with MongoDB queries, authentication, reading from replica set tags, and more</li> </ul> <p>For more information on the MongoDB Hadoop connector, check out the project on <a href="https://github.com/mongodb/mongo-hadoop">GitHub</a>.</p> <h2 id="recap">Recap</h2> <p>Performing user analytics, whether in real-time or incrementally, is no small task. Every step of the process, from capturing user data, to data storage, to data retrieval, to crunching the numbers, presents challenges that require significant planning if the system is to perform well. 
Luckily, MongoDB and HVDF can be great tools for accomplishing these types of tasks, either on their own or used in conjunction with other tools like Hadoop.</p> <p>That&#x2019;s it for our four-part series on retail reference architecture! In just four blog posts we&#x2019;ve covered how MongoDB can be used to power some of the most common and complex needs of large-scale retailers, including <a href="https://www.mongodb.com/blog/post/retail-reference-architecture-part-1-building-flexible-searchable-low-latency-product">product catalogs</a>, <a href="https://www.mongodb.com/blog/post/retail-reference-architecture-part-2-approaches-inventory-optimization">inventory</a>, and <a href="https://www.mongodb.com/blog/post/retail-reference-architecture-part-3-query-optimization-and-scaling">query optimization</a>. For more tools and examples of what MongoDB can do for you, check out our <a href="https://github.com/mongodb/">GitHub</a>.</p> <h2 id="learn-more">Learn more</h2> <p>To discover how you can re-imagine the retail experience with MongoDB, <a href="https://www.mongodb.com/lp/white-paper/serving-the-digitally-oriented-consumer?jmp=blog" target="_blank">read our white paper</a>. In this paper, you&apos;ll learn about the new retail challenges and how MongoDB addresses them.</p> <p></p><hr> Learn more about how leading brands differentiate themselves with technologies and processes that enable the omni-channel retail experience.<p></p> <p></p><center><a class="btn btn-primary" href="https://www.mongodb.com/collateral/serving-digitally-oriented-consumer?jmp=blog" target="_BLANK">Read our guide on the digitally oriented consumer</a></center><p></p> <hr> <p style="float:left; margin-bottom:0px;font-size: 18px;"><strong><a href="https://www.mongodb.com/blog/post/retail-reference-architecture-part-3-query-optimization-and-scaling">&lt;&lt; Read Part 3</a></strong></p>
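The two-pass pair counting described in the cross-selling section can also be sketched in plain JavaScript, with no MongoDB required, which may help when checking the logic by hand. This is only an illustration under stated assumptions: the function names `itemsByUser` and `countPairs`, the sample `orders` array, and the `"a:b"` key format are all hypothetical, and the de-duplication and sorting of each user's items is an added choice (not part of the original jobs) so that an item pair is counted once per user regardless of purchase order. It also assumes numeric item ids.

```javascript
// Plain-JavaScript sketch of the two map-reduce passes (no MongoDB involved).
// All names and the sample data below are hypothetical.

// Pass 1: collect the itemIds ordered by each user.
function itemsByUser(orders) {
  const byUser = new Map();
  for (const { userId, itemId } of orders) {
    if (!byUser.has(userId)) byUser.set(userId, []);
    byUser.get(userId).push(itemId);
  }
  return byUser;
}

// Pass 2: count how many users ordered each pair of items.
function countPairs(byUser) {
  const counts = new Map();
  for (const items of byUser.values()) {
    // Assumption: de-duplicate and sort so (2, 3) and (3, 2) share one key.
    const unique = [...new Set(items)].sort((x, y) => x - y);
    for (let i = 0; i < unique.length; i++) {
      // j stops before unique.length, so no undefined item is ever paired.
      for (let j = i + 1; j < unique.length; j++) {
        const key = `${unique[i]}:${unique[j]}`;
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const orders = [
  { userId: "u1", itemId: 2 },
  { userId: "u1", itemId: 3 },
  { userId: "u1", itemId: 8 },
  { userId: "u2", itemId: 3 },
  { userId: "u2", itemId: 2 },
];

const pairs = countPairs(itemsByUser(orders));
console.log([...pairs.entries()]);
```

In this sample, the pair of items 2 and 3 is counted twice (once for each user), while the pairs involving item 8 are counted once each.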

June 4, 2015

Retail Reference Architecture Part 3: Query Optimization and Scaling

Series:
- Building a Flexible, Searchable, Low-Latency Product Catalog
- Approaches to Inventory Optimization
- Query Optimization and Scaling
- Recommendations and Personalizations
In part one of this series on reference architectures for retail, we discussed how to use MongoDB as the persistence layer for a large product catalog. In part two, we covered the schema and data model for an inventory system. Today we’ll cover how to query and update inventory, plus how to scale the system.
Inventory Updates and Aggregations
At the end of the day, a good inventory system needs to be more than just a system of record for retrieving static data. We also need to be able to perform operations against our inventory, including updates when inventory changes, and aggregations to get a complete view of what inventory is available and where.
The first of these operations, updating the inventory, is both pretty straightforward and nearly as performant as a standard query, meaning our inventory system will be able to handle the high volume we would expect to receive. To do this with MongoDB we simply select an item by its ‘productId’ (along with its store and variant SKU), then execute an in-place update on the variant we want to update using the $inc operator:
    db.inventory.update(
      { "storeId": "store100", "productId": "20034", "vars.sku": "sku11736" },
      { "$inc": { "vars.$.q": 20 } }
    )
For aggregations of our inventory, the aggregation pipeline framework in MongoDB gives us many valuable views of our data beyond simple per-store inventory by allowing us to take a set of results and apply multi-stage transformations. For example, let’s say we want to find out how much inventory we have for all variants of a product across all stores.
To get this we could create an aggregation request like this:

    db.inventory.aggregate([
      { $match: { productId: "20034" } },
      { $unwind: "$vars" },
      { $group: { _id: "result", count: { $sum: "$vars.quantity" } } }
    ])

Here, we are retrieving the inventory for a specific product from all stores, then using the $unwind operator to expand our variants array into a set of documents, which are then grouped and summed. Because every document is grouped under the same constant _id, this gives us a single total inventory count across all variants that looks like this:

    { "_id": "result", "count": 101752 }

Grouping on "$vars.sku" instead would return one count per variant. Alternatively, we could have matched on 'storeId' rather than 'productId' to get the inventory of all variants for a particular store. Thanks to the aggregation pipeline framework, we are able to apply many different operations on our inventory data to make it more consumable for things like reports and gain real insights into the information available. Pretty awesome? But wait, there's more!

Location-based Inventory Queries

So far we've primarily looked at what retailers can get out of our inventory system from a business perspective, such as tracking and updating inventory, and generating reports, but one of the most notable strengths of this setup is the ability to power valuable customer-facing features. When we began architecting this part of our retail reference architecture, we knew that our inventory would need to do more than just give an accurate snapshot of inventory levels at any given time; it would also need to support the type of location-based querying that has become expected in consumer mobile and web apps.

Luckily, this is not a problem for our inventory system. Since we decided to duplicate the geolocation data from our stores collection into our inventory collection, we can very easily retrieve inventory relative to user location.
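To make "inventory relative to user location" concrete, here is a plain-JavaScript sketch of what such a lookup computes: filter by product, measure the distance from the user to each document's location, drop anything beyond a maximum distance, and sort nearest first. This is only an illustration under stated assumptions, not how MongoDB implements geospatial queries (the server uses the 2dsphere index rather than scanning). The function names and sample documents are hypothetical.

```javascript
// Illustration only: MongoDB does this work server-side with a geospatial index.
const EARTH_RADIUS_M = 6371000; // mean Earth radius in meters

// Great-circle (haversine) distance in meters between two [longitude, latitude] pairs.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Inventory documents for productId within maxDistance meters of point, nearest first.
function nearbyInventory(docs, point, productId, maxDistance) {
  return docs
    .filter((doc) => doc.productId === productId)
    .map((doc) => ({ ...doc, dis: haversineMeters(point, doc.location) }))
    .filter((doc) => doc.dis <= maxDistance)
    .sort((a, b) => a.dis - b.dis);
}

// Hypothetical sample data: each inventory document carries its store's location.
const inventory = [
  { storeId: "store101", productId: "20034", location: [-82.85, 40.10] },
  { storeId: "store100", productId: "20034", location: [-82.80, 40.09] },
  { storeId: "store102", productId: "20034", location: [-86.95, 33.40] },
  { storeId: "store100", productId: "99999", location: [-82.80, 40.09] },
];

const results = nearbyInventory(inventory, [-82.8006, 40.0908], "20034", 10000);
console.log(results.map((r) => r.storeId));
```

Because the location lives on the inventory document itself, the sketch never needs to consult a separate list of stores, which is the point of the duplication.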
Returning to the geoNear command that we used earlier to retrieve nearby stores, all we need to do is add a simple query to return real-time information to the consumer, such as the available inventory of a specific item at all the stores near them:

    db.runCommand({
      geoNear: "inventory",
      near: { type: "Point", coordinates: [-82.8006, 40.0908] },
      maxDistance: 10000,
      spherical: true,
      limit: 10,
      query: { "productId": "20034", "vars.sku": "sku1736" }
    })

Or the 10 closest stores that have the item they are looking for in stock:

    db.runCommand({
      geoNear: "inventory",
      near: { type: "Point", coordinates: [-82.8006, 40.0908] },
      maxDistance: 10000,
      spherical: true,
      limit: 10,
      query: {
        "productId": "20034",
        "vars": { $elemMatch: {
          sku: "sku1736",        // returns only documents with this sku in the vars array
          quantity: { $gt: 0 }   // quantity greater than 0
        } }
      }
    })

Since we indexed the 'location' attribute of our inventory documents, these queries are also very performant, a necessity if the system is supporting the type of high-volume traffic commonly generated by consumer apps, while also supporting all the transactions from our business use case.

Deployment Topology

At this point, it's time to celebrate, right? We've built a great, performant inventory system that supports a variety of queries, as well as updates and aggregations. All done!

Not so fast. This inventory system has to support the needs of a large retailer. That means it not only has to be performant for local reads and writes, it must also support requests spread over a large geographic area. This brings us to the topic of deployment topology.

Datacenter Deployment

We chose to deploy our inventory system across three datacenters, one each in the west, central and east regions of the U.S. We then sharded our data based on the same regions, so that all stores within a given region would execute transactions against a single local shard, minimizing any latency across the wire.
And lastly, to ensure that all transactions, even those against inventory in other regions, were executed against the local datacenter, we replicated all three shards in each datacenter. Since we are using replication, there is the issue of eventual consistency should a user in one region need to retrieve data about inventory in another region, but assuming a good data connection between datacenters and low replication lag, this is minimal and worth the trade-off for the decrease in request latency when compared to making requests across regions.

Shard Key

Of course, when designing any sharded system we also need to carefully consider what shard key to use. In this case, we chose {storeId:1, productId:1} for two reasons. The first was that using the 'storeId' ensured all the inventory for each store was written to the same shard. The second was cardinality. Using 'storeId' alone would have been problematic: even if we had hundreds of stores, we would be using a shard key with relatively low cardinality, which could lead to an unbalanced cluster if we are dealing with an inventory of hundreds of millions or even billions of items. The solution was to also include 'productId' in our shard key, which gives us the cardinality we need should our inventory grow to a size where multiple shards are needed per region.

Shard Tags

The last step in setting up our topology was ensuring that requests were sent to the appropriate shard in the local datacenter. To do this, we took advantage of tag-aware sharding in MongoDB, which associates a range of shard key values with a specific shard or group of shards.
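As a rough mental model of that association, the sketch below resolves a storeId to a tag and then to the shards carrying that tag. It is a simplified plain-JavaScript illustration, not MongoDB's balancer logic. The shard names, tags and numeric ranges mirror the simplified ones used in this post, the lookup functions are hypothetical, and ranges are treated as inclusive of the lower bound and exclusive of the upper bound.

```javascript
// Illustration only: a simplified model of tag-aware sharding.
// Ranges are lower-bound inclusive, upper-bound exclusive.
const tagRanges = [
  { min: 0, max: 100, tag: "west" },
  { min: 100, max: 200, tag: "central" },
  { min: 200, max: 300, tag: "east" },
];

// Which shards carry which tag (one shard per region in this sketch).
const shardsByTag = {
  west: ["shard001"],
  central: ["shard002"],
  east: ["shard003"],
};

// Find the tag whose range contains the given storeId, or null if none does.
function tagFor(storeId) {
  const range = tagRanges.find((r) => storeId >= r.min && storeId < r.max);
  return range ? range.tag : null;
}

// Resolve the shards eligible to hold data for a storeId.
function shardsFor(storeId) {
  const tag = tagFor(storeId);
  return tag ? shardsByTag[tag] : [];
}

console.log(tagFor(42), shardsFor(42));
console.log(tagFor(100), shardsFor(100));
console.log(tagFor(999), shardsFor(999));
```

A storeId that falls outside every range has no tag in this sketch, so no shard is singled out for it.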
To start, we created a tag for the primary shard in each region:

- sh.addShardTag("shard001", "west")
- sh.addShardTag("shard002", "central")
- sh.addShardTag("shard003", "east")

Then assigned each of those tags to a range of stores in the same region:

- sh.addTagRange("inventoryDB.inventory", {storeId:0}, {storeId:100}, "west")
- sh.addTagRange("inventoryDB.inventory", {storeId:100}, {storeId:200}, "central")
- sh.addTagRange("inventoryDB.inventory", {storeId:200}, {storeId:300}, "east")

In a real-world situation, stores would probably not fall so neatly into ranges for each region, but since we can assign whichever stores we want to any given tag, down to the level of assigning a single storeId to a tag, we have the flexibility to accommodate our needs even if storeIds are more discontinuous. Here we are simply assigning by range for the sake of simplicity in our reference architecture.

Recap

Overall, the process of creating our inventory system was pretty simple, requiring relatively few steps to implement. More important than the finished system itself is the process we used to design it. In MongoDB, careful attention to schema design, indexing and sharding is critical for building and deploying a setup that meets your use case while ensuring low latency and high performance, and as you can see, MongoDB offers a lot of tools for accomplishing this.

Up next in the final part of our retail reference architecture series: scalable insights, including recommendations and personalization!

May 20, 2015

Retail Reference Architecture Part 2: Approaches to Inventory Optimization

In part one of our series on retail reference architecture we looked at some best practices for how a high-volume retailer might use MongoDB as the persistence layer for a large product catalog. This involved index, schema, and query optimization to ensure our catalog could support features like search, per-store pricing and browsing with faceted search in a highly performant manner. Over the next two posts we will be looking at approaches to similar types of optimization, but applied to an entirely different aspect of retail business: inventory.

A solid central inventory system that is accessible across a retailer's stores and applications is a large part of the foundation needed for improving and enriching the customer experience. Here are just a few of the features that a retailer might want to enable:

- Reliably check real-time product availability.
- Give the option for in-store pick-up at a particular location.
- Detect the need for intra-day replenishment if there is a run on an item.

The Problem with Inventory Systems

These features seem basic, but they present real challenges given the types of legacy inventory systems commonly used by major retailers. In these systems, individual stores keep their own field inventories, which then report data back to the central RDBMS at a set time interval, usually nightly. That RDBMS then reconciles and categorizes all of the data received that day and makes it available for operations like analytics and reporting, as well as consumption by external and internal applications. Commonly there is also a caching layer between the RDBMS and any applications, as relational databases are often not well-suited to the transaction volume required by such clients, particularly if we are talking about a consumer-facing mobile or web app.
So the problem with the status quo is pretty clear. The basic setup of these systems isn't suited to providing a continually accurate snapshot of how much inventory we have and where that inventory is located. In addition, we also have the increased complexity involved in maintaining multiple systems, i.e. caching, persistence, etc. MongoDB, however, is ideal for supporting these features with a high degree of accuracy and availability, even if our individual retail stores are very geographically dispersed.

Design Principles

To begin, we determined that the inventory system in our retail reference architecture needed to do the following:

- Provide a single view of inventory, accessible by any client at any time.
- Be usable by any system that needs inventory data.
- Handle a high-volume, read-dominated workload, i.e. inventory checks.
- Handle a high volume of real-time writes, i.e. inventory updates.
- Support bulk writes to refresh the system of record.
- Be geographically distributed.
- Remain horizontally scalable as the number of stores or items in inventory grows.

In short, what we needed was to build a high-performance, horizontally scalable system where stores and clients over a large geographic area could transact in real time with MongoDB to view and update inventory.

Stores Schema

Since a primary requirement of our use case was to maintain a centralized, real-time view of total inventory per store, we first needed to create the schema for a stores collection so that we had locations to associate our inventory with.
The result is a fairly straightforward document per store:

    {
      "_id": ObjectId("78s89453d8chw28h428f2423"),
      "className": "catalog.Store",
      "storeId": "store100",
      "name": "Bessemer Store",
      "address": {
        "addr1": "1 Main St.",
        "city": "Bessemer",
        "state": "AL",
        "zip": "12345",
        "country": "USA"
      },
      "location": [-86.95444, 33.40178],
      …
    }

We then created the following indices to optimize the most common types of reads on our store data:

- {"storeId":1},{"unique":true}: Get inventory for a specific store.
- {"name":1}: Get a store by name.
- {"address.zip":1}: Get all stores within a zip code, i.e. store locator.
- {"location":"2dsphere"}: Get all stores around a specified geolocation.

Of these, the location index is especially useful for our purposes, as it allows us to query stores by proximity to a location, e.g. a user looking for the nearest store with a product in stock. To take advantage of this in a sharded environment, we used a geoNear command that retrieves the documents whose 'location' attribute is within a specified distance of a given point, sorted nearest first:

    db.runCommand({
      geoNear: "stores",
      near: { type: "Point", coordinates: [-82.8006, 40.0908] },  // GeoJSON or coordinate pair
      maxDistance: 10000.0,  // in meters
      spherical: true        // required for 2dsphere indexes
    })

This schema gave us the ability to locate our stores, but the much bigger challenge was tracking and managing the inventory in those stores.

Inventory Data Model

Now that we had stores to associate our items with, we needed to create an inventory collection to track the actual inventory count of each item and all its variants. Some trade-offs were required for this, however. To minimize the number of round trips to the database and avoid application-level joins, we decided to duplicate data from the stores collection into the inventory collection.
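A minimal sketch of that duplication, assuming a hypothetical helper named buildInventoryDoc: when an inventory document is created, the store's identifier and location are copied into it so that later reads need only the inventory collection. The helper and the sample values are illustrative and are not taken from the reference implementation.

```javascript
// Hypothetical helper: copy the store fields that inventory queries need into
// the inventory document itself (denormalization).
function buildInventoryDoc(store, productId, variants) {
  return {
    storeId: store.storeId,
    location: [...store.location], // duplicated so geo queries need only the inventory collection
    productId,
    vars: variants.map(({ sku, quantity }) => ({ sku, quantity })),
  };
}

// Sample store, trimmed to the fields used here.
const store = {
  storeId: "store100",
  name: "Bessemer Store",
  location: [-86.95444, 33.40178],
};

const inventoryDoc = buildInventoryDoc(store, "20034", [
  { sku: "sku1", quantity: 5 },
  { sku: "sku2", quantity: 23 },
]);

console.log(JSON.stringify(inventoryDoc));
```

The cost of this design is the usual one for denormalized data: if a store's location ever changes, every inventory document for that store has to be updated as well.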
The document we came up with looked like this:

    {
      "_id": "902372093572409542jbf42r2f2432",
      "storeId": "store100",
      "location": [-86.95444, 33.40178],
      "productId": "20034",
      "vars": [
        { "sku": "sku1", "quantity": 5 },
        { "sku": "sku2", "quantity": 23 },
        { "sku": "sku3", "quantity": 2 },
        …
      ]
    }

Notice first that we included both the 'storeId' and 'location' attributes in our inventory document. Clearly the 'storeId' is necessary so that we know which store has what items, but what happens when we are querying for inventory near the user? Both the inventory data and store location data are required to complete the request. By adding geolocation data to the inventory document we eliminate the need to execute a separate query to the stores collection, as well as a join between the stores and inventory collections.

For our schema we also decided to represent inventory in our documents at the productId level. As was noted in part one of our retail reference architecture series, each product can have many, even thousands of variants, based on size, color, style, etc., and all these variants must be represented in our inventory. So the question is: should we favor larger documents that contain a potentially large variants array, or many more documents that represent inventory at the variant level? In this case, we favored larger documents to minimize the amount of data duplication, as well as decrease the total number of documents in our inventory collection that would need to be queried or updated.

Next, we created our indices:

- {storeId:1}: Get all items in inventory for a specific store.
- {productId:1, storeId:1}: Get inventory of a product for a specific store.
- {productId:1, location:"2dsphere"}: Get all inventory of a product within a specific distance.

It's worth pointing out here that we chose not to include an index with 'vars.sku'.
The reason for this is that it wouldn't actually buy us very much, since we are already able to do lookups in our inventory based on 'productId'. So, for example, a query to get a specific variant sku that looks like this:

    db.inventory.find(
      { "storeId": "store100", "productId": "20034", "vars.sku": "sku11736" },
      { "vars.$": 1 }
    )

doesn't actually benefit much from an added index on 'vars.sku'. In this case, our index on 'productId' is already giving us access to the document, so an index on the variant is unnecessary. In addition, because the variants array can have thousands of entries, an index on it could potentially take up a large block of memory, and consequently decrease the number of documents stored in memory, meaning slower queries. All things considered, an unacceptable trade-off, given our goals.

So what makes this schema so good anyhow? We'll take a look in our next post at some of the features this approach makes available to our inventory system.

May 12, 2015

Retail Architecture Best Practices Part 1: Building a MongoDB Product Catalog

In today's digital-first economy, flawless e-commerce experiences are integral to securing and maintaining customer loyalty. We're here to help. In this guide, we'll discuss:

- Why retailers prefer MongoDB
- Guidance on data modeling for product catalogs
- How to make products discoverable
- Driving conversions with fast and relevance-based search

In order to deliver the AI/ML-enriched and hyper-personalized retail experiences that are table stakes in the digital ecosystem, retailers are moving away from monolithic architectures. The first focus for this transformation is the product catalog, which is the window into modern e-commerce platforms.

As consumers' preferred shopping methods have shifted to e-commerce, customer demand for real-time updates has increased dramatically. A retailer's ability to adapt to this change has a direct influence on customer loyalty and business growth. The lack of a real-time, consolidated view of customer history, orders, inventory, and supply chain network updates will hurt the bottom line.

Product catalog data management is a complex problem for retailers. After years of relying on multiple monolithic, vendor-provided systems, retailers have learned that product catalogs built on legacy databases are unsuitable for modern e-commerce experiences. In today's vendor-provided systems, product data must frequently be moved back and forth using ETL processes to ensure all systems (in-store, mobile shopping, and e-commerce) are operating on the same data set. This approach is slow, error-prone, and expensive in terms of development and management. In response, retailers are now making data services available individually as part of a distributed microservices architecture.

Data models built on relational databases are restrictive when they need to be realigned with newly discovered attributes and other required modifications.
Inconsistent and fragmented data structures (imagine a supplier introducing attributes that were previously unknown, such as size or availability season) can be visible to end customers as inconsistent product information across channels. Many retailers using relational databases must introduce new tables to track newer attributes, and these take time and effort to propagate across channels that each maintain their own copy of the product data.

Single view, scalability, flexibility, and search

Modern retailers understand the need for improved web, mobile, and social media commerce experiences. Today's customers expect the digital-first experience to follow them to brick-and-mortar stores, too. The next frontier of retail personalization is the seamless connection, and blurring, of the digital and physical retail experience.

How will retailers get to the hybrid future of shopping? Many are approaching the challenge by investing in cloud migration, typically with a "lift-and-shift" approach that replicates their on-premises infrastructure in the cloud. For a more transformational approach to retail modernization, a developer data platform that provides a single, unified view across all your data assets, along with scale, flexibility, and integrated search functionality, can thoroughly reshape floundering modernization strategies.

Single view, scalability, flexibility, and search are foundational for product catalog modernization. Here's why:

- Single view: A single view of customer data breaks down data silos, allowing retailers to utilize data to offer a consistent, integrated service. Data duplication is reduced, data consistency is increased, and a holistic view of the customer is made available to the application stack, so retailers can focus on offering stronger personalized client experiences.
- Scalability: As retailers add more products to meet market demand and growth acceleration goals, including moving to markets in new regions, product catalog access slows. Retailers need application data platforms that can scale without sacrificing performance.
- Flexibility: A product catalog contains product attributes with a high degree of variability, and these continue to evolve over the product life cycle. A flexible document model accommodates the diverse data sets found in retail and lets retailers use them to their advantage, without worrying about data duplication, stale data, or data silos. For example, a retailer may want to carry a specific product line (ice fishing gear, say) only in the northern regions of Canada where it is required.
- Search: Fast, sophisticated, intuitive, relevance-based search is key to providing a superior customer experience across channels, including recommendations, geospatial merchandising, and systems availability. A typical retailer sets up a separate search infrastructure, which becomes an additional piece of the puzzle that must be managed and maintained. MongoDB Atlas solves this problem by providing built-in search capabilities comparable to those of any Lucene-based search engine.

MongoDB helps developers build retail product catalogs with seven key solutions:

- Omnichannel catalog
- Single view
- Real-time analytics
- Payments
- Fast & relevant full-text search
- Real-time inventory and supply chain
- Retail logistics: personalization

Why MongoDB works for mission critical applications

The MongoDB application data platform radically simplifies data architecture. While it provides the strengths commonly associated with traditional relational systems, such as ACID transactions, secondary indexes, unions, joins, security and enterprise management, the real value comes from the application data platform approach. MongoDB's document model makes data modeling intuitive by mapping real-world objects onto native data structures in a flexible manner. Idiomatic drivers enable developers to work with data as code.
The MongoDB Query API and drivers are idiomatic to your programming language, whether it's C#, Swift, Java, JavaScript, etc. Ad hoc queries, indexing, full-text search, and real-time aggregations provide powerful ways to access, group, transform, search and analyze data to support any class of workload. Combined with MongoDB's scalability and distribution, this empowers you to build a global database over multiple cloud platforms, anywhere, any time.

In retailers' product catalogs, a single item could have thousands of variants. Even a single shoe style that comes in six different colors and 12 sizes already has 72. Product prices vary widely based on store or geography, and each product comes with reviews, promotions, data sheets, images, videos and more. Altogether, a single SKU for a product can represent tens or hundreds of different attributes, each of which can be unique to that SKU.

MongoDB's document data model helps contain and organize the chaos of retailers' data silos into a consolidated, rich JSON document. This model enables retailers to dramatically reduce hours spent on database administration. Development teams are freed up to focus on launching new features instead of grappling with rigid, tabular data models that bear no relation to the product being modeled, and that need to be constantly changed as new products are added to the catalog.

The specifics: data models and per store pricing

Now that you've learned the key reasons why retailers choose MongoDB for e-commerce solutions, we'll take a look at the specifics of how we put some of these to use in our retail reference architecture to support a number of features, including:

- Searching for products and product variants
- Retrieving per store pricing for items
- Enabling catalog browsing with faceted search

Product Data Model

The first thing we need to consider is the data model for our items.
In the following examples we are showing only the most important information about each item, such as category, brand and description:

    {
      "_id": "30671",  // main item ID
      "department": "Shoes",
      "category": "Shoes/Women/Pumps",
      "brand": "Calvin Klein",
      "thumbnail": "http://cdn.../pump.jpg",
      "title": "Evening Platform Pumps",
      "description": "Perfect for a casual night out or a formal event.",
      "style": "Designer",
      …
    }

This type of simple data model allows us to easily query for items based on the most common predicates. For example, we can use db.collection.findOne, which returns a single document that satisfies a query, and db.collection.find, which returns every document that does:

    Get item by ID: db.definition.findOne({_id:"30671"})
    Get items for a set of product IDs: db.definition.find({_id:{$in:["30671","452318"]}})
    Get items by category prefix: db.definition.find({category:/^Shoes\/Women/})

Notice how the second and third queries used the $in operator and a regular expression, respectively. When these queries are backed by indexes, MongoDB provides high throughput and low latency for the common lookups users perform.

Variant Data Model

Another important consideration for our product catalog is item variants, such as available sizes, colors, and styles. Our item data model above only captures a small amount of the data about each catalog item. So what about all of the available item variations we may need to retrieve, such as size and color?

One option is to store an item and all its variants together in a single document. This approach has the advantage of being able to retrieve an item and all variants in a single query. However, it is not the best approach in all cases. It is an important best practice to avoid unbounded document growth. If the number of variants and their associated data is small, it may make sense to store them in the item document.
Another option is to create a separate variant data model that can be referenced relative to the primary item:

    {
      "_id": "93284847362823",  // variant sku
      "itemId": "30671",        // references the main item
      "size": 6.0,
      "color": "red",
      …
    }

This data model allows us to do fast lookups of specific item variants by their SKU number:

    db.variation.find({_id:"93284847362823"})

As well as all variants for a specific item by querying on the itemId attribute:

    db.variation.find({itemId:"30671"}).sort({_id:1})

In this way, we maintain fast queries on both our primary item for displaying in our catalog, as well as every variant for when the user requests a more specific product view. We also ensure a predictable size for the item and variant documents. The MongoDB document model not only gives the flexibility of a dynamic schema but also enforces governance through schema validation.

Per store pricing

Another consideration when defining the reference architecture for our product catalog is pricing. We've now seen a few ways that the data model for our items can be structured to quickly retrieve items directly or based on specific attributes. Prices can vary by many factors, like store location, geographies, local currencies and taxation laws. We need a way to quickly retrieve the specific price and other relevant factors of any given item or item variant. This can be very problematic for large retailers, since a catalog with a million items and one thousand stores means we must query across a collection of a billion documents to find the price of any given item.

We could, of course, store the price for each variant as a nested document within the item document, but a better solution is to again take advantage of how quickly MongoDB is able to query on _id.
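To build some intuition for why keying on _id helps, here is a plain-JavaScript sketch in which a Map stands in for a collection indexed on _id. Looking up a price is then a direct key access rather than a scan over a billion documents. The key format (an identifier, an underscore, then a store or store group), the helper names and the prices are all hypothetical and only serve the illustration.

```javascript
// Illustration only: a Map stands in for a collection indexed on _id.
// Hypothetical key format: "<itemId or sku>_<storeId or store group>".
function priceId(itemOrSku, storeOrGroup) {
  return `${itemOrSku}_${storeOrGroup}`;
}

// Made-up prices for one item and one of its variants.
const prices = new Map([
  [priceId("30671", "store23"), { price: 89.0 }],
  [priceId("30671", "sgroup12"), { price: 95.0 }],
  [priceId("93284847362823", "store23"), { price: 79.0 }],
]);

// Fetch the prices for an explicit list of keys, similar in spirit to an $in query on _id.
function lookupPrices(ids) {
  return ids
    .filter((id) => prices.has(id))
    .map((id) => ({ _id: id, ...prices.get(id) }));
}

// List every key for one item, similar in spirit to an anchored prefix match on _id.
function priceIdsForItem(itemId) {
  const prefix = `${itemId}_`;
  return [...prices.keys()].filter((id) => id.startsWith(prefix));
}

console.log(lookupPrices(["30671_store23", "93284847362823_store23", "30671_store99"]));
console.log(priceIdsForItem("30671"));
```

Including the underscore in the prefix keeps an item such as 30671 from accidentally matching keys that belong to a longer identifier starting with the same digits.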
For example, if each item in our catalog is referenced by an itemId, while each variant is referenced by a SKU number, we can set the _id of each price document to be a concatenation of the itemId or SKU and the storeId associated with that price. Using this model, the _id for the pair of pumps and its red variant described above would look something like this:

    Item: 30671_store23
    Variant: 93284847362823_store23

This approach also provides a lot of flexibility for handling pricing, as it allows us to price items at the item or the variant level. We can then query for all prices or just the price for a particular location:

    All prices: db.prices.find({_id:/^30671_/})
    Store price: db.prices.find({_id:"30671_store23"})

We could even add other combinations, such as pricing per store group, and get all possible prices for an item with a single query by using the $in operator:

    db.prices.find({_id:{$in:[
      "30671_store23",
      "30671_sgroup12",
      "93284847362823_store23",
      "93284847362823_sgroup12"
    ]}})

Product catalog search: How hard is it to build full-text search?

Product catalogs contain a high degree of variability of information per product type or category. Complex product structures like serialized items, bundles and kits require an exhaustive process to update the catalog as well as to search it. The product search bar on an e-commerce website needs to be powered by a search technology that infers the intent of the customer and surfaces the right products, further enhancing the customer experience.

A product catalog is only as good as the results its search engine provides. Customers expect to find what they are looking for based on a few keystrokes and intelligent guesses by the commerce platform. They depend on the search bar, which is essentially a window into your e-commerce store. The faster and more relevant search results need to be to satisfy customer expectations, the more complex and difficult search is to build.
A common misperception is that a fast database can support the search requirements for e-commerce interfaces. But database queries are much less dynamic than search queries.

Database queries are ideal for situations when developers know what queries to expect, based on how the application works. They can index fields based on common query patterns to improve performance, and queries are optimized for correctness and integrity. For example: "show me all navy blue Calvin Klein evening shoes in size 8." It's straightforward, there's an exact answer, and we can create an index on the relevant brand, title, size, and color fields.

On the other hand, search queries are designed for situations where developers don't know what user queries will look like in advance. They are often expressed as natural language questions and may require searching across a lot of different fields. These queries are optimized for speed and relevance; they return results immediately, and sort them by how closely they match the user's search terms. For example: "show me all women's pumps for evening and casual wear." In this case, we have to consider the color, brand, image alt text, meta tags, etc. And the next query might be entirely different. It could ask, "show me all evening shoes priced between $50-$100."

Legacy product search architectures use several different systems to deliver this capability, including data in an RDBMS, a separate search engine, and a caching layer to allow faster page rendering, resulting in complex architectures and a tremendous need for synchronization across systems.

Modern retail search requirements for e-commerce sites

Today's online shoppers are savvy internet users. We're all accustomed to the quick, robust search results we get from Google. Once again, there are 7 key solutions that MongoDB offers to help retailers build search experiences that run like Google and give consumers the relevant results they need to make quick and easy purchases.
Here are the key search requirements for e-commerce sites:

- Fuzzy search, autocomplete, synonyms, and analyzers help users get the right search results quickly and easily.
- Faceted search and counts help users efficiently navigate categorized search results on several attributes.
- Highlighted extract snippets help users understand a product's relevance to their search term.
- Geospatial search allows users to filter and return results by location.
- Response within milliseconds for hundreds or thousands of items.
- Relevance tuning returns sponsored or preferred products higher up in the results set.
- Page rendering/pagination, which requires deterministic ordering.

How retailers build search for product catalogs today

The way that developers in the retail industry build search today can be grouped into three buckets: simple database search, bolt-on search engine, and customized search.

Simple database search: examples of this are the $text and $regex operators in MongoDB. They are easy to use, and since they are common database commands, there's no data sync required. On the flip side, they are often quite slow, leading to poor performance; they are limited to just text data; and they provide really no way to tune the relevance of results.

Bolt-on search engine: examples of this are Elasticsearch, Algolia, AWS OpenSearch Service, Azure Cognitive Search, and in some cases, Solr. These are popular and well-known search engines available on the market and generally provide fast, relevant results. However, they are yet another system to develop against, pay for, and manage, and require constantly syncing data to and from a database. We'll dig more into the pains shortly.

Customized search: examples of this are Solr and Lucene. Now, Lucene is the open source technology that powers many of these solutions, including Elasticsearch, Solr, and our very own Atlas Search.
And while it’s great for search and you can have complete control over what features to use, it’s incredibly expensive to manage and maintain yourself, and requires seasoned experts on the team to run successfully.

Think twice before bolting on a search engine to your database

Option two for building out search, the bolt-on method, can lead to major complexity and architectural sprawl in your e-commerce platform.

First, bolting on a search engine to your database results in lower developer productivity. When it comes to building search functionality, developers have to use different drivers and query languages to access their database and search clusters. The learning curve and need to switch context can lead to wasted developer hours.

Second, architectural complexity. Since the data that needs to be searched is stored in the database cluster, there must be mechanisms in place to sync that data to the separate search cluster. This involves scripts and processes to transform and structure the data into a format that the search cluster can index and query. And when the underlying schema changes in the database, developers have to spend time coordinating those changes between the two systems.

Finally, there’s operational overhead. Every icon in the diagram above requires its own infrastructure and support to ensure that it’s highly available, secure, backed up, using the latest software, scaled to meet changing demands, etc. All of this costs time, people, and money. Think about it. You have three separate systems for database, search, and sync; even in fairly small companies, there could be 3-4 employees just maintaining this small part of the application.

How MongoDB helps: MongoDB Atlas Search

The biggest challenge for our product catalog is enabling users to browse and discover products using natural language queries and giving results quickly, even if the user doesn’t know the name of the product, or spells it incorrectly.
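To make the "spells it incorrectly" case concrete, here is a minimal sketch of two Atlas Search pipelines: one that tolerates typos through the `fuzzy` option of the `text` operator and returns highlighted snippets, and one that serves type-ahead suggestions through the `autocomplete` operator. The pipelines are built as plain Python dictionaries; the field names (`title`, `description`), the index name `default`, and the function names are assumptions for illustration. The autocomplete query also assumes that `title` has been indexed with the autocomplete field type.

```python
"""Typo-tolerant search and type-ahead suggestions as Atlas Search pipelines.

Field names, the index name "default", and function names are hypothetical.
"""


def fuzzy_search_pipeline(user_text, max_edits=1, limit=20):
    """Match terms within `max_edits` single-character edits of the query."""
    if max_edits not in (1, 2):
        raise ValueError("max_edits must be 1 or 2")
    return [
        {
            "$search": {
                "index": "default",
                "text": {
                    "query": user_text,
                    "path": ["title", "description"],
                    # Allows, for example, "snekers" to match "sneakers".
                    "fuzzy": {"maxEdits": max_edits},
                },
                # Ask for snippets showing where the match occurred.
                "highlight": {"path": "description"},
            }
        },
        {"$limit": limit},
        {
            "$project": {
                "title": 1,
                "score": {"$meta": "searchScore"},
                "highlights": {"$meta": "searchHighlights"},
            }
        },
    ]


def autocomplete_pipeline(prefix, limit=5):
    """Suggest product titles for a partially typed query."""
    return [
        {"$search": {"index": "default",
                     "autocomplete": {"query": prefix, "path": "title"}}},
        {"$limit": limit},
        {"$project": {"_id": 0, "title": 1}},
    ]
```

Keeping `max_edits` low is a design choice in this sketch: a larger edit distance catches more typos but also lets in more unrelated terms, which can hurt relevance.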
While many users will want to search our product catalog for a specific item or criteria they are looking for, many others will want to browse, then narrow the returned results by any number of attributes. So given the need to create a page like this, we face many of the challenges mentioned above: response time, multiple attributes, variant-level attributes, multiple variants, page sorting, pagination, and typos.

Another challenge for retailers is controlling which products are surfaced first in search results. For example, a company may be running a promotional campaign, prioritizing its own brand or excess stock on clearance. All of these complexities fall under the umbrella of relevance tuning.

For catalog and content search, MongoDB’s document model handles a massive variability of catalog and content data. MongoDB Atlas provides real-time analytics capabilities with time series collections to capture clickstreams and conversions. Atlas Search provides all of the features you need to deliver rich and personalized experiences to your users, including fuzzy matching; autocomplete; lightning-fast facets and counts; highlighting; relevance scoring; geospatial queries; and synonyms, all backed by support for multiple analyzers and languages. These capabilities come together to help you boost user engagement and improve customer satisfaction with your applications, from product catalog and content search to powering complex, ad-hoc queries in your line-of-business applications.

Atlas Search is part of MongoDB Atlas, the multi-cloud application data platform that combines transactional processing, relevance-based search, real-time analytics, mobile edge computing with cloud sync, and cloud data lake in an elegant and integrated data architecture. Through a flexible document data model and unified query interface, Atlas provides a superior developer experience to power almost any class of application.
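The relevance tuning, pagination, and facet requirements discussed above can be sketched as follows. This is one possible approach under stated assumptions, not the only way to do it: promoted products are boosted with a `should` clause inside a `compound` query, pages are made deterministic by sorting on score with `_id` as a tie-breaker, and facet counts come from a separate `$searchMeta` stage. The fields `promoted`, `brand`, and `price`, the index name `default`, the bucket boundaries, and the function names are all hypothetical, and the string facet assumes `brand` is indexed in a facet-capable way.

```python
"""Boosted, paginated search plus facet counts as Atlas Search stages.

Field names, index name, boundaries, and function names are hypothetical.
"""


def promoted_search_pipeline(user_text, page=0, page_size=24, boost=3):
    """Rank matches by relevance, lifting promoted products higher."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    compound = {
        # Every result must match the user's text.
        "must": [{"text": {"query": user_text,
                           "path": ["title", "description"]}}],
        # Matching this optional clause only raises the score.
        "should": [{"equals": {"path": "promoted", "value": True,
                               "score": {"boost": {"value": boost}}}}],
    }
    return [
        {"$search": {"index": "default", "compound": compound}},
        {"$addFields": {"score": {"$meta": "searchScore"}}},
        # _id breaks score ties so page boundaries stay stable.
        {"$sort": {"score": -1, "_id": 1}},
        {"$skip": page * page_size},
        {"$limit": page_size},
    ]


def facet_counts_stage(user_text):
    """Count matching products per brand and per price band."""
    return {
        "$searchMeta": {
            "index": "default",
            "facet": {
                "operator": {"text": {"query": user_text,
                                      "path": ["title", "description"]}},
                "facets": {
                    "brands": {"type": "string", "path": "brand",
                               "numBuckets": 10},
                    "prices": {"type": "number", "path": "price",
                               "boundaries": [0, 50, 100, 200],
                               "default": "other"},
                },
            },
        }
    }
```

Skip-based paging is the simplest option to illustrate; for deep result sets it gets slower with each page, so a production design may prefer a cursor-style approach.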
At the same time, it meets the most demanding requirements for resilience, scale, and data privacy. Most importantly, its geo-distributed clusters provide a distributed architecture for resilience and low latency. This allows for auto-scaling and sharding to handle promotional traffic spikes.

The benefits are immense: users can quickly find the most relevant matches using flexible search terms in any language, and can research and compare product and content categories. Your platform will be able to summarize products or content directly within the search results, and can boost preferred search results for promotions.

Learn more

Now that you have explored our e-commerce solutions for product catalogs and search, dive into our next post in the series: Approaches to inventory optimization.

May 4, 2015

MongoDB 3.0 and MongoDB Ops Manager are Now Generally Available

MongoDB 3.0 is now generally available for production deployments. Download the community version, or MongoDB Enterprise with Ops Manager, today. Thank you to our community for participating in the bug hunt and for helping to ensure that MongoDB 3.0 is production ready for mission-critical applications.

MongoDB 3.0 reduces operational overhead by up to 95%, has 7x-10x better write performance, reduces on-disk database size by up to 80%, and dramatically increases the ability to deliver apps to market quickly and efficiently.

There are a number of resources to help you learn more about the release of MongoDB 3.0 and Ops Manager:

- Download the What’s New in MongoDB 3.0 white paper
- Watch the What’s New in MongoDB 3.0 Webinar
- Read what our CTO, Eliot Horowitz, has to say about the GA of MongoDB 3.0
- Check out the 3-part blog series on what’s new in MongoDB 3.0:
  - What’s New in MongoDB 3.0 Part 1: 95% Reduction in Operational Overhead and Security Enhancements
  - What’s New in MongoDB 3.0 Part 2: Richer Query Language, Enhanced Tools and Global Multi-Data Center Deployments
  - What’s New in MongoDB 3.0 Part 3: Performance & Efficiency Gains, New Storage Architecture
- Sign up for the Introducing Ops Manager Webinar (March 5)
- Read the MongoDB 3.0 release notes and documentation

Hear what our customers have to say about MongoDB 3.0 and MongoDB Ops Manager in our press release.

On MongoDB 3.0

“Part of Soundwave's success stems from migrating the original app early from MySQL to MongoDB. MongoDB is not constrained by a rigid schema, so development is much faster and more iterative,” said David Lynch, Principal Engineer at Soundwave. “Now as the app continues to grow in use, using the latest version of MongoDB is key for us.
We're seeing significantly better write performance and great compression in testing, which will enable us to easily scale with the demands of our traffic and data volumes while also lowering our production costs.”

“The Sailthru team is excited about what MongoDB 3.0 will do for our massive production deployment,” said Jeffrey Berger, Lead Database Administrator at Sailthru. “In our tests, the compression factor alone will dramatically improve our hardware utilization, saving us immense amounts of money and time.”

On MongoDB Ops Manager

“The new automation features of Ops Manager have blown me away! I have used it to deploy replica sets, which I can now do at the click of a button, rather than dozens of manual steps previously,” said Laurence Moughan, Unix Systems Administration & Infrastructure Services at Aer Lingus. “Reconfiguring running instances is a breeze. The monitoring capabilities give me all the stats I could ever need to proactively manage my MongoDB deployment. Ops Manager will save our operations team many hours. I can't wait to get it into production!”

For additional customer quotes on MongoDB 3.0, including BuzzFeed, Chico’s FAS, eBay and Parse, visit here.

If you’re considering updating your version of MongoDB, take a look at our Major Version Upgrade consulting services: Upgrade with Confidence

March 2, 2015