MongoDB World


MongoDB World 2022 Recap — Performance Gotchas of Replicas Spanning Multiple Data Centers

Indeed has more than 25 million open jobs online at any one time. It stores more than 225 million resumes on Indeed systems, and it has 250 million unique users every month. Indeed operates enterprise-wide global clusters in the cloud across multiple availability zones all around the world, including the United States, Asia-Pacific, Europe, and Australia. Indeed is also a MongoDB super user. About 50% of everything Indeed does is built on MongoDB. In a recent session at MongoDB World 2022, Indeed senior cloud database engineer Alex Leong shared real-world experiences of performance issues when spanning replica sets across multiple data centers. He also covered how to identify these issues and, most importantly, how to fix them. This article provides highlights from Leong’s presentation, including dealing with changes in sync sources, replication lags, and more.

Resilience and performance

Indeed maintains multiple data centers for resiliency. Having multiple data centers ensures there's no single point of failure and keeps data in close proximity to job seekers' locations. This approach facilitates faster response times and a better overall end-user experience. Running multiple data centers can introduce other performance issues, however. One issue involves the initial sync of new nodes in the system, which needs to happen as quickly as possible to avoid returning stale data. Write concern is a critical consideration because, if there's an interruption on a primary node and a failover to a secondary, any changes captured on the secondary while the system was running in failover mode must still be preserved when the original primary is restored. Also, when you're running multiple data centers, changes in sync sources can occur that go unnoticed, and replication lag can build up when data centers are located far apart from each other.

Overriding sync sources

When you have an environment with hundreds of millions of users and enormous volumes of data spanning several geographic regions, spinning up and synchronizing a new node in a replica set creates logistical hurdles. To start, you have to decide where the new node syncs from. It seems logical that the default decision would be to sync with the nearest node. But, as Leong said in his session, at times you may not get the nearest sync source, and you may have to override the default sync source to choose the best one. This decision needs to be made early, Leong said, because making it later means any progress you've made toward syncing the new node will have been wasted.

Replication lags

Replication lag can occur between the primary and secondary nodes for several reasons, including downtime (planned or unplanned) on the primary server, a network failure, or a disk failure. Whatever the reason, there are ways to speed things up. In his session, Leong illustrates how to use the WiredTiger cache size to accelerate replication between nodes.

Changes in sync sources

Leong uses the term sync topology to describe how primary and secondary nodes are configured for syncing data between them. In some scenarios, a secondary node can change its sync source (sync topology) from one node to another, perhaps because the first node was busy at the time. MongoDB makes this change automatically, and it might not be noticed without looking at the log.

Fixing cross-data center write concerns

According to Leong, when write performance decreases, 99% of the time it's because of a change in sync sources. To be proactive, Leong creates a write performance monitor to identify and self-heal decreases in write performance so he doesn't have to find out the hard way (from users). Other critical performance issues covered in the session include chained replication (the process by which secondary nodes replicate from other secondaries rather than directly from the primary), changing write concern when a secondary node goes down, and how to configure write concerns across Availability Zones in AWS. For more details, watch the complete session from MongoDB World 2022: Performance Gotchas of Replicas Spanning Multi Datacenters.
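Leong's exact tooling isn't shown here, but as a rough mongosh sketch of the two knobs discussed above, overriding a member's sync source and requiring writes to be acknowledged in more than one data center might look like the following. The host name, tag names, and collection are illustrative assumptions, not taken from the session.

    // Tell a secondary to sync from a specific member instead of its automatically chosen source
    rs.syncFrom("mongodb-us-east-1.example.net:27017")

    // Define a custom write concern that is satisfied only when members in two
    // distinct data centers (tagged, e.g., { dc: "us" } and { dc: "eu" }) have the write
    cfg = rs.conf()
    cfg.settings.getLastErrorModes = { multiDataCenter: { dc: 2 } }
    rs.reconfig(cfg)

    // Use the custom mode on writes that must survive the loss of a data center
    db.jobs.insertOne({ title: "Data Engineer" }, { writeConcern: { w: "multiDataCenter" } })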

November 17, 2022

How to Model Data in Document Databases for Read and Write Performance

MongoDB is often seen as a good choice for storing unstructured data. The ability to persist data in MongoDB without defining what type of data it is or designing schema for it is one of the reasons many of our customers choose us. But the idea that MongoDB is a "schemaless" database is not accurate. Although a document database does allow you to store data without defining what it is, the shape of that data matters if you plan to do more than simply retrieve whole documents by keys. The Need to Model for NoSQL At this year's MongoDB World event, Daniel Coupal , staff developer advocate at MongoDB, explained that the need to model data in a document database is due to the presence of constraints that must be taken into account when you're persisting data in MongoDB. Constraints include things like network and hard disc speed, maximum file size of documents, and features that you don't have now but might add later. "If you look at the stack of an application, you have the application that talks to MongoDB that talks to whatever layer you add — it could be the cloud or a physical machine — those constraints are going to map to some of those layers," Coupal said. "It's imperative to know the features of the products you use." MongoDB offers features like transactions , field-level encryption , data federation , and archives , which require that you model data differently. In relational databases , data modeling is fairly straightforward due to the nature of third normal form (3NF) — the database schema design approach for relational databases. So, essentially, there’s only one solution for modeling the database. With the document model and MongoDB, however, you have several options for data modeling. You can nest everything under a single collection , and possibly wind up with duplicate sets of data (and, therefore, data concurrency issues), or you can use separate collections for different datasets and avoid duplicate data altogether. Ultimately, according to Coupal, "the optimal grouping of objects into collections is determined by the workload." During the session, Coupal provided a breakdown of data modeling methodology that involves a three-phase process starting with the workload, proceeding to relationships, and moving to patterns for optimization purposes. "In a lot of the solutions we're trying to build with NoSQL, performance is a top requirement," he said. He also cited better performance as one of the big reasons why people switch from SQL to MongoDB. What are the Data Access Patterns? In essence, with MongoDB, the way you plan to access the data determines the way you store it in the database. Data that is accessed together should be stored together. In the session, Coupal also presented an insightful analogy between the nature of data modeling in relational databases versus the document model in MongoDB. Essentially, the difference is that with the relational model, if you have a car and you want to model it, you'll take each part of the car individually and place it in its own table. Then, when you want to use the car, you have to reassemble it part by part (and table by table) before you can drive it. With MongoDB, you take the car and put it in the garage (the equivalent of a collection). When you want to use it, you take it out. That's it. "We do only one read on the disc to get everything we need together," Coupal said. Techniques for Data Modeling in MongoDB Coupal also provided an explanation of two different data modeling techniques: referencing and embedding . 
Embedding is a way to combine what would normally be two tables in a tabular database into one using an array. "The array is the expression of the one-to-many relationship," Coupal said. Referencing is useful for when the "many" side of the relationship is a huge number. Although MongoDB does support transactions, in almost all cases, it's better to use a document for more efficient read-write performance. As we know, developers are the ones who are most likely to understand the data access patterns for their applications. Properly designed schemas can increase performance for a given set of hardware by reducing computation, I/O, and contention. What really differentiates MongoDB from relational databases is the ability to co-locate related data in the atomic unit of storage so multiple values for an attribute can exist within a single record rather than being broken up into rows and stored independently. A document database with a properly designed schema lets you filter and retrieve data with minimal computational overhead and in a single I/O operation. This approach can make finding and retrieving data far faster and less expensive. To see the complete session from MongoDB World 2022, which includes a list of 12 data modeling patterns and techniques for evolving schema in MongoDB, watch The Principles of Data Modeling for MongoDB .
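To make the two techniques concrete, here is a minimal mongosh sketch using Coupal's car analogy; the collection and field names are illustrative assumptions rather than examples from the talk.

    // Embedding: the "many" side lives inside the parent document as an array,
    // so a single read returns the whole car
    db.cars.insertOne({
      _id: "car-42",
      model: "Roadster",
      parts: [
        { name: "engine", serial: "E-123" },
        { name: "wheel", serial: "W-456" }
      ]
    })

    // Referencing: better when the "many" side is huge; each child document points
    // back to its parent and is fetched only when needed
    db.parts.insertOne({ carId: "car-42", name: "engine", serial: "E-123" })
    db.parts.find({ carId: "car-42" })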

October 20, 2022

Relational to NoSQL at Enterprise Scale: Lessons from Amazon

When most people think about Amazon, they think of the cloud. But the company was founded more than a decade before anyone was talking about the cloud. In fact, by 2002, when Amazon founder Jeff Bezos wrote a now-famous internal email directing all new software development to be designed around service-oriented architecture, Amazon was already a $5 billion enterprise. In 2017, Amazon was generating more than 50 times that annual revenue, and like many enterprise organizations, the core of that revenue was driven by the monolithic services that formed the backbone of the business. Those monoliths didn’t go away overnight, and in 2017 and 2018, Amazon kicked off a massive RDBMS-to-NoSQL migration project called “Rolling Stone” to move about 10,000 RDBMS-backed microservices as well as decompose the remaining monoliths into microservices backed by NoSQL. Amazon chose to use its own NoSQL database, but the lessons from that huge effort are valuable for any migration to a NoSQL or document database. In this article, I’ll share some of the insights gained about when and how to use NoSQL. RDBMS costs At the time of this migration, I ran the NoSQL Blackbelt Team for Amazon’s retail business, which was the center of excellence for the business and which developed most of the design patterns and best practices that Amazon uses to build NoSQL-backed application services today. In 2017, Amazon had more than 3,000 Oracle server instances, 10,000 application services and 25,000 global developers, and almost the entire development team was well versed in relational database technology. The cost of the IT infrastructure driving the business, however, was spiraling out of control. As the team started to look for root causes, they quickly realized that the cost of the relational database management system (RDBMS) was a big line item. The infrastructure required to support RDBMS workloads was enormous and did not scale well to meet the needs of the company’s high-demand services. Amazon had the biggest Oracle license and the largest RAC deployments in the world, and the cost and complexity of scaling services built on RDBMS was negatively affecting the business. As a result, we started looking at what we were actually doing in these databases. A couple of interesting things came out. We found that 70% of the access patterns that we were running against the data involved a single row of data on a single table. Another 20% were on a range of rows on a single table. So, we weren’t running complex queries against the data at high velocity. In fact, the vast majority were just inserts and updates, but many of those were executed “transactionally” across disparate systems using two-phase commits to ensure data consistency. Additionally, the cost was very high for the other 10% of the access patterns because most were complex queries requiring multiple table joins. Technology triggers While the team was looking into these issues, they also noticed a trend in the industry: Per core CPU performance was flattening, and the server processor industry was not investing enough in 5 nm fabrication technology to meet the efficiency increases described by Moore’s Law . This is one of the reasons why Amazon built its own processor . If you look at the history of data processing, you’ll see a series of peaks and valleys in what can be defined as “data pressure,” or the ability of a system to process the required amount of data at a reasonable cost and within a reasonable amount of time. 
When one of these dimensions is broken, it defines a “technology trigger” that signals the need to invent something. At Amazon, we saw that the cost efficiency of the relational database was declining while the TCO of high time-complexity queries was increasing as a result. Something had to change. Relational data platforms only scale well vertically, which means getting a bigger box. Sooner or later, there is no bigger box, and the options to scale an RDBMS-backed system introduce either design complexity or time complexity. Sharding RDBMS systems is hard to self-manage. And, although distributed SQL insulates users from that complexity by providing things like distributed cross commits behind the API to maintain consistency, that insulation also comes at a cost, which can be measured in the time complexity of the queries running across the distributed backend. At the same time, the cost of storage was falling and the promise of denormalized, low time-complexity queries in NoSQL was enticing to say the least. Clearly, it was never going to get any cheaper to operate a relational database; it was only going to get more expensive. Thus, Amazon made the decision to undertake what may be the largest technology migration ever attempted and deprecate RDBMS technology in favor of NoSQL for all Tier 1 services. A new approach to building NoSQL skills Project Rolling Stone launched with great fanfare and had buy-in from all the right stakeholders. But things didn’t go well at first. Amazon’s developers were now using a database designed to operate without the complex queries they had always relied on, and the lack of in-house NoSQL data modeling expertise was crippling the migration effort. The teams lacked the skills needed to design efficient data models, so the early results from prototyped solutions were far worse than anticipated. To correct this situation, leadership created a center of excellence to define best practices and educate the broad Amazon technical organization; the NoSQL Blackbelt Team was formed under my leadership. The challenge before us was enormous. We had limited resources with global scope across an organization of more than 25,000 technical team members. The traditional technical training approach built on workshops, brown bags and hackathons did not deliver the required results because the Amazon organization lacked a core nucleus of NoSQL skills to build on. Additionally, traditional training tends to be sandboxed around canned problems that are often not representative of what the developers are actually working on. As a result, technical team members were completing those exercises without significant insight into how to use NoSQL for their specific use cases. To correct this situation, we reworked the engagement model. Instead of running workshops and hackathons, we used the actual solutions the teams were working on as the learning exercises. The Blackbelt Team executed a series of focused engagements across Amazon development centers, where we delivered technical brown bag sessions to advocate best practices and design patterns. Instead of running canned workshops, however, we scheduled individual design reviews with teams to discuss their specific workloads and prototype a data model they could then iterate on. The result was powerful. Teams gained actionable information they could build on, rather than general knowledge that might or might not be relevant to their use case. 
During the next three years, Amazon migrated all Tier 1 RDBMS workloads to NoSQL and reduced the infrastructure required to support those services by more than 50%, while still maintaining a high business growth rate. Watch Rick Houlihan’s full MongoDB World 2022 presentation, “From RDBMS to NoSQL at Enterprise Scale.” When to use NoSQL - Looking at Access Patterns When should you use NoSQL? I had to answer this question many times at Amazon, and the answer isn’t so clear-cut. A relational database is agnostic to the access pattern. It doesn’t care what questions you ask. You don’t have to know code, although some people would argue that SQL is code. You can theoretically ask a simple question and get your data. Relational systems do that by being agnostic to every access pattern and by optimizing for none of them. The reality is that the code we write never asks random questions. When you write code, you’re doing it to automate a process — to run the same query a billion times a day, not to run a thousand random queries. Thus, if you understand the access patterns, you can start doing things with the data to create structures that are much easier for systems to retrieve while doing less work. This is the key. The only way to reduce the cost of data processing and the amount of infrastructure deployed is to do less work. OLTP (online transaction processing) applications are really the sweet spot for NoSQL databases. You’ll see the most cost efficiency here because you can create data models that mirror your access patterns and representative data structures that mirror your objects in the application layer. The idea is to deliver a system that is very fast at the high-velocity access patterns that make up the majority of your workload. I talk more about data access patterns and data modeling at a recent Ask Me Anything . Making It All Work There’s a saying that goes, “Data is like garbage. You better know what you are going to do with it before you collect it.” This is where relationships come into play. Nonrelational data, to me, does not exist. I’ve worked with more than a thousand customers and workloads, and I’ve never seen an example of nonrelational data. When I query data, relationships become defined by the conditions of my query. Every piece of data we’re working with has some sort of structure. It has schema, and it has relationships; otherwise, we wouldn’t care about it. No matter what application you’re building, you’re going to need some kind of entity relationship diagram (ERD) that describes your logical data diagram, entities and how they’re related to understand how to model it. Otherwise, you’re just throwing a bunch of bytes in a bucket and randomly selecting things. A relationship always exists between these things. In relational models, they’re typically modeled in third normal form (3NF). For example, in a typical product catalog, you’ll see one-to-one relationships between products and books, products and albums, products and videos, one-to-many relationships between albums and tracks, and many-to-many relationships between videos and actors. This is a pretty simple ERD — we’re not even talking about any complex patterns. But suppose you want to get a list of all your products, you’d have to run three different queries and various levels of joins. That’s a lot of things going on. In a NoSQL database, you’re going to take all those rows and collapse them into objects. 
If you think about the primary access pattern of this workload, it’s going to be something like, “Get me the product by this ID,” or “Get me all the titles under this category.” Whenever you want the product, you typically want all the data for the product because you’re going to use it in a product window. If you put it all in one document, you no longer have to join those documents or rows. You can just fetch the data by product ID. If you think about what’s happening from a time-complexity perspective, when you have all that data in tables, your one-to-one joins won’t be so bad, but with a one-to-many, the time complexity starts expanding. Again, the examples mentioned here are fairly simple. When you start getting into nested joins, outer and inner, and other more complex SQL statements, you can imagine how much worse the time complexity becomes. That’s your CPU burning away, assembling data across tables. If you’re running a relational data model and you’re joining tables, that’s a problem. Index and conquer Let’s think about how we model those joins in NoSQL. To start, we have a key-value lookup on a product. But we can also create an array of embedded documents called “target” that contains all the things the product is related to, as shown in Figure 1. It contains metadata and anything about the product you need when you query by product ID. Now that we’re using embedded documents, there’s no more time complexity. It’s all an index lookup. It can be one-to-one, one-to-many, many-to-many — it doesn’t matter. As long as the aim is “get the document,” it’s still an index lookup. Figure 1:   Creating an array called “target” eliminates the need to join data from different rows, columns or tables. Of course, a lot more goes into an application than an index lookup. Remember, 70% of our access patterns at Amazon were for a single row of data, and 20% were for a range of rows on a single table. For more complex access patterns, we’re going to need more dimensions. If, for example, we’re running a query for all the books by a given author or all people related to “x,” this will require adding more dimensions, or documents, to the collection. We can create documents for other actors who were in a collection of movies, directors of all the movies, songs from the movies, how these songs relate to other entities in this collection, writers, producers, artists who performed the songs and all the albums those songs appeared on, as shown in Figure 2. Figure 2:   Create more dimensions by adding documents to the collection. Now, if I index the “target” array — which is one of the best things about MongoDB and document databases, multikey arrays — I can create essentially a B-tree lookup structure of those “target” IDs and join all those documents and all of those dimensions, as shown in Figure 3. Now I can select, for example, where target ID is Mary Shelley and get everything she’s related to — the books, people, critiques of her work. Where the target ID is a song title, I can get all the information about that song. Figure 3:   Multikey arrays create what is essentially a B-tree lookup structure, joining all related documents. Essentially, we’re using the index as a join mechanism, which is a critical distinction in NoSQL. At AWS, many teams came to me and told me that NoSQL doesn’t work. 
The key thing to understand, however, is if you index documents that are stored in the same table or collection on a common dimension that has the same value, you've essentially eliminated the need to join that same index across that same value and across multiple tables. That's what the relational database does. You don't want to join unindexed columns in a relational database because it will incur a lot of overhead. You want to index those attributes and put pointers to parent objects and child tables and then join on those IDs. With NoSQL, we're essentially placing all those items in a single table and indexing on the ID. This approach also eliminates the time complexity. If all those documents share a common table, and they're indexed on a common attribute, the time complexity is O(log(N)). Seventy percent of the overhead of handling a request from a database is not getting the data. It's managing the connection, marshaling the data and moving it back and forth across the TCP/IP stack. So, if I can eliminate one request from a transaction, I'm going to reduce the overhead of that transaction.

Conclusion

Data that is accessed together should be stored together. That is the mantra that we've always espoused at MongoDB. Once we started learning how to use NoSQL at Amazon, we started having better results. We did that through regularly scheduled training sessions where we could teach the fundamentals of NoSQL using our own workloads. That's what my developer advocacy team at MongoDB does now with customers. We provide templates for how to model data for their workloads to help them do it themselves.
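Houlihan's slides aren't reproduced here, but a minimal mongosh sketch of the "index as a join mechanism" idea looks roughly like this; the collection name, IDs, and fields are assumptions for illustration.

    // Heterogeneous items share one collection, each carrying a "target" array
    // of the entity IDs it relates to
    db.catalog.insertMany([
      { _id: "book-1", type: "book", title: "Frankenstein", target: ["author-shelley"] },
      { _id: "review-9", type: "review", rating: 5, target: ["book-1", "author-shelley"] }
    ])

    // A multikey index on the array acts as the join: one indexed lookup returns
    // every document related to a given ID, whatever the cardinality
    db.catalog.createIndex({ target: 1 })
    db.catalog.find({ target: "author-shelley" })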

September 26, 2022

How to Leverage Enriched Queries with MongoDB 6.0

MongoDB introduces useful new functions and features with every release, and MongoDB 6.0, released this summer, offers many notable improvements, including deeper insights from enriched queries via the MongoDB Query API. This set of query enhancements was announced at MongoDB World 2022 by senior product manager Katya Kamenieva. You can watch her presentation below. Watch Katya Kamenieva’s MongoDB World presentation on queries. Users can now use upgraded operators and change stream features. In this post, we’ll look at several of these updates, along with examples of how you can put them to use.

Top N accumulators

With this new feature, users can compute top items in each group based on the sort criteria ($topN, $bottomN), current order of documents ($firstN, $lastN), or value of a field ($maxN, $minN). This functionality would be useful, for example, if you have a collection of restaurants with ratings, and you want to see the top three highest-rated restaurants based on the type of cuisine. You can group by cuisine and use $topN to return the top three restaurants by rating.

Ability to sort arrays

The new $sortArray expression allows users to sort the elements of an array. For example, suppose you have posted content with hundreds of user comments, and you want to sort the comments based on how many likes they received. In this case, $sortArray can pull those comments and prioritize them to the top of the comments list.

Densification and gap-filling

These new additions to the aggregation framework help to build out time series data more completely. When attempting to create histograms of data over time, the new stages, $densify and $fill, allow you to fill gaps in that data to create smoother and more complete graphs using linear interpolation, last/next observed value carried forward, or a constant value. This capability can be helpful, for example, if you want to create a graph that shows the amount of inventory in a warehouse every day for a year, but the inventory was only recorded once a week. The $densify stage will fill the gaps in the timeline, while $fill will produce values for the inventory data based on the previous observation.

Joining sharded collections

With this new feature, when joining collections using $lookup or performing recursive search with $graphLookup, collections on both sides can be sharded. Before 6.0, only the originating collection could be sharded. An example use case is enriching records in the “accounts” collection with the list of the corresponding orders that are stored in the “orders” collection. In the past, only the “accounts” collection could be sharded. Starting with 6.0, both the “accounts” and “orders” collections can be sharded.

Change streams pre- and post-images

Change streams now offer point-in-time (PIT) pre- and post-image capabilities, allowing users to include the state of the document before and after changes in the output of the change stream. This functionality can be useful in many situations. For example, suppose a company is tracking flight times. If a flight is delayed, the system can compare the value of the departure and arrival times both before and after that delay and trigger an automatic rewrite of the schedule for the new flight timeline, including schedules for the entire crew.

Atlas Search across multiple collections

This improvement to MongoDB Atlas Search allows users to search across multiple collections with a single query using $search inside the $unionWith or $lookup stages.
$search can provide these results quickly, using only one query. Enriched queries are not the only improvements in MongoDB 6.0. Read about the 7 reasons to upgrade to MongoDB 6.0 and discover the possibilities. Try MongoDB Atlas for Free Today
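For reference, here is a brief mongosh sketch of the top-N and densification examples described above; the collection and field names are assumptions, not part of the announcement.

    // Top three highest-rated restaurants per cuisine
    db.restaurants.aggregate([
      { $group: {
          _id: "$cuisine",
          top3: { $topN: { n: 3, sortBy: { rating: -1 }, output: { name: "$name", rating: "$rating" } } }
      } }
    ])

    // Fill a weekly-recorded inventory series out to daily granularity,
    // carrying the last observed value forward
    db.inventory.aggregate([
      { $densify: { field: "date", range: { step: 1, unit: "day", bounds: "full" } } },
      { $fill: { sortBy: { date: 1 }, output: { quantity: { method: "locf" } } } }
    ])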

September 20, 2022

3 Reasons (and 2 Ways) to Use MongoDB’s Improved Time Series Collections

Time series data, which reflects measurements taken at regular time intervals, plays a critical role in a wide variety of use cases for a diverse range of industries. For example, park management agencies can use time series data to examine attendance at public parks to better understand peak times and schedule services accordingly. Retail companies, such as Walmart , depend on it to analyze consumer spending patterns down to the minute, to better predict demand and improve shift scheduling, hiring, warehousing, and other logistics. As more sensors and devices are added to networks, time series data and its associated tools have become more important . In this article, we’ll look at three reasons (and two ways) to use MongoDB time series collections in your stack. This in-depth introduction to time series data features MongoDB Product Manager Michael Gargiulo. Reason 1: Purpose-built for the challenges of time series data At first glance, time series collections resemble other collections within MongoDB, with similar functionalities and usage. Beneath the surface, however, they are specifically designed for storing, sorting, and working with time series data. For developers, query speed and data accessibility continue to be challenges associated with time series data. Because of how quickly time series data can accumulate, it must be organized and sorted in a logical way to ensure that queries and their associated operations can run smoothly and quickly. To address this issue, time series collections implement a key tenet of the MongoDB developer data platform: Data that is stored together is accessed together. Documents (the basic building block of MongoDB data) are grouped into buckets, which are organized by time. Each bucket contains time series data from a variety of sources — all of which were gathered from the same time period and all of which are likely to show up on the same queries. For example, if you are using time series collections to analyze the rise in summer temperatures of Valencia, Spain from 1980 to 2020, then one bucket will contain temperatures for August 1991. Relevant, but distinct buckets (such as temperatures for the months of June and July 1991) would also be stored on the same page for faster, easier access. MongoDB also lets you create compound indexes on any measurement field in the bucket (whether it’s timeField or metaField) for faster, more flexible queries. Because of the wide variety of indexing options, operations on time series data can be executed much more quickly than with competing products. For example, scan times are reduced by indexing buckets of documents (each of which has a unique identifier) rather than individual documents. In terms of the previous example, you could create an index on the minimum and maximum average summer temperatures in Valencia, Spain from 1980 to 2020 to more quickly surface necessary data. That way, MongoDB does not have to scan the entire dataset to find min and max values over a period of nearly four decades. Another concern for developers is finding the last metadata value, which in other solutions, requires users to scan the entire data set — a time-consuming process. Instead, time series collections use last point queries, where MongoDB simply retrieves the last measurement for each metadata value. As with other fields, users can also create indexes for last points in their data. In our example, you could create an index to identify the end of summer temperatures in Valencia from 1980 to 2020. 
By indexing the last values, time series collections can drastically reduce query times. Another recurring challenge for time series applications is data loss from Internet of Things (IoT) applications for industries such as manufacturing, meteorology, and more. As sensors go offline and gaps in your data appear, it becomes much more difficult to run analytics, which require a continuous, uninterrupted flow of data. As a solution, the MongoDB team created densification and gap filling. Densification, executed by the $densify command, creates blank, placeholder documents to fill in any missing timestamps. Users can then sort data by time and run the $fill command for gap filling. This process will estimate and add in any null or missing values in documents based on existing data. By using these two capabilities in tandem, you will get a steady flow of data to input into aggregation pipelines for insights. Reason 2: Keep everything in house, in one data platform Juggling different data tools and platforms can be exhausting. Cramming a bunch of separate products and technologies into a single infrastructure can create complex architectures and require significant operational overhead. Additionally, a third-party time series solution may not be compatible with your existing workflows and may necessitate more workarounds just to keep things running smoothly. The MongoDB developer data platform brings together several products and features into a single, intuitive ecosystem, so developers can use MongoDB to address many common needs — from time series data to change streams — while reducing time-consuming maintenance and overhead. As a result, users can take advantage of the full range of MongoDB features to collect, analyze, and transform time series data. You can query time series collections through the MongoDB Compass GUI or the MongoDB Shell , utilize familiar MongoDB capabilities like nesting data within documents, secondary indexes, and operators like $lookup or $merge, and process time series data through aggregation pipelines to extract insights and inform decision making. Reason 3: Logical ways to organize and access time series data Time series collections are designed to be efficient, effective, and easy to use. For example, these collections utilize a columnar storage format that is optimized for time series data. This approach ensures efficiency in all database operations, including queries, input/output, WiredTiger cache usage, and storage footprints for both data and secondary indexes. Let’s look, for example, at how querying time series data collections works. When a query is executed, two things happen behind the scenes: Bucket unpacking and query rewrites. To begin with, time series collections will automatically unpack buckets — similar to the $unwind command. MongoDB will unscroll compressed data, sort it, and return it to the format in which it was inserted, so that it is easier for users to read and parse. Query rewrites work alongside bucket unpacking to ensure efficiency. To avoid unpacking too many documents (which exacts a toll in time and resource usage), query rewrites use indexes on fields such as timestamps to automatically eliminate buckets that fall outside the desired range. For example, if you are searching for average winter temperatures in Valencia, Spain from 1980 to 2020, you can exclude all temperatures from the spring, summer, and fall months. 
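As a quick sketch of how this looks in practice, the following mongosh commands create a time series collection and a secondary index on its metadata and time fields; the collection and field names are illustrative assumptions.

    // Documents are automatically grouped into time-ordered buckets behind the scenes
    db.createCollection("temperatures", {
      timeseries: { timeField: "timestamp", metaField: "location", granularity: "hours" }
    })

    db.temperatures.insertOne({
      timestamp: ISODate("1991-08-01T12:00:00Z"),
      location: { city: "Valencia", country: "Spain" },
      tempCelsius: 34.1
    })

    // A secondary index on metadata plus time speeds up range and last-point queries
    db.temperatures.createIndex({ "location.city": 1, timestamp: -1 })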
Now that we’ve examined several reasons to consider MongoDB time series collections, we’ll look at two specific use cases. Use case 1: Algorithmic trading Algorithmic trading is a major use case for time series data, and this market is predicted to grow to $15 billion by 2028 . The strength of algorithms lies in their speed and automation; they reduce the possibility of mistakes stemming from human emotions or reaction time and allow for trading frequency beyond what a human can manage. Trading algorithms also generate vast volumes of time series data, which cannot necessarily be deleted, due to compliance and forecasting needs. MongoDB, however, lets you set archival parameters, to automatically move that data into cheaper cloud object storage after a preset interval of time. This approach preserves your valuable storage space for more recent data. Using MongoDB products such as Atlas, materialized views, time series collections, and triggers, it is also possible to build a basic trading algorithm. Basically, time series data will be fed into this algorithm, and when the conditions are ideal, the algorithm can buy or sell as needed, thus executing a series of individual trades with cumulative profits and losses (P&L). Although you’ll need a Java app to actually execute the trades, MongoDB can provide a strong foundation on which to build. The structure of such an algorithm is simple. Time series data is loaded from a live feed into MongoDB Atlas, which will then input it into a materialized view to calculate the averages that will serve as the basis of your trades. You can also add a scheduled trigger to execute when new data arrives, thereby refreshing your materialized views, keeping your algorithm up to date, and not losing out on any buying/selling opportunities. To learn more, watch Wojciech Witoszynski’s MongoDB World 2022 presentation on building a simple trading algorithm using MongoDB Atlas, “Algorithmic Trading Made Easy.” Use case 2: IoT Due to the nature of IoT data, such as frequent sensor readings at fixed times throughout a day, IoT applications are ideally suited for time series collections. For example, Confluent, a leading streaming data provider, uses its platform alongside MongoDB Atlas Device Sync , mobile development services, time series collections, and triggers to gather, organize, and analyze IoT data from edge devices. IoT apps often feature high volumes of data taken over time from a wide range of physical sensors, which makes it easy to fill in meta fields and take advantage of densification and gap filling features as described above. MongoDB’s developer data platform also addresses many of the challenges associated with IoT use cases. To begin with, MongoDB is highly scalable, which is an important advantage, given the huge volumes of data generated by IoT devices. Furthermore, MongoDB includes key features to enable you to make the most of your IoT data in real time. These include change streams for identifying database events as they occur, or functions, which can be pre-scheduled or configured to execute instantaneously to respond to database changes and other events. For users dealing with time-based data, real-time or otherwise, MongoDB’s time series collections offer a seamless, highly optimized way to accelerate operations, remove friction, and use tools, such as triggers, to further analyze and extract value from their data. 
Additionally, users no longer have to manually bucket, query, or otherwise troubleshoot time series data; instead, MongoDB does all that work for them. Try MongoDB time series collections for free in MongoDB Atlas .
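The session's trading pipeline isn't reproduced here, but a rough mongosh sketch of the materialized-view step from the algorithmic trading example might look like the following; the collection names, fields, and window size are assumptions.

    // Compute a five-minute moving average per symbol and upsert it into a
    // materialized view; a scheduled trigger could re-run this as new ticks arrive
    db.ticks.aggregate([
      { $setWindowFields: {
          partitionBy: "$symbol",
          sortBy: { timestamp: 1 },
          output: { movingAvg: { $avg: "$price", window: { range: [-5, 0], unit: "minute" } } }
      } },
      { $merge: { into: "tick_averages", whenMatched: "replace" } }
    ])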

August 30, 2022

Hear From the MongoDB World 2022 Diversity Scholars

The MongoDB Diversity Scholarship program is an initiative to elevate and support members of underrepresented groups in technology across the globe. Scholars receive complimentary access to the MongoDB World developer conference in New York, on-demand access to MongoDB University to prepare for free MongoDB certification, and mentorship via an exclusive discussion group. This year at MongoDB World, our newest cohort of scholars got the opportunity to interact with company leadership at a luncheon and also got a chance to share their experience in a public panel discussion at the Community Café. Hear from some of the 2022 scholars, in their own words. Rebecca Hayes, System Analyst at Alliance for Safety and Justice I did an internal transition from managing Grants/Contracts to IT and just finished a data science certificate (Python, Unix/Linux, SQL) through my community college. My inspiration for pursuing STEM was wanting to understand how reality is represented in systems and how data science can be used to change the world. What was your most impactful experience as part of the Diversity Scholarship? Most impactful were the conversations I had with other attendees at the conference. I talked to people from all sectors who were extremely knowledgeable and passionate about shaping the future of databases. The opportunity to hear from MongoDB leaders and then understand how the vision behind the product was being implemented made me feel inspired for my future in STEM. How has the MongoDB World conference inspired you in your learning or your career path? MongoDB World inspired me to understand the real world applications of databases. I left knowing what's possible with a product like MongoDB and the limits of SQL and traditional databases. After the conference, I wrote this article on Medium reflecting on what I learned at the conference. What is your advice to colleagues pursuing STEM and/or on a similar path as you? Embrace what makes you unique. Just because things take time doesn't mean they won't happen. When learning programming and data science, think about how your work relates to the real world and share those thoughts with others. Seek out new perspectives, stay true to yourself, and keep an open mind. Delphine Nyaboke, Junior Software Engineer at Sendy I am passionate about energy in general. My final year project was on solar mini-grid design and interconnection. I have a mission of being at the intersection of energy and AI What inspired me to get into tech is the ability to solve societal problems without necessarily waiting for someone else to do it for you. This can be either in energy or by code. What was your most impactful experience as part of the Diversity Scholarship? My most impactful experience, apart from attending and listening in on the keynotes, was to attend the breakout sessions. They had lovely topics full of learnings and inspiration, including Engineering Culture at MongoDB; Be a Community Leader; Principles of Data Modeling for MongoDB; and Be Nice, But Not Too Nice just to mention but a few. How has the MongoDB World conference inspired you in your learning or your career path? MongoDB World has inspired me to keep on upskilling and being competitive in handling databases, which is a key skill in a backend engineer like myself. I will continue taking advantage of the MongoDB University courses and on-demand courses available thanks to the scholarship. What is your advice to colleagues pursuing STEM and/or on a similar path as you? 
STEM is a challenging yet fun field. If you’re tenacious enough, the rewards will trickle in soon enough. Get a community to be around, discuss what you’re going through together, be a mentor, get a mentor, and keep pushing forward. We need like-minded individuals in our society even in this fourth industrial revolution, and we are not leaving anyone behind. Video: Watch the panel in its entirety Raja Adil, Student at Cal Poly SLO Currently, I am a software engineer intern at Salesforce. I started self-teaching myself software development when I was a junior in high school during the COVID-19 pandemic, and from there I started doing projects and gaining as much technical experience as I could through internships. Before the pandemic I took my first computer science class, which was taught in C#. At first, I hated it as it looked complex. Slowly, I started to enjoy it more and more, and during the pandemic I started learning Python on my own. I feel blessed to have found my path early in my career. What was your most impactful experience as part of the Diversity Scholarship? My most impactful experience was the network and friends I made throughout the four days I was in New York for MongoDB World. I also learned a lot about the power of MongoDB, as opposed to relational databases, which I often use in my projects. How has the MongoDB World conference inspired you in your learning or your career path? The MongoDB World conference was amazing and has inspired me a ton in my learning path. I definitely want to learn even more about MongoDB as a database, and in terms of a career path, I would love to intern at MongoDB as a software engineer down the line. What is your advice to colleagues pursuing STEM and/or on a similar path as you? My advice would be to network as much as you can and simply make cool projects that others can use. Evans Asuboah, Stetson University I am an international student from Ghana. I was born and raised by my dad, who is a cocoa farmer, and my mum, who is a teacher. I got into tech miraculously, because my country's educational system matches majors to students according to their final high school grades. Initially, I wanted to do medicine, but I was offered computer science. I realized that computer science could actually be the tool to help my community and also use the knowledge to help my dad on the farm. What was your most impactful experience as part of the Diversity Scholarship? The breakout room sessions. As scholars, we had the chance to talk to MongoDB employees, and the knowledge and experiences changed my thoughts and increased my desire to persevere. I have learned never to stop learning and not to give up. How has the MongoDB World conference inspired you in your learning or your career path? Meeting these amazing people, connecting with the scholars, being at the workshops, and talking to the startups at the booths has made me realize the sky is the limit. I dare to dream and believe until I see the results. What is your advice to colleagues pursuing STEM and/or on a similar path as you? 1. Explore MongoDB; 2. You are the only one between you and your dream; 3. Take the initiative and meet people; 4. Never stop learning. Daniel Erbynn, Drexel University I love traveling and exploring new places. 
I am originally from Ghana, and I got the opportunity to participate in a summer program after high school called Project ISWEST, which introduced me to coding and computer science through building a pong game and building an Arduino circuit to program traffic lights. This made me excited about programming and the possibilities of solving problems in the tech space. What was your most impactful experience as part of the Diversity Scholarship? My most impactful experience was meeting with other students and professionals in the industry, learning from them, making lifelong connections, and getting the opportunity to learn about MongoDB through the MongoDB University courses. How has the MongoDB World conference inspired you in your learning or your career path? This conference has inspired me to learn more about MongoDB and seek more knowledge about cloud technology. What is your advice to colleagues pursuing STEM and/or on a similar path as you? Don’t be afraid to reach out to people you want to learn from, and create projects you are passionate about. Build your skills with MongoDB University's free courses and certifications . Join our developer community to stay up-to-date with the latest information and announcements.

August 12, 2022

4 Common Misconceptions About Security That Hackers Hope You Don't Know

I’ve always thought hacking was harder than it looks on TV. But after two years exploring the world of ethical hacking, I’ve learned that almost anyone can hack almost anything if they have sufficient knowledge and think creatively. Practically every week, we hear about another data breach at a large organization or a new vulnerability in a widely used software package. Many of these exploits are the result of misconceptions about security, which often lead to security misconfigurations. In this post, I'll cover the four most common security misconceptions and explain how hackers leverage them to execute attacks. I'll also explain how MongoDB can help you protect your data against these attacks. Watch Dawid Esterhuizen's MongoDB World 2022 presentation, Hack the MongoDB Planet!

1. NoSQL Injection

The first rule of secure web development is: Never trust user input. This brings me to the first common misconception: MongoDB doesn’t use SQL, so I don’t have to worry about SQL injection. I’ve spoken with countless developers who have this false sense of security. The truth is, as NoSQL databases gain an increasing share of the database market, attackers are starting to pay attention, and new exploit tools are starting to emerge, like NoSQLmap (a NoSQL version of SQLmap), which allows hackers to perform automated NoSQL injections on vulnerable applications and endpoints. But enough theory. Let’s start hacking (ethically). Shown below is an example of a vulnerable application endpoint. The application uses req.query.user and req.query.pass, which were provided by the user, as the values for name and password. It takes the username and password from the query string, builds a query, and checks to see if the user exists.

    app.get('/', async (req, res) => {
      const query = { name: req.query.user, password: req.query.pass }
      const user = await client.db("secret_bank").collection("users").findOne(query)
      if (user) {
        res.send('Success')
      } else {
        res.send('Failed')
      }
    })

Vulnerable application endpoint.

This is a fairly basic authentication method, but one that can be easily exploited with some command line tools. If you enter the correct username and password information, you can see the presence of strings on the server. (I’ve added some extra logging so you can see what happens on the server.)

Normal login.

The username and password are confirmed on the server and everything seems normal. MongoDB queries are usually built with objects and arrays. The question from an attacker’s point of view is, can they inject those onto the server? One of the interesting and not-very-well-known features of URL query parameters is that if you add a square bracket to a parameter in the URL, it is converted into an array. As you can see from this example, I added the password, notRight, as an array on the server.

Add square brackets to the parameter to see the presence of an array.

Say you’re an attacker and you want to inject a list of two values. You just have to do it twice with the same parameter and you’ll get an array list of values on the server.

If you add square brackets to the same parameter twice in the command line, you'll get an array list of values on the server.

This would definitely be useful for an attacker seeking to inject NoSQL into the application. The next question an attacker might ask is, can they inject an object on the server? In any programming language, if you set a field name inside the square brackets you get an object. Here, you see the same with an object on the server.
If you set the field name inside square brackets, you'll see the object on the server.

So the attacker can now send objects to the server, but how do they exploit that? In an SQL injection, the most basic authentication bypass is a quote (") followed by "or one equals one" (1=1) and a double dash. That query evaluates to true every time.

Bypass login in SQL.

The SQL query above will evaluate whether the user is equal to admin, which in our case is true. Then it checks whether the password is equal to an empty string (due to our quote), or whether one is equal to one, which is always true. The query will therefore always return the ID of the username if it exists. There are a few NoSQL options that will do the same. If your password is actually set, you can run a query that checks if it's greater than nothing, which it always is, and use that to bypass authentication. Similar to an SQL injection, there are a few NoSQL queries hackers can run to bypass application authentication. This is a vulnerability in the application, not the database. The application assumes that the user is sending a string for the password and user. If the application forces the password and username to a string with .toString(), no matter what input the user gives, it is not parsed. A lot of this is pretty obvious, but let's take it a step further. Shown below are a few lines of Python script that iterate over some characters, play a little with regex, and then do some blind NoSQL injection. By doing that, I'll demonstrate how we can pull some data out of the database.

    import string
    import requests

    baseURI = "http://localhost:3000/?user=admin&pass[$regex]=^"
    chars = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)
    foundString = ""
    found = False
    count = 0
    while (not found):
        currentCharacter = chars[count]
        url = baseURI + foundString + currentCharacter
        if requests.get(url).text == "Success":
            count = 0
            foundString += currentCharacter
            url = baseURI + foundString + "$"
            found = requests.get(url).text == "Success"
        else:
            count += 1

Once we kick off the exploit, the script continuously iterates over every character.

This Python script performs a blind NoSQL injection to extract a password.

The main objective here is to extract the clear-text password, since that was our vulnerable parameter in this case. This exploit is about a dozen lines of code that took about 10 minutes to write. With NoSQLmap, this could be even easier. Now, I know all you Java and C# developers are out there thinking your applications are safe. You'll be surprised to know that you're still vulnerable. All it takes is a very interesting approach to query building by parsing JSON strings as queries. In 2021, a quite popular open source chat server identified this exploit, using regex blind injection. The attackers in this case pulled out the reset tokens for users. Once they were able to reset a user’s password, they escalated the user’s privileges, making them an admin. Once an admin, they could set up integration, which is a fancy word for running code on your server. And once that happens, you've been pwned. So it bears repeating: Never trust any input from any user, ever!

2. Social engineering attacks

People and businesses get hacked every day, and most security attacks and data breaches use social engineering. Great hackers will only let you know about their presence when they want you to know about it. This brings me to the second common misconception: I don't need authentication because the database will always be running on a private network. Authentication is the most basic security feature.
Whether you're using SCRAM-SHA, LDAP, x509, or Kerberos, if you do not lock the front door, someone will gain access, and social engineering is the preferred method. All it takes is someone innocently clicking on a malicious file, which results in a macro trojan that gives the attacker access to the network. Once they're in your network, they can scan for open ports, then test whether authentication is enabled for the database.

Network scan.

Once they find a database without authentication, you've been pwned.

You've been pwned.

It's that simple.

3. TLS and network traffic

Let's say you have now enabled authentication, and your data is behind a lock and a strong password (at least 16 characters using both uppercase and lowercase letters, numbers, and symbols). You have even moved credentials into a secrets vault to stop attackers from reading them out of the source code or config files. You're safe, right? Well, this brings me to the third common misconception: We don't need TLS, MongoDB uses a binary protocol. Unfortunately, with the popularity of bring-your-own-device options, employees often install software from unofficial sources without any permission or oversight from IT. Vulnerabilities in third-party software can easily lead to unauthorized access to the network by a malicious actor. Once again, they can perform a scan and check to see if authentication is set up, and it is.

Checking authentication.

At this point, they can attempt a brute force attack, but that's a lot of work, and hackers prefer easier approaches. Instead, they start sniffing the network and find multiple packets for the MongoDB database.

Sniffing the network.

They can intercept the traffic and send it to their own machine. Once they have the data on their machine, and if they know their tools well (Tshark or Wireshark would help), they can access a PCAP file. And if they output it as JSON, they can use jq to manipulate it. The example below shows BSON as hexadecimal code.

BSON dump.

The CyberChef tool has a decoder for BSON.

Decoding BSON.

No username is required. And once again, you've been pwned. The lesson here is that TLS should always be used to ensure any data that is transferred between two systems is encrypted and, even if intercepted, cannot be decrypted without the private key.

4. Retrieving files after they've been deleted

So now you have authentication with strong passwords, the application is safe, the network is safe, and you're using encryption at rest. You're totally secure, right? Nope. This brings me to the fourth common misconception: Encryption at rest will protect me against hackers. A 2014 study found that 78% of drives that were sold for reuse after decommissioning and allegedly being wiped still had data on them; 23% of those still had associated social security numbers, and 21% had financial information. Needless to say, this is a problem. If you use a keyfile for encryption at rest, a hacker can retrieve the keyfile along with the data files from an old, discarded drive or one that was not securely wiped. And once again, you've been pwned. The best defense against this sort of vulnerability is to always use a key management interoperability protocol (KMIP) server for encryption at rest.

Malicious insiders

It's important to remember that if an attacker gains entry to your system while it's running, they can get ahold of your data, no matter what system you're using. This brings me to an attack vector that is one of the biggest risks for businesses right now: malicious insiders.
Retrieving files after they've been deleted So now you have authentication with strong passwords, the application is safe, the network is safe, and you're using encryption at rest. You're totally secure, right? Nope. This brings me to the fourth common misconception: Encryption at rest will protect me against hackers. A 2014 study found that 78% of drives that were sold for reuse after decommissioning, and that had allegedly been wiped, still had data on them; 23% of those still had associated Social Security numbers, and 21% had financial information. Needless to say, this is a problem. If encryption at rest is configured with a keyfile, a hacker who retrieves an old, discarded, or poorly wiped drive can recover the keyfile along with the data files and decrypt them. And once again, you've been pwned. The best defense against this sort of vulnerability is to always use a Key Management Interoperability Protocol (KMIP) server for encryption at rest, so the key never lives on the same disk as the data.

Malicious insiders It's important to remember that if an attacker gains entry to your system while it's running, they can get hold of your data, no matter what system you're using. This brings me to an attack vector that is one of the biggest risks for businesses right now: malicious insiders. According to a recent study, 70% of organizations are seeing an increase in insider attacks, these attacks can take up to 200 days to identify, and around the globe 35% of businesses have been affected. Imagine a trusted employee accepts a job with one of your competitors and decides to take your database with them on the way out. They don't want to use their credentials to access it for fear of being detected. They could dump the memory and get what's in the cache, but that's a lot of work. They could also dump the memory and look for the encryption keys, which is also a lot of work. Alternatively, they could just use the keyfile and dump the data. And once again, you've been pwned. Insider attacks are some of the hardest to protect against because of the level of knowledge and access employees have. If it seems like no one is safe, that's what I'm trying to show you: if someone makes it onto your server while it's running, all bets are off, especially if they have root access. Client-side field-level encryption changes that: it encrypts the data on the application server before it's sent to the database — not even your DBAs can read the data. The keys that are being used live in a key management system (KMS). Client-side field-level encryption should be used for all sensitive information. (A minimal sketch of the pattern appears at the end of this article.)

Secure by default Let's face it, security configuration is a full-time job, and you already have one of those. MongoDB Atlas provides a database that is secure by default. At a minimum, it always requires a username and password, it always requires TLS, and network and IP access lists further restrict access to the database. MongoDB Atlas uses a zero-trust design that complies with all the major regulatory frameworks organizations are subject to. When you use client-side field-level encryption in your applications outside of MongoDB Atlas and put your data inside Atlas, which is secure by default, you've added yet another layer between malicious insiders and your data. Try a preview version of Queryable Encryption to encrypt data end-to-end and query on randomly encrypted data. Try MongoDB Atlas for free.
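To close out the application-side story, here is a minimal sketch of the client-side field-level encryption pattern recommended above, using PyMongo's explicit encryption API. It assumes pymongo is installed with its encryption extra, and it uses a throwaway local master key plus hypothetical collection and field names purely for illustration; a production deployment would point the KMS provider settings at a real key management service.

    # Minimal sketch of client-side field-level encryption with PyMongo
    # (requires the pymongo[encryption] extra). Names and the "local" master
    # key are illustrative only; use a real KMS provider in production.
    import os
    from bson.codec_options import CodecOptions
    from pymongo import MongoClient
    from pymongo.encryption import Algorithm, ClientEncryption

    client = MongoClient("mongodb://localhost:27017")

    kms_providers = {"local": {"key": os.urandom(96)}}   # throwaway 96-byte master key
    client_encryption = ClientEncryption(
        kms_providers,
        "encryption.__keyVault",     # namespace that stores the data keys
        client,
        CodecOptions(),
    )
    data_key_id = client_encryption.create_data_key("local")

    # The SSN is encrypted in the application before it ever leaves the app
    # server, so anyone reading the database sees only ciphertext.
    encrypted_ssn = client_encryption.encrypt(
        "123-45-6789",
        Algorithm.AEAD_AES_256_CBC_HMAC_SHA_512_Deterministic,
        key_id=data_key_id,
    )
    client.hr.employees.insert_one({"name": "Alice", "ssn": encrypted_ssn})

    # Only a holder of the key can turn the ciphertext back into the SSN.
    doc = client.hr.employees.find_one({"name": "Alice"})
    print(client_encryption.decrypt(doc["ssn"]))

Queryable Encryption, mentioned above, builds on the same idea but lets the server run expressive queries against encrypted fields without ever seeing the plaintext.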

August 10, 2022

Video: Canva's Lessons From Scaling MongoDB Atlas to 10 Billion Documents Across 100 Nodes

Running complex, global, and mission-critical infrastructure at scale is difficult, and anyone who has done it for any length of time usually has a few gnarly lessons to share. At MongoDB World in June 2022, we were lucky enough to feature someone who had done just that. Michael Pearson, software engineering team lead at Canva, gave a talk titled “10 Billion Documents: How Canva Scaled MongoDB to 100 Nodes.” I’ve had the pleasure of working alongside Pearson and his team for almost a year now, and his presentation focused on some of the massive challenges (and lessons) they’ve faced over the last two years as they have scaled into tens of terabytes of data and tens of billions of documents. I’m writing this blog to give a few highlights, but I’d recommend everyone check out the original talk in full. A tricky problem For the uninitiated, Canva is a visual communication platform that empowers its users to design anything and publish anywhere. Or, as Pearson explained in his talk, “Canva is a really simple way to create beautiful designs and presentations.” Canva’s mission is to empower the world to design, and more than 85 million people in over 190 countries use the platform every month. As you can imagine, this presents a huge data challenge — and opportunity. Canva holds more than 10 billion designs and receives up to 30,000 document requests per second. The success of the platform comes down to providing a fantastic user experience every time, and to do that they need to present their customers with the right data at the right time. “This could be a really tricky problem for a database platform, particularly for a business based in Sydney with many users on the other side of the world,” said Pearson. MongoDB Atlas supports the Document Service, which enables opening, creating, updating, or deleting any design on Canva. The Document Service is critical for every single user — if the Document Service is down, then Canva’s users can’t design. But before we get too far into things, we should probably start with why Canva started using MongoDB in the first place. Flexibility to support rapidly changing requirements “Canva was launched to the world back in 2013, when MongoDB was very new to the scene,” explains Pearson. “I'm not sure if there were any other databases that would have been up for the challenge.” From those earliest days, MongoDB's flexible document model was the perfect fit for Canva's increasingly complex designs and document types. “The flexibility that MongoDB gave us in those early days was instrumental to our success. As the Canva platform evolved, we were throwing new schema and new features at it. MongoDB would just handle it,” Pearson said. “Its continued innovation and problem-solving means MongoDB remains as valuable to us today as it was in 2012.” At the same time, it was essential that Canva’s engineering team stayed focused on building Canva rather than spending time managing the data platform. With that in mind, Canva chose to run MongoDB as a service. After trying out multiple options, they went with MongoLabs and, in 2019, following MongoDB's acquisition of MongoLabs, Canva migrated onto MongoDB Atlas, running on AWS, where they remain to this day. Ten years of relative bliss “Before 2021, we had a very hands-off approach to how we used MongoDB,” said Pearson. “MongoDB just handled it. We didn't have to think about it at all." That's incredible, right?
Think about it — for nearly a decade the team barely had to think about their data layer and could spend their time working on new features and making the actual product better for its millions of users around the world. It's what every developer wants. Eventually, though, Canva’s own success created certain challenges around scaling. With the stratospheric increase in growth, the load on the Document Service also continued to increase. MongoDB’s ability to scale horizontally through the use of sharding was critical to overcoming initial scale challenges, something that traditional database management systems would have struggled to achieve, said Pearson. With MongoDB sharding, data is distributed, or partitioned, across multiple machines — useful when no single machine can handle a large workload. In due course, though, some attributes of Canva’s workload presented a new challenge. Said Pearson: “We were unique in that we have one cluster with one collection with a million chunks. Our documents are fairly large, given our product has evolved over the years and we put more and more stuff into our documents.” In other words, Canva makes many updates to relatively large documents, and by mid-2021 the surge in traffic was causing issues. “Our peak traffic caused three main problems: inability to run the balancer, latency issues, and disk usage pretty much at capacity,” Pearson explained. “A really ineffective cache caused a really high write load to our cluster. This was causing downstream failures." Pearson discussed some of the tactical solutions the company took. “Disabling the balancer immediately brought us back to service, but now we knew that there was something wrong with that cluster and we couldn’t operate without the balancer,” said Pearson. “We also noticed that the number of chunks in our cluster had skyrocketed, from around 400,000 to just over a million.” Getting to the root of the problem The professional services team at MongoDB discovered that “metadata updates were causing anywhere from one-second to five-minute stalls in the cluster.” Going from 400,000 chunks to a million chunks, at a rate of roughly a minute per change, was not optimal. There were three things to address with that cluster: reduce the number of chunks, reduce the disk contention, and reduce the size of documents. “With regard to reducing the number of chunks, we just took any contiguous chunks on a shard and merged them unconditionally,” said Pearson. “This was tooling built in collaboration with MongoDB.” (A rough sketch of what such a merge pass might look like appears below.) After three months of merging chunks, Canva saw massive improvements in its cluster’s performance. A failure rate during reboot of around 4% dwindled to less than 1% during maintenance operations. Further, to address latency spikes and full-disk capacity, the team formulated a six-step plan to move from network-based storage volumes to locally attached disks. This has proved a huge success. “We were able to run the balancer. Our large spikes in latency were pretty much all gone, and our disk usage was almost at zero,” Pearson said. He continued: "The key takeaway for me is that sharding is great, but it's never a silver bullet. I don't think we would have caught these issues so quickly without such a thorough incident review process and such a close working relationship with MongoDB."
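Pearson didn't share Canva's internal tooling, but the merge pass he describes can be sketched conceptually in a few lines. The sketch below is an illustration of the idea, not the tool Canva and MongoDB built: it assumes a connection to a mongos, a hypothetical sharded namespace, and the pre-MongoDB 5.0 config.chunks layout in which each chunk document carries an ns field.

    # Illustrative only: unconditionally merge contiguous chunks that sit on
    # the same shard, walking config.chunks in shard key order.
    from pymongo import MongoClient

    NS = "docs.documents"  # hypothetical sharded namespace
    client = MongoClient("mongodb://my-mongos.example:27017")  # must be a mongos

    chunks = list(client.config.chunks.find({"ns": NS}).sort("min", 1))

    i = 0
    while i < len(chunks) - 1:
        run = [chunks[i]]
        # Extend the run while the next chunk is contiguous and on the same shard.
        while (i + 1 < len(chunks)
               and chunks[i + 1]["shard"] == run[0]["shard"]
               and chunks[i + 1]["min"] == run[-1]["max"]):
            run.append(chunks[i + 1])
            i += 1
        if len(run) > 1:
            client.admin.command({
                "mergeChunks": NS,
                "bounds": [run[0]["min"], run[-1]["max"]],
            })
        i += 1

Real tooling would also need to pace the merges and coordinate with the balancer, since, as the talk makes clear, metadata changes at this scale are anything but free.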
What was learned? After presenting all of that information, Pearson closed out the presentation with a few key lessons. For anyone interested in running infrastructure at a massive scale, they are simple and worth taking note of: Take advantage of the flexible document model to accelerate your pace of development. Ensure chunks are distributed uniformly across the cluster in a consistent size. Maintain a thorough incident review process and include your trusted partners (such as MongoDB). Reliability is an essential part of Canva’s engineering practice, and prolonged service disruptions were particularly upsetting not only for engineers but for Canva’s global users. Pearson is glad to report that Canva has seen a turnaround in the number of incidents impacting its Document Service. This has freed the document team to shift focus back to shipping features and ensuring every user has a flawless experience using Canva. Interested in joining Canva as it pursues its mission to empower the world to design? Canva is looking for a software engineer to join its Core Data team. Want to take advantage of the flexible document model to accelerate your pace of development? Learn more about MongoDB Atlas.

August 2, 2022

10 Things We Learned at MongoDB World 2022

When you return to a normal routine after a long break, you find out how much you miss your old routine. After hosting MongoDB World remotely for two years, we were happy to get back to seeing people in person — almost 3,000 of them. Here’s a quick rundown of the top 10 things we learned at MongoDB World 2022. 1. Queryable Encryption was a hit How many times have you been to a concert and the opening act winds up being as good as the band you actually went to see? Queryable Encryption was like that at MongoDB World 2022. While a lot of attendees came to learn about MongoDB Atlas Search or Atlas Serverless Databases , they were equally intrigued by the ability to encrypt data in use and perform rich, expressive queries on encrypted data. This groundbreaking innovation is the result of a collaborative effort between Brown University cryptographer Seny Kamara, his longtime collaborator Tarik Moataz, and MongoDB. 2. Developers are in the driver's seat Starting with the opening keynote by MongoDB CEO Dev Ittycheria, MongoDB World reinforced the notion that developers are the key to the future success and productivity for today’s organizations. “Every product we build, every feature we develop, is all geared toward developer productivity,” Ittycheria said. In fact the entire event centered on powerful new tools that are now available in our developer data platform. In the Partner Promenade, dozens of vendors showed how they’re helping developers become faster and more productive. As Søren Bramer Schmidt, chief architect and founder of Prisma, explained, “New generations of developers are much bigger, and we can invest in better tooling for them. It’s an exciting time to be building tools for developers.” As the world increasingly goes digital, developers will be the key to companies’ success. Services, products, and advancements are inherently tied to the ability of developers to quickly build, iterate, and release. 3. Everyone's data is in motion The volume of data moving to the cloud is unprecedented. In a session titled “Connecting Distributed Data to MongoDB With Confluent,” Joseph Morais, cloud partner solutions architect for Confluent , cited a study that predicted 75% of all databases would be on a cloud platform by 2022. MongoDB senior vice president of product management, Andrew Davidson, said, “MongoDB has really broken through with the MongoDB Relational Migrator at the perfect time, since so many enterprises are accelerating their efforts to get off legacy relational databases and legacy on-premises estates to move to MongoDB Atlas.” 4. Public cloud security is not as easy as some people think While scores of businesses are increasing their cloud footprints with new cloud-native services and applications, securing them is becoming increasingly complex. Steve Walsh, senior solutions architect at MongoDB, gave a session titled “Securing Your Application's Data in the Public Cloud” and cited constantly changing cloud deployments and security policies in multi-cloud environments as reasons why security can be three times more complex in a multi-cloud environment. According to an ITRC study that Walsh cited, failure to configure cloud settings properly caused 30% of data breaches in 2021. MongoDB Atlas is designed to be secure by default , which simplifies the process of restricting access to sensitive data. 5. 
Ray Kurzweil might be even more prescient than he realizes On Day 3 of MongoDB World 2022, best-selling author, pioneering inventor, and futurist Ray Kurzweil delivered a wide-ranging keynote address covering everything from computational power to vaccine trials to life expectancy and literacy rates. In the address, Kurzweil said it was likely that an AI would pass a Turing test by 2029. Just days later, news reports came out about a Google engineer who’d been fired after claiming that an artificial-intelligence chatbot the company developed had become sentient , though the company dismissed the claims. 6. Attendees were eager to try MongoDB It’s easy to assume that everyone who came to MongoDB World was already using it and wanted to know about new features and capabilities. But in the Learn Booth at the event, plenty of visitors weren't using MongoDB at all — they were there to discover and evaluate. In the Ask the Experts booth, roughly one in 10 people asked about how to prepare to migrate to MongoDB. One of the most common questions we heard was, "How do I convert relational schemas to the document model?" We have tools like Relational Migrator to help with that. We also recommend training for developer and ops teams, including our MongoDB for SQL Pros university course and our Developer-Led Training programs to ramp them up on what makes MongoDB different from SQL. 7. Developer friction comes in many forms The opening keynote address and product announcements set the stage for many of the conversations we had over the next few days. We consistently heard from developers about the friction points that we could help eliminate for them, and how reducing developer friction results in real benefits — apps and services get launched that could not have existed otherwise because of the toll that complexity takes on development teams’ bandwidth. Atlas Serverless databases are going to be a big part of getting those new services off the ground because it’s one less thing developers have to worry about. And the MongoDB CLI allows developers to interact with our services using the method they’re familiar with — especially advanced developers who prefer control and speed over a more visual interface. 8. @MarkLovesTech draws the crowds MongoDB CTO Mark Porter was the center of the action at the event. Wherever he went, a crowd would gather, eager to meet, exchange thoughts, and ask questions. His talks during the Builder’s Fest were standing room only. Mark Porter delivers a short talk on scaling and managing teams at MongoDB World 2022. Photo by Eoin Brazil. 9. Every software company needs custom track jackets Our field marketing team knocked it out of the park with the custom track jacket. After MongoDB CEO Dev Ittycheria debuted the jacket during the Day 1 keynote , it immediately became the most desired piece of swag of the show. A few lucky contestants won their own track jackets during the Builder’s Fest. Developers are either highly fashion-conscious or avid joggers. 10. There's no replacement for in-person gatherings For almost three years, we’ve been getting by with remote events and Zoom calls, but we learned at least two more things from MongoDB World 2022: There’s no replacement for real-life, in-person experiences, and remote interactions actually require a different set of skills. “It is not impossible to talk with people on Zoom. But it requires so much more intentionality,” Mark Porter said. 
“My takeaway from MongoDB World is making sure that in this new hybrid world, we can talk with people! But even on Zoom, we must become much more focused on the intentionality of talking with them because it is so much different."

June 17, 2022

Highlights From MongoDB World 2022, Day 3

As we said on Day 1 , MongoDB World is a developer-focused event. And on Day 3, we really set out to prove it. The day got going with a keynote from best-selling author, pioneering inventor, and futurist Ray Kurzweil. His encyclopedic knowledge covers a wide range of topics and subject areas, and his talk was equally broad and freewheeling, touching on everything from computational power to vaccine trials to life expectancy and literacy rates. Kurzweil’s general viewpoint was overwhelmingly positive. He cited global poverty and literacy rates, per capita income, and the spread of democracy as examples of how the world is steadily becoming a better place to live. Not shy of making predictions, Kurzweil anticipates computational power roughly doubling each year, bringing AI ever closer to emulating human intelligence. In fact, he predicts that some AI systems will be able to pass the Turing test by 2029. And he sees humans eventually connecting directly to AI systems, expanding our emotional and intellectual intelligence far beyond our current state. He refers to this eventuality as the “ singularity ” and with it, human life will be changed forever. Minds were blown, but not so much that the developers in attendance weren’t ready to get down to doing what they love to do: building apps and writing code. Immediately after the keynote, Builder’s Fest kicked into gear in the Partner Promenade. The floor of the Jacob Javits Center was transformed by dozens of pods where MongoDB experts, partners, and customers gave hands-on tutorials showing how their services and applications integrated with the MongoDB developer data platform. Booming over the main sound system was a super-sized, four-person Mario Kart battle royale, where the victors won prizes like a Nintendo Switch. Another pod hosted a Price is Right–style game show, The Database is Right, where contestants drawn from the audience answered trivia questions about MongoDB, document databases, and database functions. Adjacent to the Bob Barker cosplay, MongoDB senior product manager Rob Walters gave an eager audience a live demo of how to configure the MongoDB Connector for Apache Kafka to use MongoDB as a source or a sink. Our Kafka connector enables developers to build robust, reactive data pipelines that stream events between applications and services in real time. Over on the Google Cloud Coding Stage, four developers competed to see who could build the closest version of the Google homepage in 20 minutes — without previewing their work. The blind coding test resulted in some fairly primitive approximations of the real thing, but all four contestants were praised for their high pressure creations. The winner of each round took home a limited edition MongoDB track jacket. MongoDB CTO Mark Porter joined in a number of Builder’s Fest activities, delivered several short talks, and often drew a crowd for impromptu Q&A. At one point he gave a “Chaos Presentation” — an improvised talk guided by randomly selected imagery — about the outages that inevitably occur in the public cloud, despite the exceptionally resilient infrastructures and high service levels. “Mirror image is an illusion,” Porter said. “A laptop is not staging, staging is not production, and production is not production.” Different regions have different hardware and configuration patterns that can build up over time, he said. “Staging has had far more rollbacks than production,” he said. “Find weaknesses in your architecture by doing post-mortems after an outage. 
Make staging environments reproducible by blowing them away from time to time. By making staging more predictable, over the course of a few years, you can make production more predictable.” In response to an audience question about what’s more important, implementing a culture of committing to rollbacks or automating it, he said, “The culture of rollbacks is what’s important, but at scale — meaning a couple thousand engineers — culture won’t be enough. You’ll need to automate some of it. But make it so rollbacks are not a bad thing.” A few pods over, Prisma developer advocate Sabine Adams gave a talk entitled “Giving MongoDB Guardrails.” His talk included step-by-step instructions, using the brand-new MongoDB Atlas CLI, on how to ensure data consistency by providing an easy-to-read schema and a type-safe database client. First, he set up a MongoDB cluster with the CLI, then he initialized a TypeScript project with Prisma to model the data, and then he used the Prisma CLI to create and retrieve some data. The Prisma client provides an API for reading data in MongoDB, including filters, pagination, ordering, and relational queries for embedded documents. If you want more highlights from MongoDB World 2022, read our Day One and Day Two recaps. For all those who attended the event, we’re happy you made it. For anyone who missed it, we hope to see you at next year's event.

June 10, 2022

Highlights From MongoDB World 2022, Day 2

Day Two of MongoDB World 2022 was all about the breakout sessions — more than 80 were on tap for the day. Things kicked off shortly after 8 a.m. with a discussion on empowering women and other underrepresented groups in the workplace, held in the IDEA Lounge. The 9 a.m. slot was packed with 10 sessions that ranged from building a sustainable ecosystem to the principles of data modeling to using Rust to build applications. Steve Westgarth, senior director of engineering at GSK (formerly GlaxoSmithKline), dove into the weighty topic of morality in the digital world and what developers ought to do when the software they build leads to unintended consequences. All too often, there’s immense pressure to release MVPs early — before all potential vulnerabilities have been vetted. Westgarth’s session sprang from a rhetorical question: “Do we as engineers have an ethical and moral responsibility to anticipate unintended consequences and how much personal responsibility should an individual take to ensure ethical management of data?” His discussion answered that with a yes — developers do have to weigh the risk of unintended consequences, such as data breaches, versus the desire to maximize market opportunity. Westgarth urged developers to ask themselves what the unintended consequences are of the software they have in production, and to raise awareness of these issues in their organizations. A 15-minute lightning talk followed, with a session name that made it a popular draw for fans of worst-case scenarios: “Strange Cases From the Field.” Adam Schwartz, MongoDB director of technical services in EMEA, walked attendees through some especially challenging real-life technical support stories. He gave a detailed account of such curious cases as The Mistaken Hypotheses and The Unsuccessful Mitigations, and shared lessons he learned during years in the trenches as a support specialist. Closing on a positive note, he assured attendees that problem cases are rare, most cases have straightforward solutions, and exceptional cases are always a learning experience. Day One saw Mark Porter announce the MongoDB Relational Migrator, including a live demo of the product. On Day Two, lead product manager Tom Hollander did a deep dive into use cases, justifications, and future capabilities for the tool. MongoDB Relational Migrator imports and analyzes relational database schemas, maps them to an appropriate MongoDB schema, and transforms and migrates the data into MongoDB. Hollander said organizations can experience a 3x to 5x increase in development velocity and cost reductions of up to 70% by migrating away from relational models in favor of a more modern deployment such as MongoDB Atlas. He anticipates future capabilities to include continuous replication, Kafka integration, application code generation, schema recommendations, and more. One company thriving in its legacy modernization efforts is Vodafone. Its global head of engineering and transformation, Felipe Canedo, described Vodafone’s transition from a traditional telecommunications company to a Telco-as-a-Service (TaaS) provider. At the core of this transition was the creation of a scalable and open platform for the company’s engineers to innovate with complete freedom and flexibility. Canedo said Vodafone chose MongoDB because of its security, cloud-native high availability, support for multi-region and multi-cloud deployments, agile delivery, professional services, and ease of integration.
The ultimate goal, Canedo said, was to provide Vodafone engineers with the best software experience possible. Day One also saw MongoDB CPO Sahir Azam announce the general availability of MongoDB Atlas serverless instances. On Day Two, MongoDB advisory solutions architect Carlos Castro gave a live demo of deploying a serverless database. In 15 minutes, starting from the Atlas dashboard, Castro took the audience step-by-step through the process of selecting a cloud provider, spinning up the instance, creating an app service, authentication, and users, and then setting up rules to allow users to access data on the instance. Serverless instances always run the latest version of Atlas, include always-on security, and enable customers to pay only for the operations they run. Day Two also featured several discussions with leading experts and MongoDB partners. MongoDB senior vice president of product management Andrew Davidson hosted a panel with three leaders in the effort to close the Developer Experience Gap: Peggy Rayzis, senior director of developer experience for Apollo GraphQL; Lee Robinson, director of developer relations for Vercel; and Søren Bramer Schmidt, chief architect and founder of Prisma. Rayzis cited Apollo’s supergraph as one way it's helping developers be more productive by unlocking their flow state. “When you’re in that flow state, you’re writing better code, making better decisions, and developing better value for consumers,” she said. Schmidt pointed out how the newest generation of developers stands to benefit the most from the proliferation of developer tools. “New generations of developers are much bigger and we can invest in better tooling for them,” Schmidt said. “It’s an exciting time to be building tools for developers.” Robinson emphasized the important role the open source community plays in these tools. “People hear about Vercel through Next.js,” he said, “and we invest to give back to the open source community.” As gratifying and fun as the first two days of World were, we really have something special in store for Day Three. It kicks off with a final keynote address by best-selling author, pioneering inventor, and futurist Ray Kurzweil. Day Three also features our Builder’s Fest, where even MongoDB CTO Mark Porter is expected to lend his considerable expertise to a few promising projects. With live game shows, chaos presentations, nerd battles, and more, MongoDB World 2022 will finish on a high note. Check back tomorrow for more highlights from MongoDB World 2022.

June 9, 2022

Highlights from MongoDB World 2022, Day 1

MongoDB World is back in person at New York’s Jacob Javits Center after a three-year hiatus. Day One featured a jam-packed schedule of educational sessions, live tutorials, customer stories, and product announcements for a crowd of nearly 2,700 developers and IT professionals. The developer-focused conference got off to an early start with breakout sessions beginning at 8 a.m. Three sessions were on tap: an introduction to data modeling with MongoDB, a primer on MongoDB Atlas Search , and a tutorial on getting started with MongoDB Atlas . In that tutorial, MongoDB solution architect Tom Gleitsmann explained how, out of all the challenges developers face on a daily basis, the common denominator is friction. Gleitsmann gave a crisp and informative summary of MongoDB Atlas features that were engineered specifically to reduce the amount of friction developers face, including ease of deployment, security by default, data visualization, the Performance Advisor , alerts, and backup scheduling, to name a few. The early-morning sessions were followed by a keynote delivered by MongoDB CEO Dev Ittycheria and Chief Product Officer Sahir Azam celebrating the company’s rapid growth, setting out a vision for its future, and highlighting several of its customers. The executives were joined on stage by Vercel founder and CEO Guillermo Rauch, Wells Fargo head of digital enablement Catherine Li, Avalara VP of software engineering John Jemseck, and several MongoDB product experts, each providing insight into the latest enhancements to MongoDB. The biggest reveal, though, was a new vision for MongoDB Atlas and the products that work seamlessly with it, such as Atlas Search and Atlas Data Federation . “We believe that developers want to build on a modern data model that's designed to the way they think and the way they code,” Ittycheria said. “And we also believe that developers want an elegant developer experience that makes their lives so much easier. And they want all this in one unified platform. What they need is a developer data platform.” After the morning keynote, sessions ran back-to-back until lunch. They ranged from quick, 15-minute “chalk talks” to hour-plus deep dives. In one, MongoDB software engineer James Wang gave a hands-on tutorial on using our data visualization tool, MongoDB Atlas Charts , which is fully integrated with MongoDB Atlas. Wang showed how easy it is to link data sources in just a few clicks. Using a fictitious company, he demonstrated step-by-step how to embed data visualization via code snippets and an SDK, share the data with others using a public link, filter data inside the admin web page, and restrict access to authorized users. Attendees followed along on their own laptops and were quickly able to replicate the visualizations. In another talk, Keller Williams’ senior architect Jim McClarty shared some of the real-world impact of Atlas — how it has accelerated the real estate firm’s ability to innovate its applications, how essential Atlas Search is in their applications, and how Charts has become “the best hidden feature in Atlas.” Attendees shuttled from room to room like they had places to go and people to meet, which they did. MongoDB principal, industry solutions, Felix Reichenback took attendees through mobile sync and why developers often waste tons of time trying to build their own sync tool that fails to handle conflict resolution because of the intermittent nature of mobile connections. 
Next, Michael van der Haven, VP at consulting giant CGI and expert in cloud-native platforms, explained how he helped the energy industry’s open source architecture group, OSDU, migrate away from Elasticsearch, simplify its architecture by removing memory-intensive indexes, and reduce OPEX by six figures using MongoDB Atlas. After lunch, MongoDB CTO Mark Porter gave an energetic keynote, announcing several more new products and features, including the new MongoDB Atlas CLI , the general availability of the Data API , and, perhaps our biggest announcement of the day, Queryable Encryption , which allows users to search their databases while sensitive data stays encrypted. Available in preview, Queryable Encryption offers a big step forward in protecting sensitive data. Porter gave personal anecdotes illustrating many of the hurdles developers have to overcome that have nothing to do with building software, such as rigid and fragile relational databases, and working with SQL, a language that developers early in their careers or fresh out of school have no desire to work with. Porter’s keynote address included a live demo of the Relational Migrator, which, while risky to perform in front of an audience, went off flawlessly. Meanwhile, a series of events kept the IDEA Lounge a lively place, including a great panel discussion called Our Journey: Being Black in Tech. And a floor below the workshops, more than a dozen MongoDB partners demonstrated their platforms and related products — including many of the companies named MongoDB Partners of the Year . The schedule for Day 2 is equally packed, with more than 80 sessions that include partner showcases, strange cases from the field, book club sessions, more deep dives into product announcements and tutorials, and talks on diversity, equity, and inclusion. In the afternoon, MongoDB celebrates Pride with food, drinks, and entertainment at the historic Stonewall Inn. And MongoDB World 2022’s biggest event happens at the end of the day — “The Party,” featuring music from The Midnight and Don Diablo, as well as retro arcade games and an open bar. Check back tomorrow for more highlights from MongoDB World 2022.

June 8, 2022