How to Model Data in Document Databases for Read and Write Performance

MongoDB is often seen as a good choice for storing unstructured data. The ability to persist data in MongoDB without declaring its types or designing a schema up front is one of the reasons many of our customers choose us. But the idea that MongoDB is a "schemaless" database is not accurate. Although a document database does allow you to store data without defining what it is, the shape of that data matters if you plan to do more than simply retrieve whole documents by key.

The Need to Model for NoSQL

At this year's MongoDB World event, Daniel Coupal, staff developer advocate at MongoDB, explained that data in a document database needs to be modeled because of constraints that must be taken into account when you're persisting data in MongoDB. Constraints include things like network and hard disk speed, the maximum size of a document, and features that you don't have now but might add later.

"If you look at the stack of an application, you have the application that talks to MongoDB that talks to whatever layer you add — it could be the cloud or a physical machine — those constraints are going to map to some of those layers," Coupal said. "It's imperative to know the features of the products you use." MongoDB offers features like transactions, field-level encryption, data federation, and archives, which require that you model data differently.

In relational databases, data modeling is fairly straightforward because third normal form (3NF), the standard schema design approach for relational databases, dictates how data is split into tables. As a result, there is essentially one solution for modeling a given dataset. With the document model and MongoDB, however, you have several options for data modeling. You can nest everything under a single collection and possibly wind up with duplicate copies of the same data (and, therefore, the risk that those copies drift out of sync), or you can use separate collections for different datasets and avoid duplicate data altogether. Ultimately, according to Coupal, "the optimal grouping of objects into collections is determined by the workload."
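The trade-off between the two options can be sketched with plain Python dictionaries standing in for documents. This is an illustration only, not MongoDB driver code: the blog posts, author fields, and function names are all hypothetical. It shows that duplicated data makes an update touch every copy, while a separate collection needs a single write.

```python
# Hypothetical data: two blog posts written by the same author.

# Option 1: nest everything, so the author's details are copied into each post.
posts_embedded = [
    {"_id": 1, "title": "Intro", "author": {"name": "Ada", "email": "ada@old.example"}},
    {"_id": 2, "title": "Part 2", "author": {"name": "Ada", "email": "ada@old.example"}},
]

# Option 2: separate collections, with each post pointing at one author record.
authors = {10: {"name": "Ada", "email": "ada@old.example"}}
posts_referenced = [
    {"_id": 1, "title": "Intro", "author_id": 10},
    {"_id": 2, "title": "Part 2", "author_id": 10},
]


def change_email_embedded(posts, name, new_email):
    """Update every duplicated copy and return how many documents were written."""
    writes = 0
    for post in posts:
        if post["author"]["name"] == name:
            post["author"]["email"] = new_email
            writes += 1
    return writes


def change_email_referenced(author_collection, author_id, new_email):
    """Update the single author record; this is always one write."""
    author_collection[author_id]["email"] = new_email
    return 1


embedded_writes = change_email_embedded(posts_embedded, "Ada", "ada@new.example")
referenced_writes = change_email_referenced(authors, 10, "ada@new.example")
print(embedded_writes, referenced_writes)
```

Which option is better depends on the workload: if posts are read far more often than authors change their details, paying for the extra writes may be worthwhile.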

During the session, Coupal provided a breakdown of data modeling methodology that involves a three-phase process starting with the workload, proceeding to relationships, and moving to patterns for optimization purposes. "In a lot of the solutions we're trying to build with NoSQL, performance is a top requirement," he said. He also cited better performance as one of the big reasons why people switch from SQL to MongoDB.

What are the Data Access Patterns?

In essence, with MongoDB, the way you plan to access the data determines the way you store it in the database. Data that is accessed together should be stored together.

In the session, Coupal also offered an analogy that contrasts data modeling in relational databases with the document model in MongoDB. With the relational model, if you have a car and you want to model it, you take each part of the car and place it in its own table. Then, when you want to use the car, you have to reassemble it part by part (and table by table) before you can drive it. With MongoDB, you take the car and put it in the garage (the equivalent of a collection). When you want to use it, you take it out. That's it. "We do only one read on the disk to get everything we need together," Coupal said.
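The car analogy can be made concrete with a small sketch. Lists of dictionaries stand in for relational tables and a dictionary keyed by ID stands in for a collection; the table names, fields, and helper functions are assumptions chosen for illustration.

```python
# Relational-style storage: each part of the car lives in its own table.
cars = [{"car_id": 1, "model": "Roadster"}]
engines = [{"car_id": 1, "cylinders": 4}]
wheels = [{"car_id": 1, "position": p} for p in ("FL", "FR", "RL", "RR")]


def assemble_car(car_id):
    """Rebuild the car by visiting every table (three separate lookups)."""
    car = next(c for c in cars if c["car_id"] == car_id)
    engine = next(e for e in engines if e["car_id"] == car_id)
    car_wheels = [w["position"] for w in wheels if w["car_id"] == car_id]
    return {
        "model": car["model"],
        "engine": {"cylinders": engine["cylinders"]},
        "wheels": car_wheels,
    }


# Document-style storage: the whole car is parked in one "garage" (collection).
garage = {
    1: {
        "model": "Roadster",
        "engine": {"cylinders": 4},
        "wheels": ["FL", "FR", "RL", "RR"],
    }
}


def fetch_car(car_id):
    """Take the car out of the garage with a single lookup by key."""
    return garage[car_id]


# Both approaches yield the same car; only the amount of reassembly differs.
same_car = assemble_car(1) == fetch_car(1)
print(same_car)
```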

Techniques for Data Modeling in MongoDB

Coupal also explained two data modeling techniques: embedding and referencing. Embedding combines what would normally be two tables in a tabular database into one document by using an array. "The array is the expression of the one-to-many relationship," Coupal said. Referencing keeps related data in separate documents that point to each other by an identifier, and is useful when the "many" side of the relationship is very large. Although MongoDB does support transactions, in almost all cases it's better to keep data that changes together in a single document for more efficient read-write performance.
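A minimal sketch of the two techniques follows, again using plain Python dictionaries rather than a MongoDB driver. The order and sensor examples, along with every field and function name, are hypothetical.

```python
# Embedding: the "many" side (line items) lives in an array inside the parent.
order_embedded = {
    "_id": 100,
    "customer": "Ada",
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

# Referencing: the "many" side is large, so each child document is stored
# separately and carries the parent's identifier.
sensor = {"_id": "s1", "location": "roof"}
readings = [{"sensor_id": "s1", "seq": i, "value": 20 + i} for i in range(1000)]


def total_quantity(order):
    """Everything needed is already inside the one order document."""
    return sum(item["qty"] for item in order["items"])


def latest_readings(sensor_id, limit):
    """Follow the reference to fetch only the most recent child documents."""
    matching = [r for r in readings if r["sensor_id"] == sensor_id]
    return sorted(matching, key=lambda r: r["seq"], reverse=True)[:limit]


order_total = total_quantity(order_embedded)
recent = latest_readings(sensor["_id"], 3)
print(order_total, [r["seq"] for r in recent])
```

Embedding the 1,000 readings in the sensor document would make every read of the sensor pull all of them along; referencing lets the application load only the slice it needs.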

Developers are the ones who are most likely to understand the data access patterns for their applications. Properly designed schemas can increase performance on a given set of hardware by reducing computation, I/O, and contention. What really differentiates MongoDB from relational databases is the ability to co-locate related data in the atomic unit of storage, so multiple values for an attribute can exist within a single record rather than being broken up into rows and stored independently. A document database with a properly designed schema lets you filter and retrieve data with minimal computational overhead and in a single I/O operation. This approach can make finding and retrieving data far faster and less expensive.
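As a rough illustration of multiple values for one attribute living in a single record, the sketch below filters hypothetical product documents on an array field. It uses plain Python to mimic the idea and is not driver code; the collection contents and the `find_by_color` helper are assumptions.

```python
# Hypothetical product documents: "colors" holds several values in one record,
# where a normalized relational design would use a separate product_colors table.
products = [
    {"_id": 1, "name": "Trail shoe", "colors": ["red", "black"], "price": 90},
    {"_id": 2, "name": "Road shoe", "colors": ["white"], "price": 120},
    {"_id": 3, "name": "Sandal", "colors": ["black", "tan"], "price": 40},
]


def find_by_color(collection, color):
    """Return documents whose "colors" array contains the given value."""
    return [doc for doc in collection if color in doc["colors"]]


black_products = find_by_color(products, "black")
black_names = [doc["name"] for doc in black_products]
print(black_names)
```

Each matching document comes back complete, with all of its colors and its price, so no second lookup is needed to finish the result.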

To see the complete session from MongoDB World 2022, which includes a list of 12 data modeling patterns and techniques for evolving schema in MongoDB, watch The Principles of Data Modeling for MongoDB.