Transitioning from Relational Databases to MongoDB - Data Models

MongoDB

This post was written by Bryan Reinero, a Consulting Engineer at MongoDB. Understanding how to use MongoDB isn’t difficult, but it does require you to change the way you think about databases if you are coming from a traditional relational database management system (RDBMS). This blog post is designed to help you understand data modeling and schema design in MongoDB from the perspective of someone used to programming with an RDBMS. I’ll explore the fundamental differences between RDBMS and MongoDB and highlight the advantages and design compromises of each approach.

Fundamental Differences The immediate and fundamental difference between MongoDB and an RDBMS is the underlying data model. A relational database structures data into tables and rows, while MongoDB structures data into collections of JSON documents. JSON is a self-describing, human readable data format. Originally designed for lightweight exchanges between browser and server, it has become widely accepted for many types of applications.

JSON documents are particularly useful for data management for several reasons. A JSON document is composed of a set of fields which are themselves key-value pairs. This means each JSON document carries its own human readable schema design with it wherever it goes, allowing the documents to easily move between database and client applications without losing their meaning.

JSON is also a natural data format for use in the application layer. JSON supports a richer and more flexible data structure than tables made up of columns and rows. In addition to supporting field types like number, string, Boolean, etc., JSON fields can be arrays or nested sub-objects. This means we can represent a set of sophisticated relations which are a closer representation of the objects our applications work with. Using JSON documents in our database means we don’t need an object relational mapper between our database and the applications it serves. We can persist our data in the right form for our application.

Let’s dive into an example. Imagine I have an application dealing with information describing vehicle information including make, manufacturer, and category. My documents might look like this:

  {
    "_id" : ObjectId("528ba7691738025d11aab772"),
    "manufacturer" : "Porsche",
    "name" : "550 Spyder",
    "category" : [
        "sports",
        "touring",
        "coupé"
    ]
  }

It’s pretty clear what this document describes, and I could easily unmarshall this into an object native to my chosen language. Notice also that the “category” field is an array of strings. The ability to support arrays is an especially helpful feature; it simplifies the way my application interfaces with the database and helps me avoid a complicated database schema. Consider the complexity of supporting a repeating group in a properly normalized table structure. To represent the same data object in a single table row would like like this:

PK | Name | Manufacturer | categories 123 | “550 Spyder” | “Porsche” | “sports,touring,race,coupé”

In the real world, I know that vehicles belong to multiple categories, so I’ve chosen to represent that relation with a comma separated list: “sports,touring,race,coupé”. However, a comma separated list can be difficult to work with because:

  • I can’t use equality to match single embedded values
  • I must use regular expressions to find data

Which means that:

  • Aggregate functions are difficult
  • Updating a specific element is difficult

The first normal form was designed to avoid such problems by requiring that each relation contain a single atomic value. In other words, No repeating groups!

MongoDB’s JSON arrays do a far better job of matching the semantics of our categories than a set of rows since arrays are multi-element types, representing lists and sets by nature. As a document store, MongoDB’s multi value fields are a natural fit.

The Normal Forms and Why We Don’t Need Them All of this is nice, but a perfectly valid argument can be made to implement the first normal form properly in my relational database and all the trouble I have with repeating groups will be solved. To adhere to the first normal form, I could restructure my rows to look like this:

Vehicle_Id | Manufacturer | Name | Category


2253 | “Porsche” | “550 Spyder” | “sports” 2253 | “Porsche” | “550 Spyder” | “touring” 2253 | “Porsche” | “550 Spyder” | “coupé”

This looks great, but I’ve clearly created data redundancies that make me vulnerable to anomalies. For example, consider an update to a single row:

update Manufacturer = "Porsche AG" where Vehicle_Id = 2253 and Category = "coupé"

This update leads to the following data anomaly.

Vehicle_Id | Manufacturer | Name | Category


2253 | “Porsche” | “550 Spyder” | “sports” 2253 | “Porsche” | “550 Spyder” | “touring” 2253 | “Porsche AG” | “550 Spyder” | “coupé”

This happened because I failed to adhere to the second normal form. To avoid this problem I’d have to normalize my data into three separate tables.

Vehicle_Id | Name


2253 | “"550 Spyder”

Vehicle_Id | Category


2253 | “sports” 2253 | “touring” 2253 | “coupé”

Vehicle_Id | Manufacturer


2253 | “Porsche AG”

This structure prevents the anomaly in the first example, but introduces a new set of problems. The new schema has become much more complex, and I have lost any semantic understanding of my stored objects. By adhering to these two normal forms, I’ve also entered the realm of cross-table joins and the need to enforce ACID transactions across multiple tables. The enforcement of referential integrity in multi-row, multi-table operations will require concurrency controls, increasing overhead and affecting performance.

MongoDB avoids these complications through the use of a document data model. The JSON document forms a more natural representation on the data than the normalized schema we explored. It does this by allowing the embedding of related data via arrays and sub-documents within a single document - thus eliminating the need for JOINs, referential integrity and multi-record ACID transactions

The Tao of MongoDB { “_id” : ObjectId(“528ba7691738025d11aab772”), “manufacturer” : “Porsche AMG”, “name” : “550 Spyder”, “categories” : [ “sports”, “touring”, “coupé” ] }

The data anomalies are avoided since the denormalized JSON structure has a single field “manufacturer.” As an attribute of the single document, the “manufacturer” field can be modified atomically, and consistency is preserved across all relations in the document. Using JSON gives us less reason to adhere to the normal forms. This doesn’t mean that you are prohibited from normalizing data in MongoDB. Of course you can, but the the reasons why are going to be different than those of a traditional RDBMS. We’ll look into that in a subsequent post on schema design.

For a full-length presentation on moving from Relational Databases to MongoDB, please visit the MongoDB presentations page. You can learn how to use MongoDB with our online education courses here. And please contact MongoDB support with any questions about transitioning to or using the product. You can also download our new whitepaper which provides best practices and considerations for migrating from an RDBMS to MongoDB.

More Information

This post was updated in January 2015 to include additional resources and updated links.