Schema Design Anti-Patterns - Part 1
00:00:00 Introduction to Schema Design Anti-Patterns
- Introduction to the importance of schema design patterns.
- The tendency of developers to skip best practices.
- MongoDB's role in identifying schema design anti-patterns.

00:01:49 The Importance of Schema Design Patterns
- Schema design patterns help developers be successful.
- The consequences of not considering schema design from the start.

00:02:17 MongoDB Atlas and Schema Anti-Patterns
- MongoDB Atlas features that help identify anti-patterns.
- The goal to educate developers on recognizing and preventing anti-patterns.

00:03:57 Understanding Schema Design Anti-Patterns
- Explanation of schema design anti-patterns.
- The video series will discuss six anti-patterns, with this video covering the first two.

00:07:57 Massive Arrays Anti-Pattern
- The problems with massive, unbounded arrays.
- How massive arrays can exceed document size limits and reduce index performance.

00:10:07 Modeling Data to Avoid Massive Arrays
- Different modeling strategies to avoid massive arrays.
- Examples of embedding and referencing data.

00:14:14 Aggregation Pipeline and $lookup
- How to use MongoDB's aggregation pipeline and $lookup to join data from multiple collections.

00:17:57 Massive Number of Collections Anti-Pattern
- The drawbacks of having a massive number of collections.
- How excessive collections and indexes can impact database performance.

00:20:39 Restructuring Data to Reduce Collections
- An example of restructuring data to reduce the number of collections.
- The benefits of a more efficient schema.

00:24:02 Tools for Investigating Collections
- Using MongoDB Atlas, Compass, and the MongoDB Shell to investigate and manage collections.

00:25:58 Summary and What's Next
- Recap of the massive arrays and massive number of collections anti-patterns.
- Teaser for the next video in the series, which will discuss additional anti-patterns.

The primary focus of the video is to educate viewers on MongoDB schema design anti-patterns, specifically the issues with massive arrays and a massive number of collections, and how to restructure data to avoid these pitfalls.
🔑 Key Points
- Design patterns provide best practices and a common language for architects.
- MongoDB schema design patterns are crucial for successful application planning and iteration.
- Developers sometimes neglect schema design best practices, leading to inefficient and problematic code.
- MongoDB Atlas can identify anti-patterns in your database.
- The video series will cover six common schema design anti-patterns, with this video focusing on the first two: massive arrays and a massive number of collections.
🔗 Related Links
- https://mdb.link/free-8CZs-0it9r4
- https://mdb.link/community-8CZs-0it9r4
- https://bit.ly/33i3wgU
- https://bit.ly/3cIgEiy
- https://bit.ly/3jieJns
- https://bit.ly/3097PJy
- https://bit.ly/39DpvAv
- https://bit.ly/33g99MR
- https://bit.ly/3bpg1Z1
- https://bit.ly/2LjtNBZ
- https://bit.ly/3fH87gR
- https://bit.ly/3fEaIsd
- https://bit.ly/2SY9w90
- https://bit.ly/3bn9bDv
- https://bit.ly/3fHoqdJ
Full Video Transcript
do you remember this book it's called design patterns it's written by the gang of four or maybe you're a little hipper maybe you're more familiar with this book it's called head first design patterns software developers love design patterns why well design patterns give us best practices in a common language as we architect applications when modeling data to be stored in a mongodb database we talk about schema design patterns these schema design patterns help developers be successful as they plan and iterate on their schema designs but here's the thing sometimes developers jump right into building their apps without thinking about schema design best practices and without even realizing it they've coded themselves into a big mess i know i'm guilty of this sometimes i get so focused on creating the mvp that i don't think about the bigger picture i don't worry about what will happen as my app evolves or my requirements change and my database needs to scale i just care about doing the minimum work to get something working and that's why we at mongodb have identified six common schema design anti-patterns our goal is for you to learn about and prevent schema design anti-patterns so as you're rapidly coding and trying to get stuff done you don't accidentally code a big mess we want you to be able to identify when you're starting to go down a dangerous path that will lead to inefficient queries now unfortunately you're not always going to recognize when you're going down the wrong path that's why the mongodb atlas team added a feature that will identify anti-patterns in your database for you if you're not familiar with mongodb atlas it's mongodb's fully managed database as a service now i like using it because it has a generous free tier and also because frankly i don't like managing database servers myself atlas now identifies schema anti-patterns in your databases so here you can see that i should be using case insensitive indexes when i'm not not so great but atlas is 
helping me identify the anti-pattern and has tips on how i can fix the issue in this video i want to help you understand the schema design anti-patterns so that if atlas shows you a warning like this you understand what it means even better i want to help you understand the anti-patterns so you can prevent them from occurring in your database in the first place in this video series i'll be discussing six mongodb schema design anti-patterns today i'll be discussing the first two massive arrays and massive number of collections in part two i'll discuss the next three those are unnecessary indexes bloated documents and case insensitive queries without case insensitive indexes and finally in part three i'll discuss the final anti-pattern separating data that is accessed together i'll also summarize all of the anti-patterns in that video each video in this series will be about 20 minutes long now if you're watching this series i'm going to assume you're comfortable with introductory mongodb terminology like document collection and database i'm also going to assume you're comfortable with the basics of how to store data using the document model if you're not familiar with those terms and concepts or you're brand new to mongodb you may want to check out the free mongodb university course m001 mongodb basics and remember all mongodb university courses are free now if you have a background in sql you may want to check out m100 mongodb for sql pros now let's kick things off with the first anti-pattern massive arrays one of the rules of thumb when modeling data in mongodb is data that is accessed together should be stored together if you'll be retrieving or updating data together frequently you should probably just store it together now data is commonly stored together by embedding related information in sub documents or arrays now sometimes developers take this too far and embed massive amounts of information in a single unbounded array and that's where we see the massive 
arrays anti-pattern now we see two problems commonly pop up when developers start creating these massive unbounded arrays first up massive arrays can cause a document to exceed the 16 megabyte document size limit second index performance on arrays decreases as array size increases now we want our index performance to be excellent so that our queries will be blazing fast now let's take a look at an example from the greatest tv show ever created parks and recreation let's say we want to store information about government employees like these beautiful people you see in this picture from parks and rec and the buildings where they work let's explore some different ways to model this data let's begin by considering embedding information about the employees inside of the building documents here we can see a document for city hall inside of the document we have an array named employees that contains a sub document for leslie and a sub-document for ron now the employees array is unbounded as we begin storing information about all of the employees who work in city hall the employees array will become massive potentially sending us over the 16 megabyte document maximum additionally if we create an index on the employees array the performance of that index will decrease as we add more and more employees so this is an example of the massive arrays anti-pattern so how can we fix this let's discuss a few options instead of embedding the employees in the buildings document we could flip the model and instead embed the buildings in the employees documents if we take a look at the documents for leslie and ron we can see we are repeating the information about city hall in the document for each city hall employee if we are frequently displaying information about an employee and their building in our application together this model probably makes sense the disadvantage with this approach is we have a lot of data duplication storage is cheap so data duplication isn't necessarily a 
problem from a storage cost perspective however every time we need to update information about city hall we'll need to update the document for every employee who works there if we take a look at the information we're currently storing about the buildings updates will likely be very infrequent so this approach may be a good one let's consider the case where information about employees and their building do not need to be frequently displayed or updated together then we may want to separate the information into two collections and use references to link them here we are creating a manual reference between the employees building id which in this case is city hall and the building's underscore id which as you can see is also city hall here we have completely separated our data we have eliminated massive arrays and we have no data duplication the drawback is that if we need to retrieve information about an employee and their building together we'll need to use dollar lookup to join the data together dollar lookup is a stage available as part of the aggregation pipeline as we saw when we duplicated data previously we should be mindful of duplicating data that will be frequently updated now in this particular case the name of the building and the state the building is in are very unlikely to change so this solution works now i have a feeling some of you are not familiar with the aggregation pipeline and dollar lookup so let me give you a high level look at how it works we can use dollar lookup to join data from the buildings and employees collection together we'll begin by accessing the buildings collection and aggregating on it we're going to create a pipeline with only one stage in this example but pipelines can have multiple stages we'll create a stage for dollar lookup we want to look up documents from the employees collection the local field is the field in the buildings collection that i want to join on in this case i want to join on underscore id the foreign field 
is the field in the employees collection that i want to join on in this case i want to join on building underscore id the last thing i need to configure for dollar lookup is the name of the array where the information from the employees collection will be stored i'll call it employees now on the right you can see a document that was returned as a result from the aggregation in the document you can see information from the buildings document and you can also see an array of documents from the employees collection that have the same building id so dollar lookup can be pretty handy when you need to combine information from more than one collection now keep in mind that dollar lookup operations can be expensive so it's important to consider how often you'll need to perform dollar lookup if you choose this option if we find ourselves frequently using dollar lookup another option is to use the extended reference pattern the extended reference pattern is a mixture of the previous two approaches where we duplicate some but not all of the data in the two collections we only duplicate the data that is frequently accessed together for example let's say our application has a user profile page that displays information about the user as well as the name of the building and the state where they work we could embed some of the building information like the name and state fields in the employee documents and then leave the complete building information in separate documents here you can see we've added the building's name and state in each of the employee documents the building documents contain the complete building information so we can query for that as needed now as we saw when we duplicated data previously we should be mindful of duplicating data that will frequently be updated in this particular case the name of the building and the state the building is in are very unlikely to change so this solution works so those are a few different options for how to model employee and 
building data to avoid massive arrays in order to determine which model is best we really need to understand how the data will be used so let's summarize the massive arrays anti-pattern do store information together that you'll be frequently querying together remember data that is accessed together should be stored together don't store information in massive unbounded arrays that will continue to grow over time as is true with all mongodb schema design patterns and anti-patterns carefully consider your use case in order to determine what schema design is best for you alright so we've already discussed that massive arrays are bad but what about having a massive number of collections turns out they're not particularly great either let's begin by discussing why having a massive number of collections is an anti-pattern if storage is cheap who cares how many collections you have well empty and unused indexes drain resources why am i talking about indexes well every collection in mongodb automatically has an index on the underscore id field while the size of this index is pretty small for empty or small collections thousands of empty or unused indexes can begin to drain resources collections will typically have a few more indexes to support efficient queries all of these indexes add up wiredtiger is mongodb's default storage engine wiredtiger performance decreases with an excessive number of collections and indexes wiredtiger stores a file for each collection and a file for each index wiredtiger will open all files upon startup so performance will decrease when an excessive number of collections and indexes exist in general we recommend limiting each replica set to 10 000 collections when users begin exceeding 10 000 collections they typically see decreases in performance let's take a look at another example now leslie is the main character in parks and rec you can see her on the left of this picture she is incredibly passionate about maintaining the park she oversees 
and at one point she takes it upon herself to remove the trash in the pawnee river let's say she wants to keep a minute by minute record of the water level and temperature of the pawnee river the eagleton river and the wamapoke river so she can look for trends she could send her co-worker jerry to put 30 sensors in each river and then begin storing the sensor data in a mongodb database let's take a look at one way to store this data i'm going to jump over to atlas so we can see this in action one way to store the data would be to create a new collection every day to store sensor data each collection would contain documents that store information about one reading for one sensor let's take a look at her indexes let's say that leslie wants to be able to easily query on the river and sensor fields so she creates an index on those fields for every collection let's take a look at her database stats leslie has stored information for every minute of 2019 in this database her database is 5.2 gigabytes her index size is 1.07 gigabytes and she has 365 total collections each day she creates a new collection and two indexes as leslie continues to collect data and her number of collections exceeds 10 000 the performance of her database will decline also when leslie wants to look for trends across weeks and months she'll have a difficult time doing so since her data is spread across multiple collections so leslie realizes this isn't a great schema so she decides to restructure her data let's take a look at that this time she decides to keep all of her data in a single collection she buckets her information so she stores one hour's worth of information from one sensor in each document let's take a look at her indexes leslie wants to query on the river and sensor fields so she creates two indexes for this collection one for river and one for sensor and let's take a look at her database stats her new database is 3.07 gigabytes her index size is 27.45 megabytes and she has only one 
collection let's compare those stats side by side with her initial schema by restructuring her data leslie sees a massive reduction in her index size she went from 1.07 gigabytes initially to just 27.45 megabytes she now has a single collection with three indexes with this new schema she can more easily look for trends in her data because it's stored in a single collection also she's using the default index on underscore id to her advantage by storing the hour the water level data was gathered in this field mongodb automatically creates an index on the underscore id field if she wants to query by hour she can use the default index on underscore id to do so now in this example leslie was able to remove unnecessary collections by changing how she stored her data sometimes you won't immediately know what collections are unnecessary so you'll have to do some investigating yourself so how do you know if you should drop a collection well if you find an empty collection you can drop it also if you find a collection whose size is made up mostly of indexes you can probably move that data into another collection and then drop the original you might be able to use dollar merge to move the data from one collection to another let's talk about what tools you can use to do your investigating let's start in atlas because that's where we've been working already we've already seen we can use the atlas data explorer to do some investigating we can check out the database size the index size and the number of collections now let's do some investigating in compass compass is mongodb's gui you can use compass with any mongodb database regardless of where it is hosted now similar to the atlas data explorer you can browse stats about your database you can also select a database to see details about collections in that database so here you can see things like the number of documents the average document size the total document size the number of indexes and the total index size now i know 
some of you will always prefer working in the command line over a gui so let me show you one more way to investigate let's use the mongo shell so let me connect to the river stats database and input my password all right we're all connected so i can run something like db.getCollectionNames() to get a list of the collections in this database i can retrieve stats about my database by running db.stats here i can see stats just like i did in atlas and compass things like storage size number of collections and number of indexes i can run db.getCollection with the name of my collection in this case i'll do 201901 and we'll call dot stats and here i get a ton of detailed stats about a collection so you can use any of these tools atlas compass or the mongo shell to do your investigating let's summarize this massive number of collections anti-pattern don't create a massive number of collections as each collection likely has a few indexes associated with it an excessive number of collections and their associated indexes can drain resources and impact your database's performance in general try to limit your replica set to 10 000 collections do remove empty and unused collections all right so to summarize what i've discussed today don't create massive unbounded arrays and don't create a massive number of collections be on the lookout for part two in this series where i'll discuss the next three schema design anti-patterns see you soon
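The building and employee documents described in the transcript can be sketched as plain objects. This is an illustrative sketch, not the exact documents shown on screen; field names like `building_id` follow the video's narration, while values such as the employee names and the state are assumptions for the example:

```javascript
// Anti-pattern: an unbounded `employees` array embedded in the building
// document. As more employees are added, the document grows toward the
// 16 MB BSON size limit, and any index on `employees` degrades.
const cityHallEmbedded = {
  _id: "city_hall",
  name: "City Hall",
  state: "IN", // assumed value for illustration
  employees: [
    { first: "Leslie", last: "Knope" },
    { first: "Ron", last: "Swanson" },
    // ...potentially thousands more
  ],
};

// One alternative from the video: two collections linked by a manual
// reference. Each employee stores the _id of their building, so no
// document grows without bound and there is no data duplication.
const building = { _id: "city_hall", name: "City Hall", state: "IN" };
const employees = [
  { _id: 1, first: "Leslie", last: "Knope", building_id: "city_hall" },
  { _id: 2, first: "Ron", last: "Swanson", building_id: "city_hall" },
];
```

The trade-off the transcript describes: the embedded form keeps data that is accessed together in one document, while the referenced form requires a $lookup to display an employee with their building.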
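The $lookup walkthrough in the transcript names each option explicitly: join the buildings collection to the employees collection on `_id` / `building_id` and store the matches in an array called `employees`. A minimal sketch of that stage, plus a tiny in-memory imitation of what the join produces (the `lookupJoin` helper is purely illustrative, not a MongoDB API):

```javascript
// The single-stage pipeline described in the video.
const pipeline = [
  {
    $lookup: {
      from: "employees",            // collection to join with
      localField: "_id",            // field in the buildings collection
      foreignField: "building_id",  // field in the employees collection
      as: "employees",              // name of the output array
    },
  },
];
// In the mongo shell you would run something like:
// db.buildings.aggregate(pipeline)

// Illustrative only: an in-memory version of the join $lookup performs,
// to show the shape of the result (each building gains an `employees`
// array of matching documents).
function lookupJoin(buildings, employees) {
  return buildings.map((b) => ({
    ...b,
    employees: employees.filter((e) => e.building_id === b._id),
  }));
}
```

As the transcript notes, $lookup can be expensive, so how often the join runs should factor into choosing this model over embedding or the extended reference pattern.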
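Leslie's restructured schema buckets one hour of readings from one sensor into a single document, and stores the hour in a way that lets the default `_id` index answer time-based queries. A minimal sketch of that bucketing step, with assumed field names (`river`, `sensor`, `ts`, `level`, `temp`) since the video does not spell out the exact schema:

```javascript
// Group per-minute sensor readings into one document per sensor per hour.
// Embedding the hour in _id lets the automatic _id index support
// queries by hour, as described in the video.
function bucketReadings(readings) {
  const buckets = new Map();
  for (const r of readings) {
    const hour = r.ts.slice(0, 13); // e.g. "2019-01-01T00"
    const key = `${r.sensor}-${hour}`;
    if (!buckets.has(key)) {
      buckets.set(key, {
        _id: key,
        river: r.river,
        sensor: r.sensor,
        readings: [],
      });
    }
    buckets.get(key).readings.push({ minute: r.ts, level: r.level, temp: r.temp });
  }
  return [...buckets.values()];
}
```

Each bucket's `readings` array is bounded at 60 entries (one hour of minutes), so this avoids both the massive-arrays anti-pattern and the one-collection-per-day anti-pattern at the same time.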