Massive Number of Collections
Lauren Schaefer, Daniel Coupal • 6 min read • Published Feb 12, 2022 • Updated Oct 01, 2024
In the first post in this MongoDB Schema Design Anti-Patterns series, we discussed how we should avoid massive arrays when designing our schemas. But what about having a massive number of collections? Turns out, they're not great either. In this post, we'll examine why.
Let's begin by discussing why having a massive number of collections is an anti-pattern. If storage is relatively cheap, who cares how many collections you have?
Every collection in MongoDB automatically has an index on the _id field. While the size of this index is pretty small for empty or small collections, thousands of empty or unused indexes can begin to drain resources. Collections will typically have a few more indexes to support efficient queries. All of these indexes add up.
In general, we recommend limiting collections to 10,000 per replica set. When users begin exceeding 10,000 collections, they typically see decreases in performance.
To avoid this anti-pattern, examine your database and remove unnecessary collections. If you find that you have an increasing number of collections, consider remodeling your data so you have a consistent set of collections.
Let's take an example from the greatest TV show ever created: Parks and Recreation. Leslie is passionate about maintaining the parks she oversees, and, at one point, she takes it upon herself to remove the trash in the Pawnee River.
Let's say she wants to keep a minute-by-minute record of the water level and temperature of the Pawnee River, the Eagleton River, and the Wamapoke River, so she can look for trends. She could send her coworker Jerry to put 30 sensors in each river and then begin storing the sensor data in a MongoDB database.
One way to store the data would be to create a new collection every day to store sensor data. Each collection would contain documents that store information about one reading for one sensor.
```javascript
// 2020-05-01 collection
{
  "_id": ObjectId("5eac643e64faf3ff31d70d35"),
  "river": "PawneeRiver",
  "sensor": 1,
  "timestamp": "2020-05-01T00:00:00Z",
  "water-level": 61.56,
  "water-temperature": 72.1
},
{
  "_id": ObjectId("5eac643e64faf3ff31d70d36"),
  "river": "PawneeRiver",
  "sensor": 2,
  "timestamp": "2020-05-01T00:00:00Z",
  "water-level": 61.55,
  "water-temperature": 72.1
},
...
{
  "_id": ObjectId("5eac643e64faf3ff31d70dfc"),
  "river": "WamapokeRiver",
  "sensor": 90,
  "timestamp": "2020-05-01T23:59:00Z",
  "water-level": 72.03,
  "water-temperature": 64.1
}

// 2020-05-02 collection
{
  "_id": ObjectId("5eac644c64faf3ff31d90775"),
  "river": "PawneeRiver",
  "sensor": 1,
  "timestamp": "2020-05-02T00:00:00Z",
  "water-level": 63.12,
  "water-temperature": 72.8
},
{
  "_id": ObjectId("5eac644c64faf3ff31d90776"),
  "river": "PawneeRiver",
  "sensor": 2,
  "timestamp": "2020-05-02T00:00:00Z",
  "water-level": 63.11,
  "water-temperature": 72.7
},
...
{
  "_id": ObjectId("5eac644c64faf3ff31d9079c"),
  "river": "WamapokeRiver",
  "sensor": 90,
  "timestamp": "2020-05-02T23:59:00Z",
  "water-level": 71.58,
  "water-temperature": 66.2
}
```
Let's say that Leslie wants to be able to easily query on the `river` and `sensor` fields, so she creates an index on each field.

If Leslie were to store minute-by-minute data throughout all of 2019 and create two indexes in each collection (in addition to the default index on `_id`), her database would have the following stats:

- Database size: 5.2 GB
- Index size: 1.07 GB
- Total collections: 365
Each day she creates a new collection and two indexes. As Leslie continues to collect data and her number of collections exceeds 10,000, the performance of her database will decline.
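The arithmetic behind that decline is easy to sketch. The counts below come straight from the scenario: 90 sensors reporting once a minute, one new collection per day, and three indexes per collection (the default `_id` index plus the two Leslie created):

```python
SENSORS = 90                 # 30 sensors in each of 3 rivers
READINGS_PER_DAY = 24 * 60   # one reading per sensor per minute
INDEXES_PER_COLLECTION = 3   # _id plus the river and sensor indexes

def yearly_totals(days=365):
    """Totals after storing `days` days of data, one collection per day."""
    collections = days  # a new collection every day
    indexes = collections * INDEXES_PER_COLLECTION
    documents = days * SENSORS * READINGS_PER_DAY
    return collections, indexes, documents

collections, indexes, documents = yearly_totals()
print(collections)  # 365 collections after one year
print(indexes)      # 1095 indexes to maintain
print(documents)    # 47304000 reading documents
```

Even before the collection count itself becomes a problem, Leslie is maintaining over a thousand indexes after a single year, and every one of them consumes resources.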
Also, when Leslie wants to look for trends across weeks and months, she'll have a difficult time doing so since her data is spread across multiple collections.
Let's say Leslie realizes this isn't a great schema, so she decides to restructure her data. This time, she decides to keep all of her data in a single collection. She'll bucket her information, so she stores one hour's worth of information from one sensor in each document.
```javascript
// data collection
{
  "_id": "PawneeRiver-1-2019-05-01T00:00:00.000Z",
  "river": "PawneeRiver",
  "sensor": 1,
  "readings": [
    {
      "timestamp": "2019-05-01T00:00:00.000+00:00",
      "water-level": 61.56,
      "water-temperature": 72.1
    },
    {
      "timestamp": "2019-05-01T00:01:00.000+00:00",
      "water-level": 61.56,
      "water-temperature": 72.1
    },
    ...
    {
      "timestamp": "2019-05-01T00:59:00.000+00:00",
      "water-level": 61.55,
      "water-temperature": 72.0
    }
  ]
},
...
{
  "_id": "PawneeRiver-1-2019-05-02T00:00:00.000Z",
  "river": "PawneeRiver",
  "sensor": 1,
  "readings": [
    {
      "timestamp": "2019-05-02T00:00:00.000+00:00",
      "water-level": 63.12,
      "water-temperature": 72.8
    },
    {
      "timestamp": "2019-05-02T00:01:00.000+00:00",
      "water-level": 63.11,
      "water-temperature": 72.8
    },
    ...
    {
      "timestamp": "2019-05-02T00:59:00.000+00:00",
      "water-level": 63.10,
      "water-temperature": 72.7
    }
  ]
}
...
```
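Bucketed documents like these can be assembled by grouping the flat per-minute readings by sensor and hour. The sketch below is plain Python, with no database required; the helper names `bucket_id` and `bucket_readings` are illustrative, not from any MongoDB API. Against a live collection, you would more typically upsert each reading with a `$push` keyed on the bucket's `_id`:

```python
def bucket_id(river, sensor, timestamp):
    """Bucket key: river, sensor, and the reading's hour, e.g.
    'PawneeRiver-1-2019-05-01T00:00:00.000Z'."""
    hour = timestamp[:13] + ":00:00.000Z"  # truncate the ISO string to the hour
    return f"{river}-{sensor}-{hour}"

def bucket_readings(readings):
    """Group flat minute-by-minute readings into hourly bucket documents."""
    buckets = {}
    for r in readings:
        _id = bucket_id(r["river"], r["sensor"], r["timestamp"])
        doc = buckets.setdefault(_id, {
            "_id": _id,
            "river": r["river"],
            "sensor": r["sensor"],
            "readings": [],
        })
        doc["readings"].append({
            "timestamp": r["timestamp"],
            "water-level": r["water-level"],
            "water-temperature": r["water-temperature"],
        })
    return list(buckets.values())

docs = bucket_readings([
    {"river": "PawneeRiver", "sensor": 1,
     "timestamp": "2019-05-01T00:00:00.000Z",
     "water-level": 61.56, "water-temperature": 72.1},
    {"river": "PawneeRiver", "sensor": 1,
     "timestamp": "2019-05-01T00:01:00.000Z",
     "water-level": 61.56, "water-temperature": 72.1},
])
print(docs[0]["_id"])            # PawneeRiver-1-2019-05-01T00:00:00.000Z
print(len(docs[0]["readings"]))  # 2
```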
Leslie wants to query on the `river` and `sensor` fields, so she creates two new indexes for this collection.

If Leslie were to store minute-by-minute data for all of 2019 using this updated schema, her database would have the following stats:

- Database size: 3.07 GB
- Index size: 27.45 MB
- Total collections: 1
By restructuring her data, she sees a massive reduction in her index size (from 1.07 GB down to 27.45 MB!). She now has a single collection with three indexes.

With this new schema, she can more easily look for trends in her data because it's stored in a single collection. Also, she's using the default index on `_id` to her advantage by storing the hour the water-level data was gathered in this field. If she wants to query by hour, she already has an index that lets her do so efficiently.

For more information on modeling time-series data in MongoDB, see Building with Patterns: The Bucket Pattern.
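Because each `_id` starts with a fixed `river-sensor-` prefix and ends in a fixed-width ISO-8601 hour, a simple string range query can fetch all of one sensor's buckets for a day. Here is a sketch of computing those bounds in Python (the `day_bounds` helper is illustrative; against a real collection the query would be something like `find({"_id": {"$gte": lo, "$lt": hi}})`):

```python
def day_bounds(river, sensor, day):
    """Lexicographic _id bounds covering all hourly buckets for one
    sensor on one day (day is 'YYYY-MM-DD')."""
    prefix = f"{river}-{sensor}-"
    lo = f"{prefix}{day}T00:00:00.000Z"
    # '~' sorts after every character used in the timestamps, so it makes
    # a convenient exclusive upper bound; the first instant of the next
    # day would work just as well
    hi = f"{prefix}{day}~"
    return lo, hi

lo, hi = day_bounds("PawneeRiver", 1, "2019-05-01")
# every bucket _id for sensor 1 on 2019-05-01 sorts inside [lo, hi)
print(lo <= "PawneeRiver-1-2019-05-01T13:00:00.000Z" < hi)   # True
# sensor 10's buckets do not leak into the range
print(lo <= "PawneeRiver-10-2019-05-01T00:00:00.000Z" < hi)  # False
```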
In the example above, Leslie was able to remove unnecessary collections by changing how she stored her data.
Sometimes, you won't immediately know what collections are unnecessary, so you'll have to do some investigating yourself. If you find an empty collection, you can drop it. If you find a collection whose size is made up mostly of indexes, you can probably move that data into another collection and drop the original. You might be able to use $merge to move data from one collection to another.
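For instance, each daily collection could be folded into one consolidated collection with an aggregation ending in a `$merge` stage. A pymongo-style sketch (the collection names are illustrative, and running it requires a live deployment, so only the pipeline itself is built and inspected here):

```python
def merge_pipeline(target="data"):
    """Aggregation pipeline that copies every document from the source
    collection into `target`, inserting new _ids and replacing clashes."""
    return [
        {"$merge": {
            "into": target,
            "on": "_id",
            "whenMatched": "replace",
            "whenNotMatched": "insert",
        }},
    ]

# Against a live deployment, for each daily collection, something like:
#   db["2019-01-01"].aggregate(merge_pipeline("data"))
#   db["2019-01-01"].drop()
pipeline = merge_pipeline()
print(pipeline[0]["$merge"]["into"])  # data
```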
Below are a few ways you can begin your investigation.
If your database is hosted in Atlas, navigate to the Atlas Data Explorer. The Data Explorer allows you to browse a list of your databases and collections. Additionally, you can get stats on your database including the database size, index size, and number of collections.
If you are using an M10 cluster or larger on Atlas, you can also use the Real-Time Performance Panel to check if your application is actively using a collection you're considering dropping.
Regardless of where your MongoDB database is hosted, you can use MongoDB Compass, MongoDB's desktop GUI. Similar to the Data Explorer, you can browse your databases and collections so you can check for unused collections. You can also get stats at the database and collection levels.
If you prefer working in a terminal instead of a GUI, connect to your database using the MongoDB Shell (mongosh).
To see a list of collections, run `db.getCollectionNames()`. Output like the following will be displayed:

```javascript
[
  "2019-01-01",
  "2019-01-02",
  "2019-01-03",
  "2019-01-04",
  "2019-01-05",
  ...
]
```
To retrieve stats about your database, run `db.stats()`. Output like the following will be displayed:

```javascript
{
  "db" : "riverstats",
  "collections" : 365,
  "views" : 0,
  "objects" : 47304000,
  "avgObjSize" : 118,
  "dataSize" : 5581872000,
  "storageSize" : 1249677312,
  "numExtents" : 0,
  "indexes" : 1095,
  "indexSize" : 1145790464,
  "scaleFactor" : 1,
  "fsUsedSize" : 5312217088,
  "fsTotalSize" : 10726932480,
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1588795184, 3),
    "signature" : {
      "hash" : BinData(0,"orka3bVeAiwlIGdbVoP+Fj6N01s="),
      "keyId" : NumberLong("6821929184550453250")
    }
  },
  "operationTime" : Timestamp(1588795184, 3)
}
```
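These numbers are internally consistent with the daily-collection schema, which makes them a useful sanity check when you audit your own stats. A quick sketch of the arithmetic:

```python
# Values from the db.stats() output above
collections = 365
objects = 47_304_000
avg_obj_size = 118  # bytes

# 90 sensors reporting once a minute for a 365-day year
assert objects == 90 * 24 * 60 * 365

# one _id index plus two secondary indexes per daily collection
assert collections * 3 == 1095  # matches "indexes"

# objects * avgObjSize reproduces "dataSize"
print(objects * avg_obj_size)  # 5581872000
```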
You can also run `db.collection.stats()` to see information about a particular collection.

Be mindful of creating a massive number of collections, as each collection likely has a few indexes associated with it. An excessive number of collections and their associated indexes can drain resources and impact your database's performance. In general, try to limit your replica set to 10,000 collections.
Come back soon for the next post in this anti-patterns series!
When you're ready to build a schema in MongoDB, check out MongoDB Atlas, MongoDB's fully managed database as a service. Atlas is the easiest way to get started with MongoDB. With a forever-free tier, you're on your way to realizing the full value of MongoDB.
Check out the following resources for more information: