Operational Factors and Data Models
Modeling application data for MongoDB should consider various operational factors that impact the performance of MongoDB. For instance, different data models can allow for more efficient queries, increase the throughput of insert and update operations, or distribute activity to a sharded cluster more effectively.
When developing a data model, analyze all of your application's read and write operations in conjunction with the following considerations.
Atomicity
In MongoDB, a write operation is atomic on the level of a single document, even if the operation modifies multiple embedded documents within a single document. When a single write operation modifies multiple documents (e.g. db.collection.updateMany()), the modification of each document is atomic, but the operation as a whole is not atomic.
Embedded Data Model
The embedded data model combines all related data in a single document instead of normalizing across multiple documents and collections. This data model facilitates atomic operations.
See Model Data for Atomic Operations for an example data model that provides atomic updates for a single document.
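As a minimal sketch of why embedding helps here, the JavaScript below builds the filter and update documents for one single-document write that changes a counter and an embedded array together. The books collection, the field names (available, checkout), and the helper buildCheckout are hypothetical and chosen only for illustration; $gt, $inc, and $push are standard MongoDB operators.

```javascript
// Hypothetical document: stock count and checkout history are embedded
// in one book document, so a single write can change both together.
const book = {
  _id: 123456789,
  title: "Example Book",
  available: 3,
  checkout: [{ by: "joe", date: new Date("2012-10-15") }],
};

// Build the arguments for one single-document update (hypothetical helper).
function buildCheckout(bookId, user, date) {
  return {
    // Match only while at least one copy is still available.
    filter: { _id: bookId, available: { $gt: 0 } },
    // Both changes target the same document, so they apply atomically.
    update: {
      $inc: { available: -1 },
      $push: { checkout: { by: user, date: date } },
    },
  };
}

const op = buildCheckout(book._id, "abc", new Date("2024-01-01"));
// In mongosh, assuming a `books` collection exists:
//   db.books.updateOne(op.filter, op.update)
console.log(JSON.stringify(op, null, 2));
```

Because the counter and the checkout array live in the same document, no reader can observe one change without the other. With a referenced model, the same change would need two writes or a transaction.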
Multi-Document Transaction
For data models that store references between related pieces of data, the application must issue separate read and write operations to retrieve and modify these related pieces of data.
For situations that require atomicity of reads and writes to multiple documents (in a single or multiple collections), MongoDB supports distributed transactions, including transactions on replica sets and sharded clusters.
For more information, see Transactions.
Important
In most cases, a distributed transaction incurs a greater performance cost than single-document writes, and the availability of distributed transactions should not be a replacement for effective schema design. For many scenarios, the denormalized data model (embedded documents and arrays) will continue to be optimal for your data and use cases. That is, for many scenarios, modeling your data appropriately will minimize the need for distributed transactions.
For additional transactions usage considerations (such as runtime limit and oplog size limit), see also Production Considerations.
Sharding
MongoDB uses sharding to provide horizontal scaling. Sharded clusters support deployments with large data sets and high-throughput operations. Sharding allows users to partition a collection within a database to distribute the collection's documents across a number of mongod instances or shards.
To distribute data and application traffic in a sharded collection, MongoDB uses the shard key. Selecting the proper shard key has significant implications for performance, and can enable or prevent query isolation and increased write capacity. It is important to consider carefully the field or fields to use as the shard key.
See Sharding and Shard Keys for more information.
Indexes
Use indexes to improve performance for common queries. Build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field.
As you create indexes, consider the following behaviors of indexes:
Each index requires at least 8 kB of data space.
Adding an index has some negative performance impact for write operations. For collections with a high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
Collections with a high read-to-write ratio often benefit from additional indexes. Indexes do not affect un-indexed read operations.
When active, each index consumes disk space and memory. This usage can be significant and should be tracked for capacity planning, especially for concerns over working set size.
See Indexing Strategies for more information on indexes as well as Analyze Query Performance. Additionally, the MongoDB database profiler may help identify inefficient queries.
Large Number of Collections
In certain situations, you might choose to store related information in several collections rather than in a single collection.
Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:
{ log: "dev", ts: ..., info: ... }
{ log: "debug", ts: ..., info: ... }
If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs_dev and logs_debug. The logs_dev collection would contain only the documents related to the dev environment.
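The split into per-type collections can be sketched in plain JavaScript as follows. The helpers collectionNameFor and partitionLogs and the sample values are hypothetical; the logs_ prefix follows the naming used above.

```javascript
// Map a log document to its per-type collection name, e.g. "logs_dev".
function collectionNameFor(doc) {
  return `logs_${doc.log}`;
}

// Group documents by the collection they would be stored in.
function partitionLogs(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const name = collectionNameFor(doc);
    if (!groups.has(name)) groups.set(name, []);
    groups.get(name).push(doc);
  }
  return groups;
}

// Sample values are made up for illustration.
const logs = [
  { log: "dev", ts: 1, info: "started" },
  { log: "debug", ts: 2, info: "cache miss" },
  { log: "dev", ts: 3, info: "stopped" },
];

const groups = partitionLogs(logs);
// In mongosh, each group could then be written with:
//   db.getCollection(name).insertMany(docs)
for (const [name, docs] of groups) {
  console.log(name, docs.length);
}
```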
Generally, having a large number of collections has no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing.
When using models that have a large number of collections, consider the following behaviors:
Each collection has a certain minimum overhead of a few kilobytes.
Each index, including the index on _id, requires at least 8 kB of data space.
For each database, a single namespace file (i.e. <database>.ns) stores all meta-data for that database, and each index and collection has its own entry in the namespace file. See the namespace length limits for specific limitations.
Collection Contains Large Number of Small Documents
You should consider embedding for performance reasons if you have a collection with a large number of small documents. If you can group these small documents by some logical relationship and you frequently retrieve the documents by this grouping, you might consider "rolling-up" the small documents into larger documents that contain an array of embedded documents.
"Rolling up" these small documents into logical groupings means that queries to retrieve a group of documents involve sequential reads and fewer random disk accesses. Additionally, "rolling up" documents and moving common fields to the larger document benefit the index on these fields. There would be fewer copies of the common fields and there would be fewer associated key entries in the corresponding index. See Indexes for more information on indexes.
However, if you often only need to retrieve a subset of the documents within the group, then "rolling-up" the documents may not provide better performance. Furthermore, if small, separate documents represent the natural model for the data, you should maintain that model.
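A minimal sketch of "rolling up", assuming hypothetical sensor readings grouped by a sensor field: the shared field moves to the parent document and the remaining fields become entries in an embedded array. The rollUp helper and all field names are illustrative only.

```javascript
// Roll small documents up into one larger document per group.
// `groupField` is the common field moved to the parent document;
// `arrayField` names the array that holds the embedded documents.
function rollUp(docs, groupField, arrayField) {
  const parents = new Map();
  for (const doc of docs) {
    const key = doc[groupField];
    if (!parents.has(key)) {
      parents.set(key, { [groupField]: key, [arrayField]: [] });
    }
    // Copy every field except the common one into the embedded document.
    const { [groupField]: _omitted, ...rest } = doc;
    parents.get(key)[arrayField].push(rest);
  }
  return [...parents.values()];
}

// Hypothetical small documents.
const readings = [
  { sensor: "a", t: 1, v: 20.5 },
  { sensor: "a", t: 2, v: 20.7 },
  { sensor: "b", t: 1, v: 18.2 },
];

const rolled = rollUp(readings, "sensor", "samples");
console.log(JSON.stringify(rolled, null, 2));
```

Three small documents become two larger ones, and the sensor value is stored once per group instead of once per reading, which is the source of the index savings described above.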
Storage Optimization for Small Documents
Each MongoDB document contains a certain amount of overhead. This overhead is normally insignificant but becomes significant if all documents are just a few bytes, as might be the case if the documents in your collection only have one or two fields.
Consider the following suggestions and strategies for optimizing storage utilization for these collections:
Use the _id field explicitly.
MongoDB clients automatically add an _id field to each document and generate a unique 12-byte ObjectId for the _id field. Furthermore, MongoDB always indexes the _id field. For smaller documents this may account for a significant amount of space.
To optimize storage use, users can specify a value for the _id field explicitly when inserting documents into the collection. This strategy allows applications to store a value in the _id field that would have occupied space in another portion of the document.
You can store any value in the _id field, but because this value serves as a primary key for documents in the collection, it must uniquely identify them. If the field's value is not unique, then it cannot serve as a primary key as there would be collisions in the collection.
Use shorter field names.
Note
Shortening field names reduces expressiveness and does not provide considerable benefit for larger documents and where document overhead is not of significant concern. Shorter field names do not reduce the size of indexes, because indexes have a predefined structure.
In general, it is not necessary to use short field names.
MongoDB stores all field names in every document. For most documents, this represents a small fraction of the space used by a document; however, for small documents the field names may represent a proportionally large amount of space. Consider a collection of small documents that resemble the following:
{ last_name : "Smith", best_score: 3.9 }
If you shorten the field named last_name to lname and the field named best_score to score, as follows, you could save 9 bytes per document.
{ lname : "Smith", score : 3.9 }
Embed documents.
In some cases you may want to embed documents in other documents and save on the per-document overhead. See Collection Contains Large Number of Small Documents.
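The 9-byte figure in the field-name example above can be checked with a short calculation. This sketch assumes the only per-document difference is the encoded length of the field names themselves, since the values are unchanged; the helper names are hypothetical.

```javascript
const encoder = new TextEncoder();

// Number of bytes a field name occupies when encoded as UTF-8.
function nameBytes(name) {
  return encoder.encode(name).length;
}

// Total bytes saved per document for a map of { oldName: newName }.
function bytesSaved(renames) {
  let saved = 0;
  for (const [oldName, newName] of Object.entries(renames)) {
    saved += nameBytes(oldName) - nameBytes(newName);
  }
  return saved;
}

const saved = bytesSaved({ last_name: "lname", best_score: "score" });
console.log(saved); // 9 (4 bytes from last_name, 5 bytes from best_score)
```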
Data Lifecycle Management
Data modeling decisions should take data lifecycle management into consideration.
The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time.
Additionally, if your application only uses recently inserted documents, consider Capped Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and efficiently support operations that insert and read documents based on insertion order.
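As a rough illustration of the TTL rule, the sketch below defines a hypothetical TTL index on a createdAt field and models when a document becomes eligible for removal. The events collection, the createdAt field, the one-hour value, and the isExpired helper are assumptions for illustration; expireAfterSeconds is the index option that enables TTL behavior.

```javascript
// Hypothetical TTL index definition: expire one hour after `createdAt`.
const ttlIndex = {
  keys: { createdAt: 1 },
  options: { expireAfterSeconds: 3600 },
};
// In mongosh, assuming an `events` collection:
//   db.events.createIndex(ttlIndex.keys, ttlIndex.options)

// Model of the expiry rule: a document becomes eligible for removal once
// its indexed date plus expireAfterSeconds has passed.
function isExpired(doc, now, index = ttlIndex) {
  const field = Object.keys(index.keys)[0];
  const expiresAtMs =
    doc[field].getTime() + index.options.expireAfterSeconds * 1000;
  return now.getTime() >= expiresAtMs;
}

const doc = { _id: 1, createdAt: new Date("2024-01-01T00:00:00Z") };
console.log(isExpired(doc, new Date("2024-01-01T00:30:00Z"))); // false
console.log(isExpired(doc, new Date("2024-01-01T01:00:01Z"))); // true
```

The function only models eligibility. In a running deployment, a background task performs the actual removal, so expired documents may remain for a short time after they become eligible.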