Additional important metrics
Below, we’ve identified a number of important metrics used to measure performance, grouped into instance status and health, cluster operation and connection metrics, instance hardware metrics, and replication metrics.
Instance status and health
The status of a MongoDB server process can be an immediate indication of whether we need to drill down into its activity or health. A process that is unresponsive or does not respond to commands should be investigated immediately.
- Monitor with MongoDB Atlas: Cluster health and process health can be seen via the Cluster view. Green dots mean a healthy state, while orange and red dots mean there are issues with the process.
- Monitor self-managed MongoDB instances: Commands such as rs.status() for replica sets and sh.status() for sharded clusters provide a high-level status of the cluster.
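As a rough illustration of the self-managed check, the sketch below scans a document shaped like the output of rs.status() and flags members that look unhealthy. The sample document, host names, and the set of states treated as normal are assumptions made for this example; in practice the document would come from the shell or a driver.

```python
# Illustrative stand-in for the document returned by rs.status().
# Host names and values are made up for this sketch.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "db1.example.net:27017", "health": 1, "stateStr": "PRIMARY"},
        {"name": "db2.example.net:27017", "health": 1, "stateStr": "SECONDARY"},
        {"name": "db3.example.net:27017", "health": 0,
         "stateStr": "(not reachable/healthy)"},
    ],
}

# States treated as normal in this sketch (an assumption; adjust as needed).
HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def members_needing_attention(status):
    """Return (name, stateStr) pairs for members that are down or in an unexpected state."""
    flagged = []
    for member in status.get("members", []):
        down = member.get("health") != 1
        odd_state = member.get("stateStr") not in HEALTHY_STATES
        if down or odd_state:
            flagged.append((member["name"], member.get("stateStr")))
    return flagged


def has_primary(status):
    """True if any member currently reports itself as PRIMARY."""
    return any(m.get("stateStr") == "PRIMARY" for m in status.get("members", []))


print(members_needing_attention(sample_status))
print(has_primary(sample_status))
```

A check like this could run on a schedule and raise an alert when the flagged list is non-empty or no primary is present.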
Cluster operation and connection metrics
When your application is struggling or underperforming, you need to rule out the database layer as the bottleneck. The application issues connections and operations against the database, so pay close attention to its behavior.
MongoDB provides various metrics and mechanisms to identify its connection and operation patterns. On top of the active and proactive monitoring tools, Atlas provides a full alerting system and log gathering.
- Monitor with MongoDB Atlas: Atlas provides built-in features like Performance Advisor, Real-Time Performance Panel, Namespace Insights, and Query Profiler to track operations and highlight slow or heavy operations. Additionally, the Metrics tab provides many graphs that plot operations and the number of connections. See below for more details:
Metric | Definition | Importance | Signals |
Opcounters | Tracks the number of operations performed by the database, including inserts, updates, deletes, and queries. | Provides insight into the overall workload and performance of the database, helping to identify bottlenecks or performance issues. | Good: Steady or increasing counts. Bad: Sudden drops or stagnation may suggest issues with the database or application, such as connection problems or resource constraints. |
Operation execution time | Average time taken to execute database operations, measured in milliseconds. | A critical indicator of performance: longer execution times can lead to slower application response. | Good: Low and stable execution times. Bad: Increasing execution times may signal performance degradation, potentially due to resource contention or inefficient queries. |
Query executors | The number of active query executors currently processing queries in the database. | Helps assess the database's ability to handle concurrent queries and overall query performance. | Good: A balanced number of executors. Bad: High numbers may indicate contention or resource exhaustion, while low numbers could suggest underutilization or query bottlenecks. |
Query targeting | Ratio of the number of index items scanned to the number of documents returned by queries, since the previous data point for the selected sample period. | One of the best measures of how efficiently queries are being served. | Good: For frequently run queries, aim for as low a value as possible. For example, a query targeting ratio of 1 is the most efficient, as the query returns 1 document for each document it scans. Bad: Spikes arise when there are no appropriate indexes to support queries. |
Connections | Total number of active connections to the database at any given time. | Helps you understand the load on the database and ensure it can handle the required number of concurrent users. | Good: Connection counts within expected limits. Bad: High connection counts may lead to resource exhaustion, while very low counts could suggest underutilization or application issues. |
Queues | Number of operations waiting to be processed by the database, indicating the level of demand versus capacity. | Helps identify potential bottlenecks and ensure the database can handle incoming requests efficiently. | Good: Low or no queues indicate that the database is processing requests promptly. Bad: High queue lengths suggest that the database is overwhelmed, leading to increased latency and potential timeouts. |
Scan and order | Average rate of in-memory sort operations performed by the database per second. | In-memory sorts can be very expensive as they require large result sets to be buffered. | Good: 0, meaning the database didn't perform any in-memory sort operations. Bad: A large value indicates the database performed many in-memory sort operations. |
- Monitor self-managed MongoDB instances:
- You can leverage tools like mongostat and mongotop.
- Once you connect to your instance via Compass, you can use the MongoDB Compass Performance Tab, which is similar to the Atlas Real-Time Performance Panel.
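To make the opcounters and query targeting metrics more concrete, here is a minimal sketch that derives both from two snapshots of a serverStatus-style document taken some seconds apart. The field names (opcounters, metrics.queryExecutor.scanned, metrics.document.returned) follow the serverStatus output; the snapshot values and helper names are hypothetical. These counters are cumulative since process start, so rates only make sense between two snapshots from the same uptime.

```python
def opcounter_rates(earlier, later, seconds):
    """Per-second rate of each opcounter between two serverStatus snapshots."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    before, after = earlier["opcounters"], later["opcounters"]
    return {op: (after[op] - before.get(op, 0)) / seconds for op in after}


def query_targeting(earlier, later):
    """Index items scanned per document returned over the interval.

    Returns None when no documents were returned in the interval.
    """
    scanned = (later["metrics"]["queryExecutor"]["scanned"]
               - earlier["metrics"]["queryExecutor"]["scanned"])
    returned = (later["metrics"]["document"]["returned"]
                - earlier["metrics"]["document"]["returned"])
    return scanned / returned if returned else None


# Hypothetical snapshots taken 10 seconds apart (values are made up).
snapshot_t0 = {
    "opcounters": {"insert": 1000, "query": 5000, "update": 200, "delete": 50},
    "metrics": {"queryExecutor": {"scanned": 20000},
                "document": {"returned": 10000}},
}
snapshot_t1 = {
    "opcounters": {"insert": 1100, "query": 5600, "update": 230, "delete": 50},
    "metrics": {"queryExecutor": {"scanned": 26000},
                "document": {"returned": 12000}},
}

rates = opcounter_rates(snapshot_t0, snapshot_t1, seconds=10)
ratio = query_targeting(snapshot_t0, snapshot_t1)
print(rates)
print(ratio)
```

A ratio well above 1 for a frequently run workload would be a prompt to look for missing or poorly matched indexes.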
Instance hardware metrics
Hardware metrics can be used to identify which resources could be the root cause for performance issues or which need tuning and capacity re-planning.
- Monitor with MongoDB Atlas:
- The Atlas metrics tab within a cluster provides plotted graphs for the hardware metrics. These allow you to correlate them with other database metrics. See below for more details:
Metric | Definition | Importance | Signals |
Normalized system CPU | The CPU usage of all processes on the node, scaled to a range of 0-100% by dividing by the number of CPU cores. | Helps determine if the correct cluster tier is in use. An improper tier can lead to higher costs if overprovisioned or potential downtime if underprovisioned. | Good: A healthy range is between 40% and 70%. Bad: Under 40% indicates potential overprovisioning, while over 70% indicates potential underprovisioning. |
Normalized process CPU | The percentage of CPU resources utilized by the database process, normalized to account for the number of CPU cores available. | Indicates how efficiently the database is using CPU resources, helping to identify potential performance bottlenecks. | Good: Values around 50-70%. Bad: Values consistently above 80% may indicate CPU contention, while very low values could suggest underutilization. |
Disk latency | Average time taken for read and write operations on the disk, measured in milliseconds. | Critical measure of disk performance: high latency can significantly impact database performance and user experience. | Good: Low latency values (typically under 5ms). Bad: High latency (over 20ms) can signal disk bottlenecks or issues with the underlying storage infrastructure. |
Disk IOPS | Number of input/output operations per second that the disk can handle. | Clarifies the disk's ability to support database workloads, especially for read/write-heavy applications. | Good: High IOPS values (above 500 IOPS). Bad: Low IOPS (below 100 IOPS) may suggest that the disk is a bottleneck, potentially leading to performance degradation. |
Disk space free | Amount of available disk space on the storage system used by the database. | Ensures that there is sufficient space for data growth, backups, and operational efficiency. | Good: Above 20% of total capacity. Bad: Below 10% capacity can lead to performance issues, data loss, or inability to perform necessary operations. |
System memory | Total amount of RAM being used by the database process compared to the total available system memory. | Critical for performance: adequate memory usage can reduce disk I/O and improve query response times. | Good: Moderate memory usage, around 60-80%. Bad: High memory usage (over 90%) may indicate potential memory pressure, while very low usage could suggest underutilization. |
Swap usage | Amount of disk space being used as virtual memory when the system runs out of physical RAM. | High swap usage can indicate insufficient memory, leading to performance degradation as the system relies on slower disk storage. | Good: Low swap usage, ideally 0-5%. Bad: High swap usage (over 10%) can lead to significant performance issues, suggesting a need for additional memory resources. |
- Monitor self-managed MongoDB instances:
- Use your operating system tools (top, iostat, etc.).
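The thresholds in the table above can be turned into simple checks. The following is a minimal sketch: the function names and the "watch" label for the 10-20% free-space band (which the table leaves unspecified) are assumptions, and the inputs would come from whatever the operating system tools report (for instance, os.cpu_count() and shutil.disk_usage() in Python).

```python
def normalized_cpu(total_cpu_percent, cores):
    """Scale a CPU percentage summed across cores to the 0-100 range."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_cpu_percent / cores


def classify_system_cpu(normalized_percent):
    """Apply the 40%-70% healthy band for normalized system CPU."""
    if normalized_percent < 40:
        return "possibly overprovisioned"
    if normalized_percent > 70:
        return "possibly underprovisioned"
    return "healthy"


def classify_disk_free(free_bytes, total_bytes):
    """Apply the free-space thresholds: above 20% good, below 10% bad."""
    free_percent = 100 * free_bytes / total_bytes
    if free_percent < 10:
        return "bad"
    if free_percent > 20:
        return "good"
    return "watch"  # band between the two thresholds; label is an assumption


# Hypothetical readings: 440% summed CPU on an 8-core host, 50 GB free of 1 TB.
cpu = normalized_cpu(440, 8)
print(cpu, classify_system_cpu(cpu))
print(classify_disk_free(50 * 10**9, 1000 * 10**9))
```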
Replication metrics
Replication is a key aspect of MongoDB clusters' high availability and durability. The health and performance of replication need to be carefully monitored in order to maintain a healthy cluster.
- Monitor with MongoDB Atlas: The Atlas metrics tab within a cluster provides plotted graphs for the replication metrics. These allow you to correlate replication behavior with other database metrics. See below for more details:
Metric | Definition | Importance | Signals |
Replication lag | Time delay between the primary and secondary nodes in a replica set, measured in seconds. | Indicates how current the secondary nodes are compared to the primary, affecting data consistency and availability. | Good: Low lag, typically under 5 seconds. Bad: High lag (over 10 seconds) can lead to stale reads and potential data loss during failover. |
Replication oplog window | Time span of data that can be replayed from the oplog on the primary node, measured in seconds. | Ensures that secondaries can catch up with the primary; a short window may lead to data loss if a secondary falls too far behind. | Good: A longer oplog window of several hours. Bad: A short window (under 1 hour) can risk data loss if a secondary is unable to catch up. |
Replication headroom | Difference between the primary's replication oplog window and the replication lag of the secondary. | Provides insight into how much longer a secondary can keep lagging or stay disconnected before it falls off the primary's oplog. | Good: Ample headroom of several hours. Bad: Limited headroom (under 30 minutes) suggests a risk that a secondary falls too far behind to catch up from the oplog. |
Oplog GB/hour | Amount of data written to the oplog per hour, measured in gigabytes. | Assesses the write load on the primary and the capacity of the oplog to handle data changes. | Good: Moderate values of 1-5 GB/hour. Bad: High values of over 10 GB/hour may suggest that the oplog could fill up quickly, risking data loss for secondaries. |
Opcounters – repl | Counters that track the number of replication operations (inserts, updates, deletes) performed by the database. | Provides insight into the replication workload and helps identify potential bottlenecks in the replication process. | Good: Steady or increasing counts. Bad: Sudden drops or stagnation may suggest issues with replication, such as network problems or resource constraints. |
- Monitor self-managed MongoDB instances: Use MongoDB shell helpers such as rs.status(), rs.printReplicationInfo(), and rs.printSecondaryReplicationInfo().
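As one illustration of the replication lag and headroom metrics, the sketch below computes both from a document shaped like the output of rs.status(), using each member's optimeDate. It assumes headroom is taken as the primary's oplog window minus a secondary's lag; the sample members, timestamps, and the six-hour oplog window are hypothetical values.

```python
from datetime import datetime


def replication_lag_seconds(status):
    """Lag of each secondary behind the primary, based on optimeDate."""
    members = status.get("members", [])
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return {}  # no primary visible, so lag cannot be computed
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def replication_headroom_seconds(oplog_window_seconds, lag_seconds):
    """Headroom as oplog window minus lag (assumed definition for this sketch)."""
    return oplog_window_seconds - lag_seconds


# Hypothetical rs.status()-style document (values are made up).
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 10)},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 8)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2024, 1, 1, 11, 59, 55)},
    ]
}

lags = replication_lag_seconds(sample_status)
worst_lag = max(lags.values())
headroom = replication_headroom_seconds(6 * 3600, worst_lag)
print(lags)
print(headroom)
```

In this made-up sample, the second secondary would already be past the 10-second "bad" threshold from the table even though headroom remains ample.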
MongoDB provides built-in UI tools in Atlas as well as Cloud Manager and Ops Manager to help you monitor performance. MongoDB also offers some standalone tools and commands for looking at the underlying raw data.
Below are tools you can run from a host that has access to your environment, using a user with an appropriate role (such as clusterMonitor).
mongostat
mongostat is used to get a quick overview of the status of your MongoDB server instance. It’s best used for watching a single instance for a specific event as it provides a real-time view. You can use this command to monitor basic server statistics such as operation breakdown, MongoDB memory statistics, lock queues, and connections/network.
Run mongostat from the system command line rather than from the MongoDB shell. It accepts connection options and an optional polling interval in seconds.
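For scripted use, the command line can be assembled programmatically. The sketch below builds an argument list for mongostat (the same shape works for mongotop) using the --uri, --rowcount, and --json options plus a trailing polling interval. The helper name, the localhost connection string, and the chosen values are assumptions for illustration.

```python
import shlex


def build_tool_command(tool, uri, interval_seconds=1, rows=None, as_json=False):
    """Build an argv list for mongostat or mongotop.

    interval_seconds is the trailing polling interval; rows limits how many
    rows are printed before the tool exits.
    """
    if tool not in ("mongostat", "mongotop"):
        raise ValueError("tool must be 'mongostat' or 'mongotop'")
    argv = [tool, f"--uri={uri}"]
    if rows is not None:
        argv.append(f"--rowcount={rows}")
    if as_json:
        argv.append("--json")
    argv.append(str(interval_seconds))
    return argv


# Hypothetical local deployment; replace the URI with your own.
cmd = build_tool_command("mongostat", "mongodb://localhost:27017",
                         interval_seconds=5, rows=3)
print(shlex.join(cmd))
```

On a host where the MongoDB Database Tools are installed, the resulting list could be passed to subprocess.run(cmd) to execute it.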
mongotop
mongotop tracks the amount of time a MongoDB instance spends reading and writing data per collection.
Like mongostat, mongotop is run from the system command line and accepts connection options and an optional polling interval in seconds.
rs.status()
rs.status() returns the replica set status from the point of view of the member where the method is run.
db.serverStatus()
db.serverStatus() provides a document representing the current instance metrics counters. Run this command at a regular interval to collect statistics about the instance.
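Collecting serverStatus at a regular interval can be sketched as a small sampling loop. The version below takes any callable that returns a serverStatus-style document, so it can be exercised without a live server; with PyMongo, that callable could be something like lambda: client.admin.command("serverStatus"). The helper names, the fields picked out, and the canned documents are assumptions for illustration.

```python
import time


def connection_utilization(server_status):
    """Fraction of the connection limit in use (current / (current + available))."""
    conns = server_status["connections"]
    total = conns["current"] + conns["available"]
    return conns["current"] / total if total else 0.0


def collect_samples(fetch_status, count, interval_seconds, sleep=time.sleep):
    """Call fetch_status() `count` times, pausing between calls."""
    samples = []
    for i in range(count):
        status = fetch_status()
        queue = status.get("globalLock", {}).get("currentQueue", {})
        samples.append({
            "connections": status["connections"]["current"],
            "utilization": connection_utilization(status),
            "queued": queue.get("total"),
        })
        if i < count - 1:
            sleep(interval_seconds)
    return samples


# Canned documents standing in for two successive serverStatus results.
canned = iter([
    {"connections": {"current": 50, "available": 950},
     "globalLock": {"currentQueue": {"total": 0}}},
    {"connections": {"current": 200, "available": 800},
     "globalLock": {"currentQueue": {"total": 4}}},
])

# sleep is replaced with a no-op so the demonstration finishes immediately.
samples = collect_samples(lambda: next(canned), count=2,
                          interval_seconds=60, sleep=lambda _: None)
print(samples)
```

Storing these samples over time makes it possible to compare current values against a baseline rather than judging a single reading in isolation.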
dbStats
The dbStats command returns storage statistics for a given database, such as the total collection data versus storage size, the number of indexes and their size, and collection-related statistics (number of documents and collections).
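A short sketch of how the dbStats output might be condensed into a few headline numbers follows. The field names (db, collections, objects, dataSize, storageSize, indexes, indexSize) follow the dbStats output; the summary keys and sample values are hypothetical.

```python
def summarize_db_stats(stats):
    """Condense a dbStats-style document into a few headline figures."""
    data = stats["dataSize"]
    storage = stats["storageSize"]
    index = stats["indexSize"]
    on_disk = storage + index
    return {
        "db": stats["db"],
        "collections": stats["collections"],
        "documents": stats["objects"],
        "indexes": stats["indexes"],
        # Uncompressed data size relative to the space allocated on disk.
        "data_to_storage_ratio": round(data / storage, 2) if storage else None,
        # Share of on-disk footprint taken by indexes.
        "index_share": round(index / on_disk, 2) if on_disk else None,
    }


# Hypothetical dbStats-style document (values are made up).
sample_db_stats = {
    "db": "shop",
    "collections": 4,
    "objects": 20000,
    "dataSize": 4_000_000,
    "storageSize": 1_000_000,
    "indexes": 7,
    "indexSize": 250_000,
}

summary = summarize_db_stats(sample_db_stats)
print(summary)
```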
collStats
The collStats command collects statistics similar to those provided by dbStats, but at the collection level. Its output includes a count of the objects in the collection, the collection’s size, the amount of disk space consumed by the collection, and information concerning its indexes.
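At the collection level, the per-index sizes are often the most actionable part of the output. The sketch below ranks indexes by size and computes the average document size from a collStats-style document. The field names (ns, count, size, indexSizes) follow the collStats output; the namespace, index names, and values are made up for this example.

```python
def largest_indexes(coll_stats, top=3):
    """Return the `top` largest indexes as (name, bytes) pairs, biggest first."""
    sizes = coll_stats.get("indexSizes", {})
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


def avg_document_bytes(coll_stats):
    """Average uncompressed document size, or 0 for an empty collection."""
    count = coll_stats["count"]
    return coll_stats["size"] / count if count else 0


# Hypothetical collStats-style document (values are made up).
sample_coll_stats = {
    "ns": "shop.orders",
    "count": 2000,
    "size": 1_000_000,
    "storageSize": 400_000,
    "nindexes": 3,
    "indexSizes": {"_id_": 90_000, "customer_1": 150_000,
                   "status_1_created_-1": 60_000},
}

top_two = largest_indexes(sample_coll_stats, top=2)
print(top_two)
print(avg_document_bytes(sample_coll_stats))
```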
We can monitor MongoDB databases with tools such as mongostat and mongotop and with commands such as dbStats, collStats, and serverStatus. Together they provide monitoring and reporting for the database server, allowing us to track errors and database performance and make informed decisions when optimizing a database.
Summary
MongoDB provides a variety of metrics and tools to monitor your database and ensure it's running at optimal performance. From UI tools to advisors to raw-data metrics, you're covered whether you're hosting your database yourself or using MongoDB Atlas.
For more information on monitoring MongoDB databases, see the following resources.
References:
MongoDB Atlas Monitoring
MongoDB Performance
MongoDB Performance Best Practices
MongoDB Professional Services
Appendix
Cloud Manager
Cluster view
Compass Performance Tab
Connections
Cost Explorer
db.serverStatus()
dbStats
Disk IOPS
Disk latency
Disk space free
collStats
Metrics tab
mongostat
mongotop
Namespace Insights
Normalized process CPU
Normalized system CPU
Oplog GB/hour
Opcounters
Opcounters – repl
Operation execution time
Ops Manager
Performance Advisor
Queues
Query executors
Query Profiler
Query targeting
Real-Time Performance Panel
Replication headroom
Replication lag
Replication oplog window
rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()
rs.status()
Scan and order
Swap usage
System memory