5 Cloud Manager Monitoring Alerts to Keep Your MongoDB Deployment on Track
Editor's note: This post was edited on June 24, 2015 to reflect the relaunch of MongoDB Management Service to MongoDB Cloud Manager. Learn more here.
MongoDB Cloud Manager is a cloud-based suite of services for managing MongoDB deployments, providing both monitoring and backup capabilities. In this post, we'll outline 5 alerts you should set up in Cloud Manager to keep your MongoDB deployment on track. We’ll explore what each alert means for a MongoDB instance, as well as how to calibrate the alert triggers to be relevant to your environment.
The goal here is to make this a repeatable process for any and all alerts. First let’s describe the general validation process you should go through when picking an alert:
Is there an absolute limit to alert on, regardless of context? (usually the answer is no, or if it is yes, you will generally want to know at some point prior to this happening)
Determine what is normal (baseline)
Determine what is worrying (warning)
Determine what is a definite problem (critical)
Determine the likelihood of false positives
The answers to these questions will dictate how to alert on something, the severity of that alert, and whether it is worth alerting on at all.
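The validation steps above can be sketched in code. What follows is a minimal illustration in Python, not how Cloud Manager itself evaluates alerts: the `AlertLevels` class, its field names, and the `consecutive` parameter (one simple way to damp false positives caused by a single stray sample) are all assumptions made for this example. It also assumes a metric where higher values are worse.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertLevels:
    """Hypothetical container for the levels chosen during validation."""
    warning: float        # value at which the metric becomes worrying
    critical: float       # value at which the metric is a definite problem
    consecutive: int = 3  # samples that must breach a level before alerting


def classify(samples: Sequence[float], levels: AlertLevels) -> str:
    """Return "ok", "warning" or "critical" for the most recent samples.

    A level only counts as breached when the last `consecutive` samples
    all reach it, so a single outlier does not trigger an alert.
    """
    recent = list(samples)[-levels.consecutive:]
    if len(recent) < levels.consecutive:
        return "ok"  # not enough data yet to make a call
    if all(s >= levels.critical for s in recent):
        return "critical"
    if all(s >= levels.warning for s in recent):
        return "warning"
    return "ok"
```

Requiring several consecutive breaches trades a little alerting delay for fewer false positives. How many samples to require is itself something to tune per environment.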
Each environment is different and will have different requirements for uptime, responsiveness, IO, CPU, and so on. There is no magic formula that will fit everyone, nor a shortcut to truly relevant alerting. Getting to that point (without blind luck) always requires a mix of research, testing and tweaking over time.
5 Recommended Alerts
Host Recovering (All, but by definition Secondary)
Repl Lag (Secondary)
Connections (All mongos, mongod)
Lock % (Primary, Secondary)
Replica (Primary, Secondary)
1. Host Recovering
The first alert is a basic one: it notifies you if any of your MongoDB instances enters the RECOVERING state (see this page for more information on states). This can be intentional, for example if you resync a secondary node, but if it happens outside of known work, then determining the cause and resolving the issue will be key to keeping a healthy replica set.
As such, there is no need to run through our general evaluation process. This is an example of that rare, definitive alert that everyone running a replica set should have as-is.
2. Repl Lag
Outside of a node with a configured slaveDelay, you do not want to have any of your nodes falling behind. This is particularly important if you are serving reads from your secondary nodes, since any lag will potentially present a vastly different view of data from a delayed node when compared to an up-to-date node.
This one is not as definitive as the first, so let’s go through the process:
Is there an absolute limit?
Yes, in an ideal world the absolute limit is relatively low - about 1 or 2 seconds. However, at this threshold there would be false positives, and there can be spikes due to a wayward statistical sample. Hence, I would recommend any lag over 240 seconds as an absolute threshold.
Normal
Lag is ideally 0, and for most sets without load issues, this will stay at 0 (barring aforementioned statistical anomalies).
Worrying
Any lag is potentially worrying, so if you can live with the spam you could set one alert low (
60% lock whereas a read heavy system may never see anything above 10%.
Worrying
Once you establish a baseline expectation, anything that doubles your lock percentage (unless it is really low, of course) should cause concern.
Critical
This is something of a judgement call, but the absolute limits mentioned above are a good starting point. Any lock percentage above 80% for a production system is certainly something you will want to know about as soon as possible.
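As a rough sketch, the lock percentage guidance above (worry when the baseline doubles, treat 80% as critical) could be turned into concrete levels like this. The function name and the `low_floor_pct` parameter are hypothetical. The 5% floor is only an assumed stand-in for "really low" and should be replaced with whatever makes sense for your own baseline.

```python
def lock_alert_levels(baseline_pct: float,
                      low_floor_pct: float = 5.0,
                      critical_pct: float = 80.0) -> dict:
    """Derive warning/critical lock % levels from a measured baseline.

    Warning is double the baseline, but never below `low_floor_pct`
    (an assumed cutoff for "really low") and never above the critical level.
    """
    if not 0.0 <= baseline_pct <= 100.0:
        raise ValueError("baseline_pct must be between 0 and 100")
    warning = max(2.0 * baseline_pct, low_floor_pct)
    warning = min(warning, critical_pct)
    return {"warning": warning, "critical": critical_pct}
```

For a read-heavy system with a 10% baseline this gives a warning level of 20%, while a system that already sits at 60% has its warning level capped at the 80% critical mark.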
5. Replica
This statistic is less obvious in terms of its name and purpose than some of the others we have looked at above. Replica is calculated by taking the timestamp of the most recent entry in an oplog, and subtracting the timestamp of the oldest entry in the oplog. This then represents what is commonly referred to as the “oplog window” - which is the amount of time, based on your current traffic, that it will take to completely roll over the oplog.
Hence this is directly derived from three factors: how many operations you are inserting into the oplog, what size those operations are, and the size of the oplog capped collection. A good rule of thumb is to take your normal maximum maintenance window (let’s say 8 hours) and then multiply it by 3 (24 hours) for safety. This should then be the minimum length of your oplog.
Resizing the oplog is non-trivial, so you want to catch issues here as soon as possible and take remedial action before the value gets too low.
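The oplog window calculation and the rule of thumb for its minimum length can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for the example, and the two timestamps are assumed to have already been read from the oldest and newest oplog entries.

```python
from datetime import datetime, timedelta


def oplog_window(oldest_entry: datetime, newest_entry: datetime) -> timedelta:
    """Time span covered by the oplog: newest timestamp minus oldest."""
    if newest_entry < oldest_entry:
        raise ValueError("newest_entry must not be earlier than oldest_entry")
    return newest_entry - oldest_entry


def minimum_window(maintenance_window: timedelta,
                   safety_factor: int = 3) -> timedelta:
    """Rule of thumb: longest normal maintenance window times a safety factor."""
    return maintenance_window * safety_factor


# Example with made-up timestamps: a 30 hour window against an 8 hour
# maintenance window (minimum 24 hours) is sufficient.
window = oplog_window(datetime(2013, 7, 30, 0, 0), datetime(2013, 7, 31, 6, 0))
sufficient = window >= minimum_window(timedelta(hours=8))
```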
Is there an absolute limit?
Not really. There is an argument to be made that any value under an hour is too low, but that would usually represent a poor initial configuration choice. You generally want to catch a low replica value long before it drops below an hour.
Normal
Normality will depend on the factors mentioned. Similar to the connection rate, you will want to benchmark the range of replica values you get at low/normal/peak times.
Worrying
A value 25% below the replica value you normally see during your peak traffic period would be worrying.
Critical
A value 50% below the replica value you normally see during your peak traffic period should be considered critical and acted on immediately.
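The worrying and critical levels for the replica value can be expressed relative to the peak-time baseline, as in the sketch below. The function name and the choice of hours as the unit are assumptions for illustration. The baseline is whatever replica value you measured during your own peak traffic period.

```python
def classify_replica(current_hours: float, peak_baseline_hours: float) -> str:
    """Compare the current oplog window with the peak-traffic baseline.

    25% or more below the baseline is a warning,
    50% or more below the baseline is critical.
    """
    if peak_baseline_hours <= 0:
        raise ValueError("peak_baseline_hours must be positive")
    if current_hours <= 0.50 * peak_baseline_hours:
        return "critical"
    if current_hours <= 0.75 * peak_baseline_hours:
        return "warning"
    return "ok"
```

With a peak-time baseline of 24 hours, a window that shrinks to 18 hours would be classed as a warning and one that shrinks to 12 hours as critical.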
Conclusion
The methods above are valid for evaluating any such alert in (or outside of) Cloud Manager. There may be a key metric that is not included above that surprises you (Queues for example). An alert can be added for any such metric in essentially the same way and, as with the levels you choose to alert on, every system will have the alerts that are most relevant to it.
As implied by the “worrying” versus “critical” evaluation, you are also free to have more than one alert, at different levels. Imagine an alert that only hits the ops team when lock percentage goes above 60% (perhaps as a prelude to a capacity discussion), and then an alert for a much wider audience when the more critical level of 80% is breached and all eyes are needed on the issue.
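The idea of several alerts on one metric, each with its own audience, can be pictured as a small routing table. The sketch below is purely illustrative and is not a Cloud Manager configuration format: the rule list, the audience names, and the `audiences_for` function are all hypothetical.

```python
from typing import List, Tuple

# (threshold in lock %, audience) pairs; the audience names are hypothetical.
LOCK_ALERT_RULES: List[Tuple[float, str]] = [
    (60.0, "ops-team"),
    (80.0, "everyone"),
]


def audiences_for(lock_pct: float,
                  rules: List[Tuple[float, str]] = LOCK_ALERT_RULES) -> List[str]:
    """Return every audience whose threshold the value has gone above."""
    return [audience
            for threshold, audience in sorted(rules)
            if lock_pct > threshold]
```

A lock percentage of 65% would notify only the ops team, while 85% would notify both groups.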
To get started with monitoring, alerting, and backup, you can try MongoDB Cloud Manager free for 30 days at cloud.mongodb.com.
July 31, 2013