I have a three-node replica set where each node is a VPS running pretty much only MongoDB. On a normal day, CPU hovers around the 50% mark and disk IO stays below 10 blocks/s.
At a very distinct point in time last night, the CPU on my primary tripled to 150% and has stayed over 100% all day long. That’s bad enough, but simultaneously the disk IO has gone through the roof, hitting nearly 8,000 blocks per second and staying there. In the nearly 24-hour period since the event, it has averaged over 6,000 blocks per second.
The mystery event doesn’t coincide with even the slightest increase in web traffic via the front end servers, nor any increased throughput of data from other servers to the database server.
I continue to evaluate whether some user behaviour caused (is causing) this, but can anyone suggest a way to analyse why these resources are still so elevated?
I have examined mongotop and also the db.currentOp console helper to no avail. Although I do get some slow queries, they don’t really explain why the disk IO is so astronomical.
Hi @timw, is your storage class NVMe SSD? We’ve encountered a similar scenario, and upgrading to local NVMe SSD significantly improved performance and lowered resource usage.
Thanks for the reply. Storage is definitely SSD (Linode). Unsure of details, but I’m unaware of any hardware upgrade. Normally this is a big deal and I get a maintenance window.
After much investigation I am none the wiser as to what exceptional event took place a few days ago, but after restarting all nodes in the replica set, the CPU and IO have settled down to normal.
What I do know is that I have MANY *.wt files, and when they’re all open the RAM gets used up and I start swapping. I know that this elevates CPU and disk IO, but under normal, daily operating usage this just doesn’t happen. It happens at startup, but the server had not been restarted. It had been running for nearly a year.
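For anyone wanting to confirm whether swapping is behind a spike like this, here is a minimal sketch, assuming a Linux host that exposes the `pswpin` and `pswpout` counters in `/proc/vmstat`. The function names are my own and purely illustrative. It samples the counters twice and reports pages swapped in and out per second:

```python
import time


def parse_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) page counters from /proc/vmstat content."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_rate(before, after, seconds):
    """Pages swapped in and out per second between two samples."""
    return ((after[0] - before[0]) / seconds, (after[1] - before[1]) / seconds)


def sample_swap_rate(interval=5.0, path="/proc/vmstat"):
    """Sample the kernel counters twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_swap_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_swap_counters(f.read())
    return swap_rate(before, after, interval)


if __name__ == "__main__":
    pages_in, pages_out = sample_swap_rate()
    print(f"swap-in: {pages_in:.1f} pages/s, swap-out: {pages_out:.1f} pages/s")
```

Sustained non-zero rates during the incident would point at memory pressure rather than query load, though this alone would not say why memory use grew.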
To attribute a spike of open files to user behaviour would require my entire user base hitting the site at once and demanding their data, including those who had long since abandoned ship. This is just very, very unlikely and other web/traffic stats show this did not happen.
I now wish I had checked lsof when I had the chance (before restart) so I’ll never know if this was the reason. And if it was the reason, why more files were open than usual.
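For next time, this is roughly the check I have in mind, written as a minimal sketch rather than a tested tool. It assumes Linux, assumes you run it as root or as the user that owns the mongod process, and the function names are hypothetical. It resolves the process's file descriptors through `/proc` and counts those pointing at `*.wt` files, which is the same information `lsof` would give:

```python
import os
import sys


def count_wt_paths(paths):
    """Count paths that look like WiredTiger data files (*.wt)."""
    return sum(1 for p in paths if p.endswith(".wt"))


def open_file_paths(pid):
    """Resolve every file descriptor of `pid` via /proc (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    paths = []
    for fd in os.listdir(fd_dir):
        try:
            paths.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            # The descriptor was closed between listdir and readlink.
            continue
    return paths


def count_open_wt_files(pid):
    """Number of *.wt files the given process currently holds open."""
    return count_wt_paths(open_file_paths(pid))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_open_wt.py <mongod pid>
    print(count_open_wt_files(int(sys.argv[1])))
```

Logging this number periodically would give a baseline, so an unusual jump in open files could be spotted before a restart wipes the evidence.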
I’m still keen to hear any suggestions, so I can dig deeper.
To be clear, I’m not asking for suggestions to increase performance in general. I’m asking what kind of event - not originating from user behaviour - could have caused this spike.
Hey @timw, any luck resolving this? We’re dealing with a similar issue where a passive replica has literally millions of .wt files, but the primary has only 330. We suspect this to be an issue with a health check process that drops and recreates a collection every minute, but we use an identical process on another server and don’t see this behaviour at all. (Identical file count on primary and replica.) We’re using MongoDB 3.6.
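A minimal sketch for comparing file counts between nodes, assuming the usual WiredTiger naming where collection files start with `collection-` and index files with `index-`. The function name and the example data directory are placeholders, so substitute your own `dbPath`:

```python
import os
from collections import Counter


def count_wt_files(db_path):
    """Count *.wt files under db_path, grouped as collection / index / other."""
    counts = Counter()
    for _root, _dirs, files in os.walk(db_path):
        for name in files:
            if not name.endswith(".wt"):
                continue
            if name.startswith("collection-"):
                counts["collection"] += 1
            elif name.startswith("index-"):
                counts["index"] += 1
            else:
                # Metadata files such as WiredTiger.wt or sizeStorer.wt.
                counts["other"] += 1
    return counts


if __name__ == "__main__":
    # "/var/lib/mongodb" is an assumed path; use the dbPath from your config.
    print(dict(count_wt_files("/var/lib/mongodb")))
```

Running it on the primary and on the replica and comparing the per-group totals should show whether the extra files are collections, indexes, or both.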
I resolved my problem with a good, old-fashioned reboot. The root cause in my case will forever be a mystery. I looked for evidence of a significant event, and found nothing.
If your replica is indeed creating files at a crazy rate, I’d expect to see elevated RAM too.
I’m not expert enough to advise you on your case, but I can say what I’d be inclined to do:
If I was confident that those million files were erroneous (i.e. expendable), I’d stop that replica, trash the data directory and restart it so that it resyncs from scratch. If it didn’t fix itself after that, and I had no other clues, I’d probably replace the host with a clean server.
As a side note, I’d also look to upgrade from 3.6 as soon as practical. I know it’s a pain playing catch-up without immediate benefit to your application, but the further behind you get, the worse it becomes, especially as MongoDB requires upgrades to be done one major release at a time (3.6 to 4.0, then 4.0 to 4.2, and so on).
Glad to hear you got your issue resolved. We’ve tried reboots with no success. It’s a large database so resyncing over the network has caused issues for us in the past. Last time we rebuilt the node we used a filesystem snapshot but that would obviously not fix this particular issue!
Incidentally, we did confirm that the problem was caused by the frequent drop/create collection process, so we’ve fixed that and no longer see new .wt files being created. Now we just need to fix the number of existing files, as it’s causing mongod startup issues! We’re looking at a few options here. We’d prefer to fix this before upgrading, although upgrading first may be something we try.
My “solution” for this is (1) to have servers with way more RAM than I need for normal operation, (2) to increase vm.swappiness before every restart, and (3) to offload inactive data to static .bson files. These workarounds allow me to more-or-less function.
The real solution is more long term, which is to either completely restructure my data model to use fewer collections, or move away from Mongo completely; whichever seems like less work.
For comparison, my database size on disk is only about 20GB, but I have ~20k collections. So syncing across network isn’t as big a problem as running out of RAM. No doubt, your trade-offs are different.