We have a MongoDB replica set configured with sharding implemented as well. The system has been running for more than two years. Over the last three days, however, our secondary servers have started crashing, pushing the entire load onto the primary, which eventually crashed as well and caused a complete application outage.
We are running MongoDB server version 4.2.1, and our Node application connects to the mongos instance using the mongoose library, version 5.1.0.
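For reference, the application's connection setup looks roughly like the sketch below (the URI and the poolSize value are placeholders, not our exact production settings):

```js
const mongoose = require('mongoose');

// Simplified sketch of our setup; 'mongos-host' and the poolSize value
// are placeholders. In mongoose 5.x, poolSize caps the number of sockets
// the underlying driver keeps open per connection (the default is 5).
mongoose.connect('mongodb://mongos-host:27017/ourdb', {
  poolSize: 10, // illustrative value only
});
```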
Looking at the mongod log, we found the connection count steadily climbing until it reached roughly 350 connections:
2021-07-01T16:11:08.923+0000 I NETWORK [listener] connection accepted from 111.111.111.11:42314 #165002 (329 connections now open)
2021-07-01T16:11:08.923+0000 I NETWORK [listener] connection accepted from 111.111.111.11:42316 #165003 (330 connections now open)
2021-07-01T16:11:08.923+0000 I NETWORK [conn165002] received client metadata from 111.111.111.11:42314 conn165002: { driver: { name: "NetworkInterfaceTL", version: "4.2.1" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "18.04" } }
2021-07-01T16:11:08.923+0000 I NETWORK [conn165003] received client metadata from 111.111.111.11:42316 conn165003: { driver: { name: "NetworkInterfaceTL", version: "4.2.1" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "18.04" } }
2021-07-01T16:11:08.924+0000 I NETWORK [listener] connection accepted from 111.111.111.11:42318 #165004 (331 connections now open)
2021-07-01T16:11:08.924+0000 I NETWORK [listener] connection accepted from 111.111.111.11:42320 #165005 (332 connections now open)
2021-07-01T16:11:08.925+0000 I NETWORK [conn165004] received client metadata from 111.111.111.11:42318 conn165004: { driver: { name: "NetworkInterfaceTL", version: "4.2.1" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "18.04" } }
2021-07-01T16:11:08.925+0000 I NETWORK [conn165005] received client metadata from 111.111.111.11:42320 conn165005: { driver: { name: "NetworkInterfaceTL", version: "4.2.1" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "18.04" } }
2021-07-01T16:11:08.925+0000 I NETWORK [listener] connection accepted from 111.111.111.11:42322 #165006 (333 connections now open)
2021-07-01T16:11:08.925+0000 I NETWORK [listener] connection accepted from 111.111.111.11:42324 #165007 (334 connections now open)
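For what it's worth, the same counts can be read live from the mongo shell via serverStatus (the output values below are illustrative, not taken from our incident):

```js
// Run in the mongo shell against the affected node.
// 'current' = open connections, 'available' = how many more the
// server will accept before refusing new ones.
db.serverStatus().connections
// e.g. { "current" : 334, "available" : 690, "totalCreated" : 165007 }
```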
and finally, just before the mongod server crashed, messages like the following:
2021-07-01T16:11:11.558+0000 W NETWORK [listener] Error accepting new connection TooManyFilesOpen: error in creating eventfd: Too many open files
2021-07-01T16:11:11.558+0000 W NETWORK [listener] Error accepting new connection TooManyFilesOpen: error in creating eventfd: Too many open files
2021-07-01T16:11:11.827+0000 W NETWORK [listener] Error accepting new connection TooManyFilesOpen: error in creating eventfd: Too many open files
2021-07-01T16:11:11.827+0000 W NETWORK [listener] Error accepting new connection TooManyFilesOpen: error in creating eventfd: Too many open files
2021-07-01T16:11:12.014+0000 W NETWORK [listener] Error accepting new connection TooManyFilesOpen: error in creating eventfd: Too many open files
2021-07-01T16:11:12.014+0000 W NETWORK [listener] Error accepting new connection TooManyFilesOpen: error in creating eventfd: Too many open files
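Since the error points at file descriptors running out, we assume the limit that matters is the one on the running mongod process itself, which can differ from the shell's ulimit when mongod is started by systemd/init. We checked it via /proc (pid lookup with pgrep shown here; adjust for your setup):

```sh
# Effective limits of the running mongod process
cat /proc/$(pgrep -x mongod)/limits | grep 'open files'

# Number of file descriptors it currently holds
sudo ls /proc/$(pgrep -x mongod)/fd | wc -l
```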
Please let us know how to identify the root cause, and share any suggestions for improving our MongoDB infrastructure so we can avoid such outages in the future.
Additionally, please find below the output of the ulimit -a command:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 254249
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 254249
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
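We notice the open files limit above is only 1024, well below the 64000 that MongoDB's production notes recommend. If that is the root cause, we assume the fix on a systemd-managed Ubuntu host would look something like the drop-in below (the file path assumes mongod runs as a systemd service); please confirm whether this is the right approach:

```ini
# /etc/systemd/system/mongod.service.d/limits.conf
[Service]
LimitNOFILE=64000
```

followed by:

```sh
sudo systemctl daemon-reload
sudo systemctl restart mongod
```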