Hi MongoDB Community,
I hope this message finds you well. I wanted to raise awareness of a critical issue we’ve been experiencing in our PSA architecture, particularly when one of the secondary nodes goes down.
The problem lies in the WiredTigerHS.wt file, which undergoes excessive growth, eventually occupying the entire disk space and causing outages. I’ve raised a Mongo JIRA server ticket (https://jira.mongodb.org/browse/SERVER-84108) to address this issue. However, the Mongo Team is currently facing challenges in allocating resources to work on it.
To help expedite the resolution, I’d like to provide a straightforward set of steps to reproduce the problem:
1. Set up a PSA (Primary-Secondary-Arbiter) replica set.
2. Insert approximately 20,000 documents with a reasonably sized payload.
3. Bring down one of the secondary nodes.
4. Run updateMany across all 20,000 documents, as shown below:
   db.session.updateMany({}, { $set: { status: "Modified" }})
   db.session.updateMany({}, { $set: { status2: "Modified" }})
5. Repeat the updates a few times.
6. Observe that the WiredTigerHS.wt file grows significantly with each update.
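The reproduction steps can be sketched as a small script for the shell. This is only a sketch under stated assumptions: the helper names buildSessionDocs and runUpdateRounds, the 1 KiB payload, and the round counter appended to the values are all illustrative choices, not part of the original commands. The counter is there because repeating an identical $set may be treated as a no-op, so each round writes a different value.

```javascript
// Hypothetical helpers; only db.session, updateMany, and the two $set fields
// (status, status2) come from the steps above.

// Build `count` documents, each carrying a filler payload of `payloadBytes` characters.
function buildSessionDocs(count, payloadBytes) {
  const payload = "x".repeat(payloadBytes);
  const docs = [];
  for (let i = 0; i < count; i++) {
    docs.push({ _id: i, status: "New", payload: payload });
  }
  return docs;
}

// Issue the two updateMany calls `rounds` times against `coll`.
// Returns the number of updateMany calls that were issued.
function runUpdateRounds(coll, rounds) {
  let calls = 0;
  for (let r = 0; r < rounds; r++) {
    // Assumption: vary the value per round so every pass modifies the documents.
    coll.updateMany({}, { $set: { status: "Modified" + r } });
    coll.updateMany({}, { $set: { status2: "Modified" + r } });
    calls += 2;
  }
  return calls;
}

// Usage in the shell (assumed sizes):
//   db.session.insertMany(buildSessionDocs(20000, 1024));
//   ...stop the secondary, then:
//   runUpdateRounds(db.session, 5);
```

Checking the size of WiredTigerHS.wt in the dbPath between rounds should show the growth described above.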
After going through the MongoDB source code, I found the WiredTiger engine runtime config parameter history_store.file_max, which sets the maximum size of the history store file. However, when the file grows past this value, WiredTiger triggers a PANIC and mongod aborts and restarts (see the log at the end of this post). Consequently, this setting does not provide an effective way to prevent the disk consumption problem. I set the limit to 100 MiB (104857600 bytes) as follows:
mongo host1:27717 --eval 'db.adminCommand( { "setParameter": 1, "wiredTigerEngineRuntimeConfig": "history_store=(file_max=104857600)"})'
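For anyone who wants to try a different limit, the value passed to file_max is a byte count. Here is a minimal sketch that builds the same command document from a size in MiB; the function name historyStoreCapCommand is hypothetical, and only setParameter, wiredTigerEngineRuntimeConfig, and history_store=(file_max=...) come from the command above.

```javascript
// Build the setParameter command document for a history store cap given in MiB.
// Hypothetical helper; it only formats the config string, it does not run anything.
function historyStoreCapCommand(maxMiB) {
  const bytes = maxMiB * 1024 * 1024; // 100 MiB -> 104857600 bytes
  return {
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "history_store=(file_max=" + bytes + ")",
  };
}

// Usage in the shell:
//   db.adminCommand(historyStoreCapCommand(100));
```

Keep in mind that, as described above, exceeding this cap led to a PANIC in my tests, so this is useful for reproducing the crash rather than for protecting the disk.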
I’ve sought assistance through the JIRA ticket, but due to the lack of response from the Mongo Team, I’m reaching out to the community for additional insights or potential solutions. It’s crucial to highlight that the minSnapshotHistoryWindowInSeconds parameter also doesn’t seem to make any difference.
Any guidance or assistance regarding this issue would be greatly appreciated.
Thanks,
Venkataraman
mongod log from the PANIC and restart after WiredTigerHS.wt exceeded history_store.file_max:
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F", "c":"-", "id":23089, "ctx":"thread61450","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":574}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F", "c":"-", "id":23090, "ctx":"thread61450","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"thread61451","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1707172787:938206][3050376:0x7fe300ded700], file:collection-0--3783294461088769059.wt, eviction-server: __wt_hs_insert_updates, 804: WiredTigerHS: file size of 106291200 exceeds maximum size 104857600: WT_PANIC: WiredTiger library panic"}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F", "c":"CONTROL", "id":6384300, "ctx":"thread61450","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"thread61452","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1707172787:938428][3050376:0x7fe3005ec700], file:collection-0--3783294461088769059.wt, eviction-server: __wt_hs_insert_updates, 804: WiredTigerHS: file size of 106291200 exceeds maximum size 104857600: WT_PANIC: WiredTiger library panic"}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F", "c":"-", "id":23089, "ctx":"thread61452","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":574}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F", "c":"-", "id":23090, "ctx":"thread61452","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F", "c":"-", "id":23089, "ctx":"thread61451","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":574}}
{"t":{"$date":"2024-02-05T22:39:47.938+00:00"},"s":"F", "c":"-", "id":23090, "ctx":"thread61451","msg":"\n\n***aborting after fassert() failure\n\n"}
{"t":{"$date":"2024-02-05T22:39:48.056+00:00"},"s":"I", "c":"CONNPOOL", "id":22576, "ctx":"MirrorMaestro","msg":"Connecting","attr":{"hostAndPort":"sessionmgr03:27717"}}
{"t":{"$date":"2024-02-05T22:39:48.168+00:00"},"s":"I", "c":"CONTROL", "id":31380, "ctx":"thread61450","msg":"BACKTRACE","attr":{"bt":{"backtrace":[{"a":"5623FD042365","b":"5623F90C6000","o":"3F7C365","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.361","s+":"215"},{"a":"5623FD044DE9","b":"5623F90C6000","o":"3F7EDE9","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"5623FD03D206","b":"5623F90C6000","o":"3F77206","s":"abruptQuit","s+":"66"},{"a":"7FE308F21D10","b":"7FE308F0F000","o":"12D10","s":"funlockfile","s+":"50"},{"a":"7FE308B98ACF","b":"7FE308B4A000","o":"4EACF","s":"gsignal","s+":"10F"},{"a":"7FE308B6BEA5","b":"7FE308B4A000","o":"21EA5","s":"abort","s+":"127"},{"a":"5623FA4DAAB9","b":"5623F90C6000","o":"1414AB9","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj","s+":"F6"},{"a":"5623F9FB2388","b":"5623F90C6000","o":"EEC388","s":"_ZN5mongo12_GLOBAL__N_141mdb_handle_error_with_startup_suppressionEP18__wt_event_handlerP12__wt_sessioniPKc.cold.1149","s+":"16"},{"a":"5623FA7EF083","b":"5623F90C6000","o":"1729083","s":"__eventv","s+":"403"},{"a":"5623F9FC49CD","b":"5623F90C6000","o":"EFE9CD","s":"__wt_panic_func","s+":"BB"},{"a":"5623F9FD0586","b":"5623F90C6000","o":"F0A586","s":"__wt_hs_insert_updates.cold.11","s+":"55"},{"a":"5623FA7CE218","b":"5623F90C6000","o":"1708218","s":"__rec_write_wrapup","s+":"398"},{"a":"5623FA7CFACA","b":"5623F90C6000","o":"1709ACA","s":"__wt_reconcile","s+":"6DA"},{"a":"5623FA79CFC5","b":"5623F90C6000","o":"16D6FC5","s":"__wt_evict","s+":"1935"},{"a":"5623FA793762","b":"5623F90C6000","o":"16CD762","s":"__evict_page","s+":"6A2"},{"a":"5623FA794028","b":"5623F90C6000","o":"16CE028","s":"__evict_lru_pages","s+":"78"},{"a":"5623FA798E14","b":"5623F90C6000","o":"16D2E14","s":"__wt_evict_thread_run","s+":"74"},{"a":"5623FA7FFE09","b":"5623F90C6000","o":"1739E09","s":"__thread_run","s+":"39"},{"a":"7FE308F171CA","b":"7FE308F0F000","o":"81CA","s":"start_thread","s+":"EA"},{"a":"7FE308B83E73","b":"7FE308B4A000","o":"39E73","s":"clone","s+":"43"}],"processInfo":{"mongodbVersion":"5.0.20","gitVersion":"2cd626d8148120319d7dca5824e760fe220cb0de","compiledModules":[],"uname":{"sysname":"Linux","release":"4.18.0-477.27.1.el8_8.x86_64","version":"#1 SMP Thu Sep 21 06:49:25 EDT 2023","machine":"x86_64"},"somap":[{"b":"5623F90C6000","elfType":3,"buildId":"A8EA7166EFC23E0D3802F8AFEFEF2186CF5E5BBD"},{"b":"7FE308F0F000","path":"/lib64/libpthread.so.0","elfType":3,"buildId":"76F163FDBAA9E91050B456A7E5EA8AC78563BD29"},{"b":"7FE308B4A000","path":"/lib64/libc.so.6","elfType":3,"buildId":"44ED73CF68E8FA608DA3B301146C81A0A77A5619"}]}}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FD042365","b":"5623F90C6000","o":"3F7C365","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.361","s+":"215"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FD044DE9","b":"5623F90C6000","o":"3F7EDE9","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FD03D206","b":"5623F90C6000","o":"3F77206","s":"abruptQuit","s+":"66"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308F21D10","b":"7FE308F0F000","o":"12D10","s":"funlockfile","s+":"50"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308B98ACF","b":"7FE308B4A000","o":"4EACF","s":"gsignal","s+":"10F"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308B6BEA5","b":"7FE308B4A000","o":"21EA5","s":"abort","s+":"127"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA4DAAB9","b":"5623F90C6000","o":"1414AB9","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj","s+":"F6"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623F9FB2388","b":"5623F90C6000","o":"EEC388","s":"_ZN5mongo12_GLOBAL__N_141mdb_handle_error_with_startup_suppressionEP18__wt_event_handlerP12__wt_sessioniPKc.cold.1149","s+":"16"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7EF083","b":"5623F90C6000","o":"1729083","s":"__eventv","s+":"403"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623F9FC49CD","b":"5623F90C6000","o":"EFE9CD","s":"__wt_panic_func","s+":"BB"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623F9FD0586","b":"5623F90C6000","o":"F0A586","s":"__wt_hs_insert_updates.cold.11","s+":"55"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7CE218","b":"5623F90C6000","o":"1708218","s":"__rec_write_wrapup","s+":"398"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7CFACA","b":"5623F90C6000","o":"1709ACA","s":"__wt_reconcile","s+":"6DA"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA79CFC5","b":"5623F90C6000","o":"16D6FC5","s":"__wt_evict","s+":"1935"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA793762","b":"5623F90C6000","o":"16CD762","s":"__evict_page","s+":"6A2"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA794028","b":"5623F90C6000","o":"16CE028","s":"__evict_lru_pages","s+":"78"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA798E14","b":"5623F90C6000","o":"16D2E14","s":"__wt_evict_thread_run","s+":"74"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"5623FA7FFE09","b":"5623F90C6000","o":"1739E09","s":"__thread_run","s+":"39"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308F171CA","b":"7FE308F0F000","o":"81CA","s":"start_thread","s+":"EA"}}}
{"t":{"$date":"2024-02-05T22:39:48.184+00:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"thread61450","msg":"Frame","attr":{"frame":{"a":"7FE308B83E73","b":"7FE308B4A000","o":"39E73","s":"clone","s+":"43"}}}
{"t":{"$date":"2024-02-05T22:40:01.827+00:00"},"s":"I", "c":"CONTROL", "id":20698, "ctx":"-","msg":"***** SERVER RESTARTED *****"}
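In case it helps anyone triaging similar logs, here is a minimal sketch that pulls the reported and maximum file sizes out of the "WiredTiger error" entries above. It is not part of any MongoDB tooling; the function name parseHistoryStoreOverflow and the returned field names are illustrative assumptions, and the pattern is based only on the message text seen in this log.

```javascript
// Extract the history store sizes from one structured mongod log line.
// Returns null for any line that is not the "exceeds maximum size" WiredTiger error.
function parseHistoryStoreOverflow(line) {
  let entry;
  try {
    entry = JSON.parse(line);
  } catch (e) {
    return null; // not a complete JSON log entry
  }
  const message = entry && entry.attr && entry.attr.message;
  if (typeof message !== "string") {
    return null;
  }
  // Pattern assumed from the message text in the log above.
  const m = message.match(/WiredTigerHS: file size of (\d+) exceeds maximum size (\d+)/);
  if (!m) {
    return null;
  }
  const fileSize = Number(m[1]);
  const maxSize = Number(m[2]);
  return {
    timestamp: entry.t ? entry.t.$date : null,
    fileSize: fileSize,
    maxSize: maxSize,
    overshoot: fileSize - maxSize, // bytes above the configured cap
  };
}
```

Applied to the two STORAGE error lines above, this reports a file size of 106291200 against a cap of 104857600, an overshoot of 1433600 bytes.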