Compacting a collection, repeated log lines with "compaction failed with EBUSY"

We’ve recently deleted a large number of documents (over 1 billion) from a collection. We’d like to reclaim that disk space, as we’re not likely to let the collection grow this large again.

I’ve run the compact command against the aforementioned collection on a secondary, and I’m watching the logs. I’m struggling to tell whether it is progressing, and one thing that stands out is that it keeps logging lines like the following:

<snip>
{"t":{"$date":"2024-12-03T13:56:37.389+00:00"},"s":"I",  "c":"WTCMPCT",  "id":22430,   "ctx":"conn906278","msg":"WiredTiger message","attr":{"message":{"ts_sec":1733234197,"ts_usec":389015,"thread":"1:0x7f3370dfd640","session_dhandle_name":"file:collection-44-7455917669258853182.wt","session_name":"WT_SESSION.compact","category":"WT_VERB_COMPACT_PROGRESS","category_id":8,"verbose_level":"DEBUG","verbose_level_id":1,"msg":" compacting collection-44-7455917669258853182.wt for 11044 seconds; reviewed 174476 pages, skipped 174476 pages, rewritten 0pages"}}}
{"t":{"$date":"2024-12-03T13:56:37.390+00:00"},"s":"I",  "c":"WTCMPCT",  "id":22430,   "ctx":"conn906278","msg":"WiredTiger message","attr":{"message":{"ts_sec":1733234197,"ts_usec":390644,"thread":"1:0x7f3370dfd640","session_dhandle_name":"file:collection-44-7455917669258853182.wt","session_name":"WT_SESSION.compact","category":"WT_VERB_COMPACT_PROGRESS","category_id":8,"verbose_level":"DEBUG","verbose_level_id":1,"msg":" compacting collection-44-7455917669258853182.wt for 11044 seconds; reviewed 174577 pages, skipped 174577 pages, rewritten 0pages"}}}
{"t":{"$date":"2024-12-03T13:56:38.204+00:00"},"s":"I",  "c":"WTCMPCT",  "id":22430,   "ctx":"conn906278","msg":"WiredTiger message","attr":{"message":{"ts_sec":1733234198,"ts_usec":204899,"thread":"1:0x7f3370dfd640","session_dhandle_name":"file:collection-44-7455917669258853182.wt","session_name":"WT_SESSION.compact","category":"WT_VERB_COMPACT_PROGRESS","category_id":8,"verbose_level":"DEBUG","verbose_level_id":1,"msg":" compacting collection-44-7455917669258853182.wt for 11044 seconds; reviewed 174596 pages, skipped 174596 pages, rewritten 0pages"}}}
{"t":{"$date":"2024-12-03T13:56:39.086+00:00"},"s":"I",  "c":"WTCMPCT",  "id":22430,   "ctx":"conn906278","msg":"WiredTiger message","attr":{"message":{"ts_sec":1733234199,"ts_usec":86908,"thread":"1:0x7f3370dfd640","session_name":"WT_SESSION.compact","category":"WT_VERB_COMPACT","category_id":7,"verbose_level":"INFO","verbose_level_id":0,"msg":"Data handle compaction failed with EBUSY but the cache is not stuck. Will give it another go."}}}

This keeps repeating. The page count is increasing, but I cannot tell whether compaction is actually doing anything useful, and I can’t see any space being reclaimed so far. Any idea whether this is actually progressing, or is it stuck in a loop retrying the same thing over and over?

We’re running MongoDB v6.0.19, self-hosted in GKE.
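
For reference, this is roughly how the compact was started after connecting directly to the secondary with mongosh (database and collection names below are placeholders, not our real ones):

  // Run against the secondary we want to reclaim space on;
  // "myDatabase" and "myCollection" are placeholders.
  db.getSiblingDB("myDatabase").runCommand({ compact: "myCollection" })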

The logs indicate the compaction process is scanning pages but skipping them rather than rewriting them, likely because the pages it wants to rewrite are in use or locked (hence the EBUSY). Compaction retries automatically and can take a long time on large collections, especially if there is ongoing activity. To ensure progress:

  1. Confirm the secondary is truly idle to minimize page contention.
  2. Monitor disk I/O and check whether disk space changes after compaction completes (see the sketch after this list).
  3. If progress stalls for extended periods, consider using resync from a primary to fully rebuild the secondary’s data files.
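
As a rough way to see whether compact is actually freeing anything inside the data file, you can compare what WiredTiger reports as reusable space in collStats before and during the run. A minimal sketch, assuming placeholder database and collection names:

  // "file bytes available for reuse" is space inside the .wt file that
  // WiredTiger can hand back; compact tries to truncate the file to release it.
  const stats = db.getSiblingDB("myDatabase").getCollection("myCollection").stats();
  print(stats.storageSize, stats.wiredTiger["block-manager"]["file bytes available for reuse"]);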

Thanks for the reply.

Regarding your points:

Confirm the secondary is truly idle to minimize page contention.

We don’t send any queries to the secondaries for this collection, so there are no additional reads as far as I know. The only activity is replication.
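
In case it helps, this is the kind of check I’ve been running on the secondary to confirm nothing unexpected is active (just a sketch, not exact output):

  // List active operations; other than replication and the compact itself
  // I would expect nothing touching the collection in question.
  db.currentOp({ active: true }).inprog.forEach(op => print(op.opid, op.op, op.ns));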

Monitor disk I/O and check if disk space changes after compaction completes.

I’m not sure how long it’s going to take to get there - it has been running for 24 hours so far. Based on disk utilization graphs, it looks like there have only been 2 noticeable drops, and the amount of space reclaimed is only a small percentage of what I believe should be reclaimable.

If progress stalls for extended periods, consider using resync from a primary to fully rebuild the secondary’s data files.

We’ll consider that… unfortunately it sounds a bit heavy-handed; I was really hoping compact would avoid the need for something like this.


Assuming the compaction process continues to stall, is it fine to stop it via db.killOp()?
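
If we do go that route, I’m assuming it would look roughly like the following, i.e. find the opid of the compact via currentOp and pass it to killOp (a sketch; the currentOp filter is my guess at how the operation shows up):

  // Locate the running compact on this node and kill it by opid.
  db.currentOp({ "command.compact": { $exists: true } }).inprog.forEach(op => {
    print("killing opid", op.opid);
    db.killOp(op.opid);
  });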