Disk space usage larger than expected (compared to Storage size)

Vlad-Petru_Nitu · June 7, 2023, 10:13am

Hi community!
I am currently working on a Software Project, where I’ve set up an M10 Dedicated cluster on MongoDB Atlas. The statistics of my database look like this:

As can be seen in the picture above, my storage size is 554.83MB. Running db.stats() in mongosh gives the following statistics:


Atlas atlas-fmbop9-shard-0 [primary] news_db> db.stats()
{
  db: 'news_db',
  collections: 4,
  views: 0,
  objects: 1040164,
  avgObjSize: 758.4941961075369,
  dataSize: 788958357,
  storageSize: 581779456,
  indexes: 4,
  indexSize: 18321408,
  totalSize: 600100864,
  scaleFactor: 1,
  fsUsedSize: 7438405632,
  fsTotalSize: 10726932480,
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1686131293, i: 1 }),
    signature: {
      hash: Binary(Buffer.from("448344519386fd7a20bfe7c11263d4977bbd771b", "hex"), 0),
      keyId: Long("7241613558851567622")
    }
  },
  operationTime: Timestamp({ t: 1686131293, i: 1 })
}

As can be seen in the output above, fsUsedSize is 7438405632 bytes, or about 6.93 GiB.
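For readers who want to redo the arithmetic, here is a minimal sketch that converts the byte counts from the db.stats() output above into GiB and shows how much of the used disk lies outside news_db. The helper name diskBreakdown is made up for illustration. Keep in mind that fsUsedSize covers the whole filesystem where MongoDB stores its data, not just this one database.

```javascript
// Bytes per GiB (binary gigabyte).
const GIB = 1024 ** 3;

// Hypothetical helper: takes an object shaped like db.stats() output
// (scaleFactor 1, i.e. values in bytes) and returns sizes in GiB.
function diskBreakdown(stats) {
  return {
    dbGiB: stats.totalSize / GIB,             // data + indexes of this database
    usedGiB: stats.fsUsedSize / GIB,          // everything on the data filesystem
    totalGiB: stats.fsTotalSize / GIB,        // filesystem capacity
    outsideDbGiB: (stats.fsUsedSize - stats.totalSize) / GIB,
    usedPercent: (100 * stats.fsUsedSize) / stats.fsTotalSize,
  };
}

// Numbers copied from the db.stats() output in the post.
const breakdown = diskBreakdown({
  totalSize: 600100864,
  fsUsedSize: 7438405632,
  fsTotalSize: 10726932480,
});

for (const [key, value] of Object.entries(breakdown)) {
  console.log(key, value.toFixed(2));
}
```

In mongosh the same helper could be called as diskBreakdown(db.stats()). With the numbers above, roughly 6.4 GiB of the used disk is not attributable to news_db itself.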

I have recently received an e-mail from MongoDB Atlas stating that I have almost reached the 10GB storage limit of my cluster.
My question is: does the 10GB of “Storage” offered by my plan refer to the “Disk space used”, and if so, how can I get more insight into what makes up this large disk space usage? My gut feeling is that the large discrepancy between “storageSize” and “Disk space used” may indicate a problem in how the database is configured, but I do not know how to debug this issue properly.

news_collection document example:

Atlas atlas-fmbop9-shard-0 [primary] news_db> db.news_collection.findOne()
{
  _id: 'https://www.npr.org/2023/05/16/1176508568/biden-meets-with-congressional-leaders-as-debt-limit-deadline-looms',
  fingerprints: [
    374865, 241001,  48745, 259731,  70017, 156617, 673765, 331956,
    270995, 130199,  45513, 457174, 113615, 603545, 116103, 163152,
    408810,  18232, 522353, 541818, 182149,  84889, 438504, 257512,
    256602, 460686, 287592,  59625, 261816, 189635, 346076, 291558,
     14098, 164881,  39145, 189446, 291303,  35600, 298657,  83163,
    313783, 278993, 182793, 157549, 201840, 527436,  21982, 380645,
     56697,  48363,  67714, 113824,  94203, 144955,  37354, 199260,
    249434, 268125, 366513, 196130, 181955,  11618, 337985, 553825,
    224547, 398779, 323990, 299317,  41589, 234623, 167918, 155028,
    214631, 397630, 395398,  62818, 310355,  10561, 359734,  99014,
    641902, 155804, 224220,  91110, 109134, 162947,  21310, 322004,
    413283,  75042, 346425, 420857, 222899,  58817,  55335, 159492,
    189573, 267054, 140012, 180607,
    ... 976 more items
  ]
}

hashes_collection document example:

Atlas atlas-fmbop9-shard-0 [primary] news_db> db.hashes_collection.findOne()
{
  _id: 126992,
  urls: [
    'https://www.npr.org/2023/05/16/1176508568/biden-meets-with-congressional-leaders-as-debt-limit-deadline-looms',
    'https://100percentfedup.com/classless-chicago-mayor-told-trump-f-u-now-wants-to-fire-cop-who-flipped-off-protesters/',
    'https://5pillarsuk.com/2020/06/21/deadly-reading-stabbings-declared-a-terrorist-incident/',
    'https://5pillarsuk.com/2020/10/30/jumuah-khutbah-islam-is-not-in-crisis-french-secularism-is/',
    'https://5pillarsuk.com/2021/03/19/tanzania-has-a-new-female-muslim-leader/',
    'https://5pillarsuk.com/2021/07/26/tunisia-in-crisis-as-president-is-accused-of-launching-coup/',
    'https://92newshd.tv/about/at-least-40-killed-one-still-missing-in-turkey-mine-blast',
    'https://academicfreedom.org/letter-to-concordia-wisconsin-on-gregory-schulz/',
    'https://z3news.com/w/antichristbeast-systemglobal-governance-unveiled/',

The two examples above, one document from news_collection and one from hashes_collection, are meant to give more insight into the structure of my documents.

I am also open to (and would kindly appreciate) hearing your opinion on whether a NoSQL database storage system may not be the most viable option in my case, and whether it would be better to migrate to a relational database instead.

Thank you in advance for your time, feedback and support!

-Vlad

kevinadi · June 8, 2023, 4:44am

Hi @Vlad-Petru_Nitu welcome to the community!

This can happen if you deleted a lot of documents in those collections. WiredTiger doesn’t usually give space back to the OS, since it assumes that you’ll have more data in the future. Giving space back after deletes and then reclaiming it later would be wasted effort in this case. Instead, WiredTiger keeps that space available for reuse.

You might want to check the output of db.collection.stats() for all the collections in your database, especially the numbers for storageSize and freeStorageSize. This will give you an idea of how much space is available to be reused.
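A minimal sketch of that check, assuming a replica set where every name returned by db.getCollectionNames() is a regular collection (stats() would fail on a view, and the output is shaped differently on a sharded cluster). The function name reusableSpaceReport is made up for illustration, and freeStorageSize is treated as 0 when the field is absent.

```javascript
// Hypothetical helper: summarise storageSize vs. freeStorageSize
// for every collection of a mongosh-style database object.
function reusableSpaceReport(database) {
  return database.getCollectionNames().map((name) => {
    const stats = database.getCollection(name).stats();
    const free = stats.freeStorageSize ?? 0; // field may be missing
    return {
      name,
      storageSize: stats.storageSize,
      freeStorageSize: free,
      // Share of the on-disk file that WiredTiger could reuse.
      freePercent: stats.storageSize > 0 ? (100 * free) / stats.storageSize : 0,
    };
  });
}
```

In mongosh this would be called as reusableSpaceReport(db), optionally wrapped in console.table(...) for a readable listing. A high freePercent points to space that is held by the collection file but available for new data.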

Alternatively, you might be able to run the compact command to reclaim space. This needs to be done with care, and I encourage you to follow the procedure outlined in that page before performing this operation.

As for whether a relational database would suit you better: this depends on your use case. Some use cases are easier with a NoSQL solution, and for others it is the other way around. I don’t think there’s a single correct answer to this question for every use case, even for similar-looking ones. Are you seeing any performance issues, usability issues, or anything else with MongoDB currently?

Best regards
Kevin

Vlad-Petru_Nitu · June 8, 2023, 6:17am

Dear Kevin,

Thank you for your quick reply! In terms of performance, all the queries I had to perform worked as expected. My only concern is that the disk space usage is too large (compared to the Storage Size that I am currently using), and I am going to exceed the constraints of the plan I have paid for. I assume that this disk space usage comes from the “oplog.rs” collection of “local” database. Is there any way to mitigate this high usage level?

I should note that I have not performed any deletion operations at any point in the lifetime of my collections, but I did add 10k documents to “news_collection” (and their associated 100k hashes to “hashes_collection”) in a short period of time (~6 hours).
Thanks in advance.

Moreover, the storageSize command works, but the freeStorageSize does not return anything (I wasn’t able to find freeStorageSize in the documentation either).

Best,
–Vlad

kevinadi · June 9, 2023, 4:49am

Hi @Vlad-Petru_Nitu

By default, Atlas tries to keep at least 24 hours of oplog window. So if you just added a lot of documents within the last 24 hours, you may see an elevated oplog size in the short term. This should go down to a more workload-related size once the deployment goes into a steady state. Do you mind checking after a day or two if space is still an issue?
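One way to look at the current oplog window is db.getReplicationInfo() in mongosh, which returns fields such as timeDiffHours, usedMB and logSizeMB. The sketch below only interprets such a result. The function name describeOplogWindow is made up, and the 24-hour default is taken from the paragraph above rather than read from the cluster's configuration.

```javascript
// Hypothetical helper: interpret the document returned by
// db.getReplicationInfo() against a minimum window in hours.
function describeOplogWindow(info, minHours = 24) {
  return {
    windowHours: info.timeDiffHours, // time span between first and last oplog entry
    usedMB: info.usedMB,             // logical size of the oplog data
    configuredMB: info.logSizeMB,    // configured maximum oplog size
    coversMinimum: info.timeDiffHours >= minHours,
  };
}

// Made-up sample values, not taken from the cluster in this thread.
console.log(describeOplogWindow({ timeDiffHours: 30.5, usedMB: 988, logSizeMB: 990 }));
```

In mongosh the call would be describeOplogWindow(db.getReplicationInfo()), which needs read access to the local database.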

Alternatively, note that by default WiredTiger compresses data using Snappy, which offers a good balance between compression ratio and speed. You can change this using the server-wide block compression setting to use e.g. zlib, whose tradeoff is better compression at the cost of performance.

You can also set this compressor setting per-collection, see my response on StackOverflow on how to do this. Basically you execute the createCollection command:

db.createCollection('zlib', {storageEngine: {wiredTiger: {configString: 'block_compressor=zlib'}}})

Note that using compression on the collections doesn’t change the oplog retention of 24 hours, so you’ll still see this after a large import.
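To confirm which compressor a collection actually ended up with, one option is to read the wiredTiger.creationString field from db.collection.stats() and pull out the block_compressor entry. The sketch below is only a small parser for that comma-separated string; the function name blockCompressor is made up for illustration.

```javascript
// Hypothetical helper: extract the block_compressor value from a
// WiredTiger creation string such as "...,block_compressor=zlib,...".
// Returns null if the key is absent, "" if no compressor is set.
function blockCompressor(creationString) {
  const match = /(?:^|,)block_compressor=([^,]*)/.exec(creationString);
  return match ? match[1] : null;
}

// Shortened sample string for illustration only.
console.log(blockCompressor("access_pattern_hint=none,block_compressor=zlib,cache_resident=false"));
```

In mongosh, for the collection created above, this would be called as blockCompressor(db.zlib.stats().wiredTiger.creationString).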

Sorry, I should’ve provided more details. freeStorageSize is not a command, but a field in the output of db.collection.stats(). For example, if I have a collection named test:

> db.test.stats()
{
  ns: 'test.test',
  size: 66,
  count: 2,
  avgObjSize: 33,
  numOrphanDocs: 0,
  storageSize: 36864,
  freeStorageSize: 16384,
  capped: false,
  ....

The output shows storageSize, the storage size in bytes, and freeStorageSize, the space available for reuse. See https://www.mongodb.com/docs/manual/reference/command/dbStats/#mongodb-data-dbStats.freeStorageSize for more details.

Best regards
Kevin

Vlad-Petru_Nitu · June 9, 2023, 7:51am

Dear @kevinadi ,

Thank you for your comprehensive answer. The “Disk storage used” has not yet decreased since I last added the bulk of documents to my database (~30 hours ago), but I am willing to wait more in order to observe whether the space is still an issue or not.

When I run the stats() command on an arbitrary collection, the values I get back are different from those in the output you provided above. Here is a snippet of what .stats() returns:

  sharded: false,
  size: 1035091724,
  count: 499626,
  numOrphanDocs: 0,
  storageSize: 5177339904,
  totalIndexSize: 0,
  totalSize: 5177339904,
  indexSizes: {},
  avgObjSize: 2071,
  maxSize: 1038090240,
  ns: 'local.oplog.rs',
  nindexes: 0,
  scaleFactor: 1
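To put these numbers in context, the sketch below compares the oplog's on-disk size with the fsUsedSize and totalSize values from the db.stats() output in the first post. This is only an approximation, since the two outputs were captured on different days, and the variable names are made up for illustration.

```javascript
const GIB = 1024 ** 3;

// Values copied from the local.oplog.rs stats snippet above.
const oplog = { size: 1035091724, storageSize: 5177339904, maxSize: 1038090240 };

// Values copied from the earlier db.stats() output for news_db.
const fsUsedSize = 7438405632;
const newsDbTotalSize = 600100864;

const summary = {
  oplogOnDiskGiB: oplog.storageSize / GIB,
  // Share of all used disk space taken by the oplog file.
  oplogShareOfDiskPercent: (100 * oplog.storageSize) / fsUsedSize,
  // How much larger the file is than the logical data it holds.
  onDiskToLogicalRatio: oplog.storageSize / oplog.size,
  // Used disk not explained by the oplog or news_db.
  remainingGiB: (fsUsedSize - oplog.storageSize - newsDbTotalSize) / GIB,
};

for (const [key, value] of Object.entries(summary)) {
  console.log(key, value.toFixed(2));
}
```

On these figures the oplog occupies about 4.8 GiB, close to 70% of the used disk, and its file is about five times larger than its logical size. That would be consistent with reusable space inside the file, but the snippet does not include freeStorageSize, so this is not confirmed.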

Thank you for your time!

Best,
–Vlad