Increase max document size to at least 64MB

Initially the limit was 4MB; it was raised to 16MB in 2009.
It is 2021: we have better hardware, better networks, bigger documents, and competitors without a 16MB limit.
There are many use cases where this limit is too small nowadays, such as sensor data and large documents that hold a history of data.
I’ve opened a new JIRA Issue on this here: https://jira.mongodb.org/browse/SERVER-60040

The Jira ticket has been closed with this comment: “Thanks for your report. Please note that the SERVER project is for bugs for the MongoDB server. As this ticket appears to be an improvement request, I will now close it.”

But the main Jira page for MongoDB says:

Which JIRA project should I use to report bugs or feature requests?

  • To report potential bugs, suggest improvements, or request new features in the MongoDB database server, use Core Server (SERVER).

Hi @Ivan_Fioravanti welcome to the community!

Apologies for the confusion. The description of the SERVER project is a bit outdated. We are currently using https://feedback.mongodb.com/ to collect ideas on how to improve the server, and have dedicated the SERVER JIRA project to bug reports. In your case specifically, you would want to go to the Database section on that page.

Best regards
Kevin

Hi @Ivan_Fioravanti

Since only data that’s always used together should be stored together, I’m curious about the use case that requires documents bigger than 16MB. You mention tons of use cases, but can you be a bit more specific? Take “history of data”: this seems like a problematic example since eventually it will outgrow any document size limit, and I’m not sure I can think of a use case where you need all history (no matter how old) when reading a document.

Asya

Hi @Asya_Kamsky
There are many examples in https://jira.mongodb.org/browse/SERVER-5923
Everyone starts using MongoDB thinking “16MB is a lot! I’ll never hit this limit”, but when you reach it, it is a mess.

Also, this one would be extremely beneficial: https://jira.mongodb.org/browse/SERVER-12305. Complex aggregations with many pipeline stages can hit this limit more often than you think.
Removing this limit should be easier, so please plan at least this one.

Thanks,
Ivan

That’s because it usually is an indication of incorrect schema design.

Asya

I respectfully disagree: I have financial data for currency symbols like USDJPY (US dollar vs. Japanese yen) and USDCHF. Each daily candle contains a timestamp and 4 prices: open, high, low and close.

I’ve been implementing Mongo queries and complex analysis routines for many years and was, until now, happily using just one document per symbol. Just recently I figured out that, out of more than 3000 financial instruments, USDJPY and USDCHF are the only ones with such a long data history (dating back to January 1970) that they exceed 16MB and thus cannot be stored in their entirety.

With this 16MB limit I would now have to go through dozens of complicated methods and implement additional “boilerplate” logic to read in chunks of history data and deal with the greatly increased complexity of analysis routines that now need to see across the borders between chunks. No fun, seriously.
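
For readers weighing that boilerplate, here is a minimal sketch of one possible chunked layout, not a recommendation from anyone in this thread. It assumes each candle is a dict with a Unix timestamp under a hypothetical `ts` key, groups candles into one document-like dict per symbol and year, and exposes them again as a single continuous stream so analysis code does not have to care about bucket borders. The function and field names are illustrative only.

```python
from collections import defaultdict
from datetime import datetime, timezone
from itertools import chain


def bucket_by_year(symbol, candles):
    """Group candles into one document-like dict per (symbol, year)."""
    buckets = defaultdict(list)
    for candle in candles:
        # "ts" is an assumed field name holding seconds since the Unix epoch.
        year = datetime.fromtimestamp(candle["ts"], tz=timezone.utc).year
        buckets[year].append(candle)
    return [
        {
            "symbol": symbol,
            "year": year,
            "candles": sorted(rows, key=lambda c: c["ts"]),
        }
        for year, rows in sorted(buckets.items())
    ]


def iter_candles(bucket_docs):
    """Return an iterator over all candles, in time order, across buckets."""
    ordered = sorted(bucket_docs, key=lambda d: d["year"])
    return chain.from_iterable(doc["candles"] for doc in ordered)
```

With this shape, a routine that needs the full history iterates over `iter_candles(docs)`, while one that needs only recent data can load just the latest buckets.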

I do like to work with MongoDB and I don’t mind difficult tasks, but implementing additional logic just because there is no way to increase a tight memory “budget” seems utterly wrong. At least to me. Not to mention that all the additional logic reduces the readability of the code and lowers its performance.

If there’s a chance, could you at least provide a parameter in MongoDB with a default of 16MB, so that people who really need more have the freedom to raise it?

The only way to have a larger document size would be to change the limit in the source code, rebuild, and then run your own modified copy. I wouldn’t recommend it though, because there are a lot of other places where things can go wrong: for example, drivers also assume that documents will not be larger than 16MB.

It’s hard to imagine that it’s actually required to have the full history of a symbol in a single document. Do you always analyze the full history when querying it? If not, then it’s quite inefficient to keep it all in a single document.

Asya

Similar use case here…

I’m using MongoDB to store collections of 10,000 frames of a moving system of 200,000 particles. Each frame stores 13 floating-point values per particle. So a single frame holds 2.6 million floating-point values (20.8 MB). It is just not practical to split frames into more than 1 document.
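
The arithmetic behind those figures can be checked with a short sketch. It assumes each value is stored as an 8-byte double, which is what the 20.8 MB figure implies, and it counts raw payload only; actual BSON encoding adds overhead on top.

```python
PARTICLES = 200_000
VALUES_PER_PARTICLE = 13
BYTES_PER_DOUBLE = 8  # assumption: values are 64-bit floats
MAX_BSON_BYTES = 16 * 1024 * 1024  # the 16MB document limit, in bytes

values_per_frame = PARTICLES * VALUES_PER_PARTICLE
raw_frame_bytes = values_per_frame * BYTES_PER_DOUBLE

print(values_per_frame)                  # 2600000
print(raw_frame_bytes)                   # 20800000
print(raw_frame_bytes > MAX_BSON_BYTES)  # True
```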

I have 128 processors, 16 TB of SSD and 1 TB of RAM on the server… could anybody explain the logic behind the 16MB document limit? Sounds a bit 2009.

How are you using this data? Do you always fetch all 10,000 frames of all the values? Sure, 16MB is arbitrary, but ten thousand seems arbitrary also. Why not five thousand?

Asya

Hi Asya.

In my scenario, each frame is 2 picoseconds of motion. 10k frames are 20 nanoseconds, which is the minimum window I need to measure. Sampling less time would be statistically insufficient, and reducing sampling times would induce aliasing.

After saving I need to build n-dimensional dataframes and do some physics with them. I’m still thinking about whether I should do pipelining or do everything from Python.

Would you go for GridFS, or definitely compile with the max document size set to… say 64MB?

Best!
Pedro

I don’t fully understand the implications of forcing a bigger document size. Hence the question.

As for the usage, I need to query each particle in all 10,000 frames and compute its motion. This can be done by getting all 13 attributes of the same particle ID from all documents (if each frame is 1 document).

So 13 x 8 x 10,000 bytes is about 1MB per particle. But then each 20MB frame should fit in a document.

I’m thinking that splitting the frame data into 2 collections would do… but it’s far from ideal.
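
A rough sketch of what such a split could look like follows. It uses part documents within one frame rather than two collections, and it packs the values as raw doubles so the payload size is predictable. The `frame`, `first_particle` and `payload` field names, and the 1024 bytes reserved for the rest of the document, are assumptions made for illustration.

```python
from array import array

MAX_BSON_BYTES = 16 * 1024 * 1024
VALUES_PER_PARTICLE = 13
BYTES_PER_PARTICLE = VALUES_PER_PARTICLE * 8  # 13 doubles of 8 bytes each


def split_frame(frame_id, values, max_payload=MAX_BSON_BYTES - 1024):
    """Pack a flat list of floats (13 per particle) into part documents.

    Each part holds whole particles only, and its binary payload never
    exceeds max_payload bytes. Bytes are in the machine's native order.
    """
    if len(values) % VALUES_PER_PARTICLE:
        raise ValueError("values must contain 13 floats per particle")
    per_part = max_payload // BYTES_PER_PARTICLE  # whole particles per part
    if per_part == 0:
        raise ValueError("max_payload is smaller than one particle")
    n_particles = len(values) // VALUES_PER_PARTICLE
    parts = []
    for first in range(0, n_particles, per_part):
        last = min(first + per_part, n_particles)
        chunk = values[first * VALUES_PER_PARTICLE:last * VALUES_PER_PARTICLE]
        parts.append({
            "frame": frame_id,
            "first_particle": first,
            "payload": array("d", chunk).tobytes(),
        })
    return parts
```

Under these assumptions a 200,000-particle frame lands in two parts, which matches the intuition above. Storing the particles as nested BSON arrays instead of packed bytes would cost noticeably more per value, so the part count would need to be rechecked against the real encoded size.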

The MongoDB manual says:

We can use embedded documents and arrays to capture relationships between data in a single document structure instead of normalizing across multiple documents and collections. This single-document atomicity obviates the need for multi-document transactions for many practical use cases.

So we tend to use embedded documents, but sometimes a single document can get very large; in our project it may reach 30MB or more. We then have to split it and maintain references, which gets very complicated, and MongoDB doesn’t support multi-document transactions on a single standalone server, which is weird.
Redis supports the RedisJSON module, and its limit is 512MB. I wish MongoDB would increase this limit and support JSONPath.
I also want to know why MongoDB only supports multi-document transactions on replica sets and sharded clusters. Sometimes we just want to test a transaction but must deploy a replica set, which is troublesome.

I’m not sure what you are talking about. MongoDB has supported transactions (including across shards) for years now…

Asya

Hi @timer_izaya,

Multi-document transactions rely on the replication oplog which is not present on a standalone mongod deployment.

However, you can deploy a single node replica set for test purposes.

Regards,
Stennie

Don’t you love it when a company tells you it’s your business that’s wrong, not their product?

We have also crashed into the 16MB document size limit in another financial use case. The application design is sound; the issue is not the schema or the way in which we are using the tool. The issue is the tool, so unfortunately we have little choice but to switch to another one.

It seems that if multiple customers, in different industries, with very different use cases, are all struggling with the limitation, it would be prudent for that company to ask whether it’s really the customers who have gotten it wrong, instead of spikily insisting that they don’t know what they are doing or don’t understand their own domain.

I have no choice but to implement a generic method to split huge json.
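
A generic splitter of that kind might look like the sketch below. It is an illustration under stated assumptions, not the poster’s code: it assumes the huge JSON value is an array of items, packs items greedily, and measures compact JSON text size, which only approximates the BSON size MongoDB actually enforces, so `max_bytes` should be set well below the real limit.

```python
import json


def split_json_array(items, max_bytes):
    """Greedily pack items into chunks whose compact JSON stays within max_bytes."""
    chunks, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for item in items:
        encoded = len(json.dumps(item, separators=(",", ":")).encode("utf-8"))
        if encoded + 2 > max_bytes:
            raise ValueError("a single item exceeds max_bytes")
        extra = encoded + (1 if current else 0)  # comma before a non-first item
        if size + extra > max_bytes:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current, size, extra = [], 2, encoded
        current.append(item)
        size += extra
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be stored as its own document alongside a shared key and a sequence number, and the original array is recovered by concatenating the chunks in order.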

Same here. I have a simple website builder where we keep pages for easy CSS and text manipulation.

We’re storing documents with images and lidar point clouds, using MongoDB as a geospatial database. Each pose of the vehicle is stored as a GeoJSON point and queried by geographic location and radius. The collection reaches 100GB with 10 minutes of vehicle travel (with images smaller than 16MB). Because we need better resolution, we’re now hitting that 16MB limit with the image sizes alone. I understand the issue is a limit in BSON and that there’s another extension, GridFS, to store large BLOBs by splitting them into chunks kept in separate collections. In my opinion this makes the database and software more complex to manage, unless I’m missing something.
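
To make the GridFS idea concrete, here is a minimal pure-Python sketch of the chunking scheme it is based on. It does not talk to MongoDB and is not the driver’s implementation; it only mimics the layout, with one metadata document and a series of chunk documents. The 255 KiB chunk size matches the GridFS default, the `files_id`, `n`, `data`, `length` and `chunkSize` field names follow the GridFS collections, and the helper names are hypothetical.

```python
CHUNK_SIZE = 255 * 1024  # GridFS default chunk size, in bytes


def to_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Split a blob into one metadata document and ordered chunk documents."""
    chunks = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    meta = {"_id": file_id, "length": len(data), "chunkSize": chunk_size}
    return meta, chunks


def from_chunks(chunks):
    """Reassemble the original blob from chunk documents in any order."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))
```

In practice the official drivers handle this bookkeeping, so application code mostly sees an upload and a download call; the extra complexity is in having the image live apart from the pose document that references it.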

The reason we were sold on MongoDB was its performance, NoSQL, and geospatial indexes. We’re starting to question the true performance after using MongoDB for 4 years. Each write to the collection, with just one spherical index, takes 0.5s on an i9 with 64GB of RAM and a Samsung 970 EVO SSD. Granted, this isn’t the fastest machine out there, but we’re limited to what we can fit and power on an electric vehicle. What we’re learning about other database systems, like Cassandra, is that they’re faster and have document size limits (including BLOBs) of 2GB. I’d really hate to rewrite my data abstraction layer.

Also, the bsoncxx and mongocxx API documentation is really lacking and needs updating. It’s mostly built with doxygen, with very few descriptive comments and few if any code examples. I had to read the source code to figure out writing and reading binary. It took me way longer to figure out how to write binary than was necessary. Here’s an example of the only official documentation I could find on it.
I don’t know how anyone could get this from that documentation:

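  #include <bsoncxx/types.hpp>

  // b_binary is a non-owning view: it stores a pointer, not a copy,
  // so your_array_or_object must outlive b_blob.
  bsoncxx::types::b_binary b_blob
  {
    bsoncxx::binary_sub_type::k_binary,  // generic binary subtype
    sizeof your_array_or_object,  // size in bytes (for a pointer, pass the buffer length instead)
    reinterpret_cast<uint8_t*>(&your_array_or_object)  // address of the first byte
  };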

@Matthew_Richards what are you storing in a single document? Normally images tend to be stored separately from the various other “metadata”, in part because letting documents get really large means that all operations on the document will be slower. If you’re doing any sort of updates or partial reads of the document, then you’d be better off storing things separately.

Asya
