According to the mongosh documentation for insertMany(), when the method is called in unordered mode (ordered set to false) and one of the inserted documents causes an error, it is supposed to raise a BulkWriteError.
However, in PyMongo, when one of the documents is too large (exceeds the 16 MB BSON limit), insert_many raises an OperationFailure instead. Why is that? And can I somehow configure PyMongo to raise a BulkWriteError instead?
Here is my reproducible code for demonstration:
from pymongo import MongoClient
from pymongo.errors import BulkWriteError, OperationFailure

documents = [
    {'x': 'a'},               # valid
    {'x': 'b' * 20_000_000},  # invalid, causes error (too large)
    {'x': 'c'}                # valid
]

with MongoClient("mongodb://localhost:27017/") as client:
    collection = client['my_db']['my_collection']
    try:
        result = collection.insert_many(documents, ordered=False)
    except BulkWriteError:
        print('Does not enter here')  # this exception is not raised
    except OperationFailure:
        print('Enters here')          # this exception is raised
To answer the first question of “why?”:
Raising an OperationFailure in this case makes sense, especially when you look at the behavior of the code snippet you provided. Referencing the mongosh documentation:
If ordered is set to false and an insert fails, the server continues inserting records.
In the case of this bulk write, the failed operation fails in a different way: the server does not continue to write anything else, which is why the failure surfaces as an OperationFailure rather than as a per-document write error.
In terms of raising a different error, could you walk me through why you would want to raise the BulkWriteError instead of an OperationFailure? Or is it more that you would like the remaining documents to get written while the documents that are too large return in the same BulkWriteError format?
“Or is it more that you would like the remaining documents to get written while the documents that are too large return in the same BulkWriteError format?”
Exactly!!! Thank you for understanding.
I want the other valid documents (submitted in the same batch as the oversized one) to be inserted, but in my case nothing gets inserted despite ordered=False.
The reason I prefer BulkWriteError is that its details field contains ALL of the documents that caused errors (unlike OperationFailure, which reports only a single document _id). This is useful for re-inserting the oversized documents in chunks after catching the error, especially since I call insert_many with a generator of documents.
Side note: if I recreate the test case from the documentation in which two documents have an identical _id, then I do get a BulkWriteError, but not in my case, where the failure is an oversized document.
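For reference, a minimal sketch of that duplicate-_id case (assuming a local server at mongodb://localhost:27017/ and an empty my_db.my_collection): with ordered=False the valid documents do get inserted, and the exception's details dict lists every failed operation.

from pymongo import MongoClient
from pymongo.errors import BulkWriteError

documents = [
    {'_id': 1, 'x': 'a'},
    {'_id': 1, 'x': 'b'},  # duplicate _id -> per-document write error
    {'_id': 2, 'x': 'c'}
]

with MongoClient("mongodb://localhost:27017/") as client:
    collection = client['my_db']['my_collection']
    try:
        collection.insert_many(documents, ordered=False)
    except BulkWriteError as exc:
        # details['writeErrors'] lists every failed insert: its position in
        # the batch ('index'), the error code, the message, and the op itself.
        for write_error in exc.details['writeErrors']:
            print(write_error['index'], write_error['code'], write_error['errmsg'])
        print('inserted:', exc.details['nInserted'])  # the two valid documents still land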
This is behavior we plan to change; however, changing the exception type is potentially breaking, so we need to wait until PyMongo 5.0. See https://jira.mongodb.org/browse/PYTHON-4085 and related tickets.
Hi Shane, thanks for making me aware there is a ticket open for this.
In the meantime, for anyone reading this: the workaround makes my code less efficient. Instead of a generator of documents I will use a list, and I will build a reverse-lookup dict (custom_id → document) before calling insert_many, so that if an OperationFailure is raised I can look up the ID (extracted from the exception) and re-insert the oversized document in chunks. However, if there are multiple oversized documents this breaks down, because the OperationFailure only returns the ID of the last document that caused an error, not all of them.
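A rough sketch of that workaround, with the caveat that the extract_failed_id helper below is hypothetical (how to pull the offending ID out of the OperationFailure depends on the exact error details), and the documents are toy stand-ins for my real data:

from pymongo import MongoClient
from pymongo.errors import OperationFailure

def extract_failed_id(exc):
    # Hypothetical helper: parse the offending custom_id out of the
    # OperationFailure details/message; the real implementation depends
    # on the exact server error, so it is left as a placeholder here.
    raise NotImplementedError

# Materialize the documents as a list instead of a generator so they can
# be looked up again after a failure.
documents = [
    {'custom_id': 1, 'x': 'a'},
    {'custom_id': 2, 'x': 'b' * 20_000_000},  # oversized
    {'custom_id': 3, 'x': 'c'}
]
by_custom_id = {doc['custom_id']: doc for doc in documents}  # reverse lookup

with MongoClient("mongodb://localhost:27017/") as client:
    collection = client['my_db']['my_collection']
    try:
        collection.insert_many(documents, ordered=False)
    except OperationFailure as exc:
        failed_id = extract_failed_id(exc)       # only the last failing document is reported
        oversized_doc = by_custom_id[failed_id]
        # ...split oversized_doc into smaller pieces and re-insert them...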
BTW, I tried checking the sizes of the documents before inserting, but bson.encode is very slow; it is not worth it for millions of documents.
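Roughly, the check looked something like this (simplified, with the 16 MB limit hard-coded for illustration):

import bson

MAX_BSON_SIZE = 16 * 1024 * 1024  # the server's per-document limit

def is_oversized(doc):
    # Encoding the whole document just to learn its byte length is the
    # slow part; for millions of documents this takes hours.
    return len(bson.encode(doc)) > MAX_BSON_SIZE

# e.g. filtering the 'documents' list from the repro code above:
# valid_docs = [doc for doc in documents if not is_oversized(doc)]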
Thanks for the feedback. I see how working around this issue can be painful.
“re-insert the oversized document in chunks”
Interesting. If you are able to split up the oversized documents after the fact, then perhaps you could split them before calling insert_many instead? For example, in the simplified repro code the app could check the size of the field 'x' and split the document into two (or more) documents.
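As a very rough illustration against that repro code only (the chunk size and the extra 'part' field are made up for the example):

def split_document(doc, field='x', chunk_size=10_000_000):
    # Illustrative only: break one large string field across several
    # smaller documents before calling insert_many.
    value = doc[field]
    if len(value) <= chunk_size:
        return [doc]
    return [
        {**doc, field: value[i:i + chunk_size], 'part': n}
        for n, i in enumerate(range(0, len(value), chunk_size))
    ]

to_insert = [piece for doc in documents for piece in split_document(doc)]
collection.insert_many(to_insert, ordered=False)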
As I mentioned, I tried verifying the BSON sizes before inserting, but for my use case of about a million documents the operation is too slow on my server (it takes hours). Unfortunately, my documents are more complex than the simple toy example I gave (they have nested dicts), so a manual approach is not feasible.
I did try to use the same C method that PyMongo wraps to create BSON docs:
from bson._cbson import _dict_to_bson
This method is still too slow for that many documents. It makes me wonder how the MongoDB server performs the BSON document size check so quickly. If I can crack that problem, maybe I can check the sizes in advance; otherwise I'm stuck with the workaround I proposed above.