PyMongoArrow data loss

Hi, I am using PyMongoArrow for data-fetch automation in one of my projects and have run into some strange behaviour in how it fetches results. For this project I need 2-3 months of historical data. If I query each month individually, there are no issues with the data, but if a single query covers a period of two months or more, a specific data-loss pattern appears at random in some of the results: null values start showing up in one of the collection's nested fields, Item.Id, while all of the other fields (Item.Price, Item.Tax, …) for the same _id are returned without issues. Out of 8 million documents in total, about 100,000 have this problem.

If I query one of the problematic _ids on its own, the result comes back as normal (no null in Item.Id). Dumping the data in BSON format directly from MongoDB with mongodump does not reproduce the issue either, so I suspect there may be a problem in how PyMongoArrow processes the BSON. The issue happens with both find_arrow_all and aggregate_arrow_all.

Please let me know if anyone has had a similar issue or knows what might be causing this behaviour.
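
Roughly, my queries look like the sketch below (simplified; the date field name and the ranges are placeholders, not my exact code):

```python
from datetime import datetime

from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_arrow_all / aggregate_arrow_all to pymongo collections

coll = MongoClient().mydb.Collection1

# One-month window: results come back complete.
one_month = coll.find_arrow_all(
    {"Date": {"$gte": datetime(2024, 1, 1), "$lt": datetime(2024, 2, 1)}}
)

# Two-month window: some rows come back with null Item.Id even though the
# other Item.* fields for the same _id are populated normally.
two_months = coll.find_arrow_all(
    {"Date": {"$gte": datetime(2024, 1, 1), "$lt": datetime(2024, 3, 1)}}
)
```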

Hi @Leonid_Posadskov,

Sorry to hear that you are facing this issue. I have created a JIRA ticket for our team to investigate further, and we will reach out to you if we need any more information.

You can track the progress of this investigation here: https://jira.mongodb.org/browse/ARROW-250

If you have any other thoughts/feedback on PyMongoArrow, please let me know.

Thanks,
Shubham


Hi @Leonid_Posadskov, we believe the PyMongoArrow 1.6.0 release should address the underlying problem. Can you please confirm?

Hi @Steve_Silvester, I just ran a test with 1.6.0 and got the following error:

```
    arrow_table = self.db.Collection1.find_arrow_all(query=query_filter, projection = projection)
  File "/home/posadskovleonid/.local/lib/python3.9/site-packages/pymongoarrow/api.py", line 106, in find_arrow_all
    context.process_bson_stream(batch)
  File "/home/posadskovleonid/.local/lib/python3.9/site-packages/pymongoarrow/context.py", line 46, in process_bson_stream
    self.manager.process_bson_stream(stream, len(stream))
  File "pymongoarrow/lib.pyx", line 203, in pymongoarrow.lib.BuilderManager.process_bson_stream
  File "pymongoarrow/lib.pyx", line 215, in pymongoarrow.lib.BuilderManager.process_bson_stream
  File "pymongoarrow/lib.pyx", line 197, in pymongoarrow.lib.BuilderManager.parse_document
  File "pymongoarrow/lib.pyx", line 192, in pymongoarrow.lib.BuilderManager.parse_document
  File "pymongoarrow/lib.pyx", line 187, in pymongoarrow.lib.BuilderManager.parse_document
ValueError: ('Could not append raw value to', 'Items[].Id')
```

I would appreciate any advice on what to look for to solve this.

Hi @Leonid_Posadskov, that error most likely means you have run out of memory in the memory pool PyArrow uses to store the data. We are going to make the error message more obvious in the next release. For now, I would suggest adding filters to limit the number of matching documents and breaking up your queries.
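
For example, something along these lines might work as a stopgap (a rough sketch only; the date field and monthly chunking are illustrative, not a drop-in fix):

```python
from datetime import datetime

import pyarrow as pa
from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()
coll = MongoClient().mydb.Collection1

# Query one month at a time instead of the full multi-month range, then
# combine the per-month Arrow tables client-side. Passing an explicit
# schema= to find_arrow_all would keep the chunk schemas consistent.
month_starts = [datetime(2024, m, 1) for m in range(1, 4)]
chunks = [
    coll.find_arrow_all({"Date": {"$gte": start, "$lt": end}})
    for start, end in zip(month_starts, month_starts[1:])
]
table = pa.concat_tables(chunks)  # requires identical schemas across chunks
```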