Looking for an approach to reliably extract data from a MongoDB collection

Dmitriy_Bykov · February 17, 2025, 3:31pm

I am modeling a system that should use a MongoDB collection as an insert-only data storage, and I’m looking for a way to reliably query stored documents for processing.

The process of extracting data should be able to pause and resume data consumption from the point it stopped last time, and it may be restarted from the beginning at any time.

Change streams are nice, but they only work within oplog, so extracting older documents is impossible.

Using a monotonically increasing field and a query like

{id_field: {$gt: last_processed_value}}, {$sort: {id_field: 1}}

seems like a suitable solution, but I’m struggling to find a way to generate auto-incrementing values in MongoDB reliably.

There is a default timestamp-based _id field, but its temporal nature makes it susceptible to system clock skew, and clock drifting up to few hundred milliseconds during NTP sync is not something unheard of. If a later inserted document would have its timestamp value lower than a previously inserted, such document may be missed by the abovementioned query, and this is unacceptable for my application.

Using an external monotonically increasing ID generator is possible, but to guarantee the correct insertion order, all the inserts must be done sequentially, in a single thread, because otherwise there is always a chance that a process that acquired higher ID would issue an insert query earlier than a process with lower ID.

I would like to know if there are other approaches to implement this logic without using an external change stream like Kafka.

Dmitriy_Bykov · February 17, 2025, 3:36pm

I’d like to add that I prefer not to store any sort of fields within collection which should indicate if a document was already processed, because updating it every time a document is processed would lower read performance significantly.