Hi all,
I’m attempting to build a distributed denormalization service, and my concerns are centered around concurrency. I have collections a, b, and c. The relationship between a and both b and c is many-to-many. I want to write the contents of b and c into a in a structure like this:
a1: {
  _id: 1,
  b: { },
  c: { }
},
a2: {
  _id: 2,
  b: { },
  c: { }
}
The pipeline I would like to write looks something like this:
a_collection.aggregate([
  {
    $match: { b: { $exists: true } }
  },
  {
    $lookup: {
      from: "b_collection",
      localField: "b", // overly simplified example; b on the local doc would be an id
      foreignField: "_id",
      as: "b"
    }
  },
  {
    $merge: {
      into: "a_collection",
      on: "_id"
    }
  }
])
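For completeness, the fuller version I have in mind (still simplified; the c_collection name and the whenMatched/whenNotMatched choices are just my assumptions) would add a second $lookup for c and spell out the $merge behaviour:

a_collection.aggregate([
  { $match: { b: { $exists: true }, c: { $exists: true } } },
  { $lookup: { from: "b_collection", localField: "b", foreignField: "_id", as: "b" } },
  { $lookup: { from: "c_collection", localField: "c", foreignField: "_id", as: "c" } },
  { $merge: {
      into: "a_collection",
      on: "_id",
      whenMatched: "merge",     // keep existing fields, overwrite the looked-up ones
      whenNotMatched: "discard" // every doc already exists in a_collection, so nothing new to insert
  } }
])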
My understanding is that single-document writes via findOneAndUpdate are atomic and optimistically concurrent, but I’m looking for something like a findManyAndUpdateMany that would provide the same guarantees across many documents.
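For reference, this is the single-document pattern I mean (updatedB is just a placeholder for a b document I’d have fetched separately beforehand):

a_collection.findOneAndUpdate(
  { _id: 1 },
  { $set: { b: updatedB } },   // atomic for this one document
  { returnDocument: "after" }
)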
I’ve tried an aggregation pipeline with $lookup and $merge stages, but the documentation states that aggregation pipelines only take a read (intent shared) lock. Does the $merge stage apply a write (intent exclusive) lock? Does it apply it only at the final stage, or does Mongo detect the update and apply the lock to all the docs matched earlier in the pipeline? Or does it take locks as it processes each individual document in the pipeline?
I’ve also tried using update(), but $lookup isn’t allowed there, so I’m not able to look up from b and c in one atomic operation.
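Roughly what I tried is below; pipeline-style updates only allow stages like $set/$unset/$replaceRoot, so there’s no way to join in b and c here (the denormalizedAt field is just a stand-in):

a_collection.updateMany(
  { b: { $exists: true } },
  [
    // placeholder stage; what I actually need is the joined b/c data,
    // but $lookup isn't a valid stage in an update pipeline
    { $set: { denormalizedAt: "$$NOW" } }
  ]
)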
Is there another approach I should be taking? Should I just use a transaction? Is there a way for me to explicitly/implicitly apply different locks?
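If a transaction is the way to go, the rough shape I have in mind is below (Node driver style, untested, and the per-document round trips worry me):

const session = client.startSession();
try {
  await session.withTransaction(async () => {
    const aDocs = await a_collection.find({ b: { $exists: true } }, { session }).toArray();
    for (const doc of aDocs) {
      // read the referenced b and c docs inside the same transaction
      const bDoc = await b_collection.findOne({ _id: doc.b }, { session });
      const cDoc = await c_collection.findOne({ _id: doc.c }, { session });
      await a_collection.updateOne(
        { _id: doc._id },
        { $set: { b: bDoc, c: cDoc } },
        { session }
      );
    }
  });
} finally {
  await session.endSession();
}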
Thanks in advance!