Dedup aggregation keeps failing on large time-series collection

I have a large time-series collection with about 1.5 billion documents. Due to some migration issues, I have an unknown number of duplicates in that collection and I need to remove them.

In one collection, `apcharts_mig`, the duplicates have the same `_id` because of how they were migrated (and time-series collections cannot have unique indexes to prevent that). In the `rankings` collection, the `_id` values differ, but the duplicates share the same `meta` object and `date` (timestamp) value.
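
For illustration, here is roughly what a duplicate pair looks like in `rankings` (the field names inside `meta` and all of the values are made up for the example; only the shape matters):

```javascript
// Two "duplicate" documents in rankings: different _id values, but the same
// meta object and the same date (timestamp). Field values here are invented.
const exampleDuplicates = [
  { _id: ObjectId("65a000000000000000000001"), meta: { chart: "top100", region: "us" }, date: ISODate("2024-01-01T00:00:00Z"), rank: 7 },
  { _id: ObjectId("65a000000000000000000002"), meta: { chart: "top100", region: "us" }, date: ISODate("2024-01-01T00:00:00Z"), rank: 7 },
]

// In apcharts_mig, by contrast, the duplicates share the exact same _id.
```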

I've been trying all kinds of ways to make the dedup work, but it keeps failing on these two collections because of how many documents they each have. However, I can run the same commands against a much smaller collection and they work fine.

Here are my latest attempts, run directly in mongosh connected to localhost on the hosting server, which runs a standalone MongoDB instance:

```javascript
async function deleteDups(collection, batchSize = 1_000) {
  let processed = 0
  let cursor = null

  try {
    while (true) {
      // Use allowDiskUse so the $group stage can spill to disk on this large dataset
      cursor = await db[collection]
        .aggregate(
          [
            {
              $group: {
                _id: "$_id",
                count: { $sum: 1 },
                docs: { $push: "$_id" },
              },
            },
            {
              $match: {
                count: { $gt: 1 },
              },
            },
            {
              $limit: batchSize,
            },
          ],
          {
            allowDiskUse: true,
          }
        )
        .toArray()

      if (!cursor || cursor.length === 0) break

      // Process this batch
      for (const doc of cursor) {
        try {
          console.log(`Processing duplicates for ID: ${doc._id}`)
          // Skip the first document (to keep one) and delete the rest
          for (const duplicateId of doc.docs.slice(1)) {
            await db[collection].deleteOne({ _id: duplicateId })
          }
          processed++

          if (processed % 100 === 0) {
            console.log(`Processed ${processed} sets of duplicates`)
          }
        } catch (err) {
          console.error(`Error processing document ${doc._id}:`, err)
          continue
        }
      }

      // Increase delay between batches
      await new Promise((resolve) => setTimeout(resolve, 1000))
    }
  } catch (err) {
    console.error("Aggregation error:", err)
  }

  console.log(`Completed processing ${processed} total sets of duplicates`)
}

// Call the function
deleteDups("apcharts_mig")

async function deleteDupsByMeta(collection, batchSize = 1_000) {
  let processed = 0

  while (true) {
    // Find a batch of duplicate records with the same meta and date
    let duplicateIds = await db[collection]
      .aggregate([
        {
          $group: {
            _id: {
              meta: "$meta",
              date: "$date",
            },
            count: { $sum: 1 },
            docs: { $push: "$_id" },
          },
        },
        {
          $match: {
            count: { $gt: 1 },
          },
        },
        {
          $limit: batchSize,
        },
      ])
      .toArray()

    if (duplicateIds.length === 0) break

    // Process this batch
    for (const doc of duplicateIds) {
      console.log(`Found duplicates for meta/date:`, doc._id)
      // Skip the first document (to keep one) and delete the rest
      for (const duplicateId of doc.docs.slice(1)) {
        await db[collection].deleteOne({ _id: duplicateId })
      }
      processed++

      if (processed % 100 === 0) {
        console.log(`Processed ${processed} sets of duplicates`)
      }
    }

    // Add a small delay between batches to prevent overwhelming the connection
    await new Promise((resolve) => setTimeout(resolve, 100))
  }

  console.log(`Completed processing ${processed} total sets of duplicates`)
}

// Call the function
deleteDupsByMeta("rankings")
```


Both of these functions will run for several minutes on their respective collections but then fail with a message like `MongoNetworkError: connection 6 to 127.0.0.1:5017 closed`.

What can I do to make these operations work and remove my duplicates?

(And do you see any other problems with this? I *think* the `batchSize` is okay, and that it only limits how many sets of duplicates are returned per batch, not how many records are searched for duplicates, right?)
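
To make that concrete, here is how I'm reading the stage order (same pipeline as in `deleteDups`, with the collection name and limit hard-coded just for illustration):

```javascript
// My understanding of what each stage does:
db.apcharts_mig.aggregate(
  [
    // 1. Groups every document in the collection by _id, so the whole
    //    collection still gets scanned regardless of batchSize.
    { $group: { _id: "$_id", count: { $sum: 1 }, docs: { $push: "$_id" } } },
    // 2. Keeps only the groups that actually contain duplicates.
    { $match: { count: { $gt: 1 } } },
    // 3. Caps how many duplicate *groups* come back, i.e. batchSize limits
    //    the results, not how many documents are examined.
    { $limit: 1000 },
  ],
  { allowDiskUse: true }
)
```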

This is on a Hetzner VPS with 8 vCPUs, 16 GB of RAM, and 4 GB of swap. I watch resource usage with htop, and these operations never max out any resource, so I don't think I need to scale vertically.

And just to illustrate how hard these massive collections are to work with: I can't even run `countDocuments()` on them without it failing.
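
For example, even these bare counts fail:

```javascript
// Neither of these returns a count on the large collections; both eventually fail.
db.apcharts_mig.countDocuments()
db.rankings.countDocuments()
```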

Thanks in advance!