Parallelizing mongoexport to dump large amounts of data from multiple collections

Hi everyone!

I’ll start with what I’m trying to do and then get to the hurdles I’ve hit:

I made a Node.js mini app that:

  • Gets all the collections from my database
  • For each collection, it spawns a worker that runs a mongoexport command with a query and exports the results to a JSON file using the --jsonArray flag

The aim of this app is to export all the information related to… let’s say a customer, from all collections (rough sketch below).
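
For context, this is roughly the shape of it (a simplified sketch with a placeholder connection string, database name and query, and plain child_process.exec instead of the worker threads):

const { MongoClient } = require('mongodb');
const { exec } = require('child_process');

const uri = 'mongodb://localhost:27017';          // placeholder
const dbName = 'mydb';                            // placeholder
const customerQuery = '{"customerId": "12345"}';  // placeholder query

async function exportAllCollections() {
    const client = new MongoClient(uri);
    await client.connect();
    // list every collection in the database
    const collections = await client.db(dbName).listCollections().toArray();
    await client.close();

    for (const { name } of collections) {
        // one mongoexport per collection, filtered to the customer, written as a JSON array
        const command = `mongoexport --uri "${uri}" -d ${dbName} -c ${name} -q '${customerQuery}' --jsonArray -o ${name}.json`;
        exec(command, (error, stdout, stderr) => {
            if (error) console.error(`${name} failed: ${error.message}`);
        });
    }
}

exportAllCollections();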

The main issue I’m hitting is that the mongoexport commands work fine on their own, but when I try to run them in parallel using worker threads they all fail with:

could not connect to server: connection() : dial tcp: i/o timeout

For now I’m stuck exporting iteratively, but it would be great if I could do this asynchronously in some way, because I’m dealing with over 15 GB of data.

Another question would be whether I could parallelize AND divide the mongoexport commands into batches, for example: job1 exports the first 10,000 results, job2 the next 10,000 (10,001 to 20,000), and so on…

I would be grateful if anyone has an idea on how I can manage this. Thanks in advance!

@Tudor_Palade, please modify the snippet below to meet your needs; it seems to work fine. Try not to execute all the promises at once, as that can bring up connection issues.

const { MongoClient } = require('mongodb');
// your connection uri
const uri = 'XXX';
const client = new MongoClient(uri);
const dbName = 'sample_vector_search';
const { exec } = require("child_process");

async function main() {
    await client.connect();
    console.log('Connected successfully to server');
    const db = client.db(dbName);
    const collectionName = 'restaurant_reviews';
    const restaurant_reviews = db.collection(collectionName);
    const count = await restaurant_reviews.find({}).count();
    const tasks = [];
    const limit = 1000;
    for (let index = 0; index < count; index += limit) {
        // each batch skips `index` documents and exports the next `limit`
        tasks.push(spawnExport(uri, dbName, collectionName, `${dbName}_${collectionName}_batch${index}`, index, limit))
    }

    // Use any library of your choice to better handle the tasks
    // (process the tasks array in chunks; ideally chunk size = number of CPU cores)
    // Promise.allSettled never rejects, so inspect the results for individual batch failures
    const results = await Promise.allSettled(tasks);
    console.log(`Done: ${results.filter(r => r.status === 'rejected').length} batch(es) failed`);

    return 'done.';
}


function spawnExport(uri, db, col, filename, skip, limit) {

    return new Promise((resolve, reject) => {
        const command = `mongoexport --uri "${uri}" -d ${db} -c ${col} -o ${filename}.json --skip=${skip} --limit=${limit}`;
        console.log(command)
        exec(command, (error, stdout, stderr) => {
            if (error) {
                console.log(`error: ${error.message}`);
                return reject(error.message);
            }
            if (stderr) {
                // mongoexport writes its progress output to stderr, so log it
                // rather than treating it as a failure on its own
                console.log(`stderr: ${stderr}`);
            }
            return resolve();
        });
    });



}

main()
    .then(console.log)
    .catch(console.error)
    .finally(() => client.close());
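
If you want to actually cap how many exports run at once (the chunking mentioned in the comment above), one option is to push factory functions instead of already-started promises and run them a few at a time. runInChunks below is just a sketch of a helper, not part of any library:

// Run the task factories chunkSize at a time instead of all at once.
async function runInChunks(taskFns, chunkSize) {
    const results = [];
    for (let i = 0; i < taskFns.length; i += chunkSize) {
        // start only `chunkSize` exports, wait for them all to settle, then continue
        const running = taskFns.slice(i, i + chunkSize).map(fn => fn());
        results.push(...await Promise.allSettled(running));
    }
    return results;
}

// Inside main(), push functions instead of calling spawnExport directly:
//     taskFns.push(() => spawnExport(uri, dbName, collectionName,
//         `${dbName}_${collectionName}_batch${index}`, index, limit));
// and then:
//     await runInChunks(taskFns, require('os').cpus().length);

This keeps at most one chunk of mongoexport processes alive at a time, which is also what "don't execute all promises at once" is getting at.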

Thank you so much, I’ll give this a try!
Dividing the tasks by the total document count is a good idea, although at the moment I’m using countDocuments instead of count, and apparently it causes a lot of performance issues when querying big collections with billions of entries :confused:
But I’m pretty sure that this won’t exactly solve running 30+ exports at the same time.
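
On the counting side, one thing I might try is estimatedDocumentCount(), which reads the collection metadata instead of scanning documents, so it should stay fast even with billions of entries; it’s approximate, but that’s probably good enough just to split batches:

// metadata-based and approximate, so it avoids the scan that countDocuments does on huge collections
const count = await restaurant_reviews.estimatedDocumentCount();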

I’ll definitely try this out and compare performance with the current setup though.