Hi everyone!
I’ll start with what I’m trying to do and then get to the hurdles I’ve hit:
I made a Node.js mini app that:
- Gets all the collections from my database
- For each collection, it spawns a worker that runs a mongoexport command with a query, exporting to a JSON file using the jsonArray flag
The aim of this app is to export all the information related to… let’s say a customer, from all collections.
The main issue I’m running into is that the mongoexport commands work on their own, but when I try to run them in parallel using worker threads they all fail with:
could not connect to server: connection() : dial tcp: i/o timeout
For now I’m stuck exporting iteratively, but it would be great if I could do this asynchronously in some way, because I’m dealing with over 15 GB of data.
Another question is whether I could parallelize AND divide the mongoexport commands into batches, for example: job1 exports the first 10000 results, job2 exports results 10001 to 20000, and so on…
I would be grateful if anyone has an idea on how I can manage this. Thanks in advance!
@Tudor_Palade, please modify the snippet below to meet your needs. It seems to work fine. Try not to execute all the promises at once; running everything in parallel can bring up connection issues.
const { MongoClient } = require('mongodb');
const { exec } = require('child_process');

// your connection uri
const uri = 'XXX';
const client = new MongoClient(uri);
const dbName = 'sample_vector_search';
async function main() {
  await client.connect();
  console.log('Connected successfully to server');
  const db = client.db(dbName);
  const collectionName = 'restaurant_reviews';
  const restaurant_reviews = db.collection(collectionName);
  // Note: cursor.count() is deprecated in newer drivers; countDocuments()
  // or estimatedDocumentCount() are the modern alternatives.
  const count = await restaurant_reviews.find({}).count();
  const tasks = [];
  const limit = 1000;
  for (let index = 0; index < count; index = index + limit) {
    // Each task exports one batch of `limit` documents starting at offset `index`.
    tasks.push(spawnExport(uri, dbName, collectionName, `${dbName}_${collectionName}_batch${index}`, index, limit));
  }
  // Use any library of your choice to better handle the tasks.
  // Ideally, process the tasks array in chunks (chunk size = number of cores);
  // see the sketch after the snippet.
  await Promise.allSettled(tasks);
  console.log('Done');
  return 'done.';
}
function spawnExport(uri, db, col, filename, skip, limit) {
  return new Promise((resolve, reject) => {
    // Note: without a --sort option the batch boundaries are only as stable
    // as the collection's natural order.
    const command = `mongoexport --uri="${uri}" -d ${db} -c ${col} -o ${filename}.json --skip=${skip} --limit=${limit}`;
    console.log(command);
    exec(command, (error, stdout, stderr) => {
      if (error) {
        console.log(`error: ${error.message}`);
        return reject(error.message);
      }
      if (stderr) {
        // mongoexport writes its progress messages to stderr, so this is not
        // necessarily a failure; log it instead of rejecting.
        console.log(`stderr: ${stderr}`);
      }
      return resolve();
    });
  });
}
main()
.then(console.log)
.catch(console.error)
.finally(() => client.close());
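As a follow-up to the comment about processing the tasks in chunks, here is a minimal sketch of a chunked runner. Note that for it to actually limit concurrency, the loop above would need to push functions that start an export, e.g. () => spawnExport(...), rather than promises that are already running; the processInChunks helper and the chunk size below are illustrative, not part of any library.
// Minimal sketch: run the exports a few at a time instead of all at once.
// `taskFactories` is expected to be an array of functions that each return a promise.
async function processInChunks(taskFactories, chunkSize) {
  const results = [];
  for (let i = 0; i < taskFactories.length; i += chunkSize) {
    // Start only `chunkSize` exports, wait for them to settle, then move on.
    const chunk = taskFactories.slice(i, i + chunkSize).map(fn => fn());
    results.push(...await Promise.allSettled(chunk));
  }
  return results;
}
// Example usage inside main(), one export per CPU core at a time:
// await processInChunks(tasks, require('os').cpus().length);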
Thank you so much, I’ll give this a try!
Dividing the tasks by the total count of documents is a good idea, although at the moment I am using countDocuments instead of count, and apparently it causes a lot of performance issues when querying big collections with billions of entries.
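(If an exact total isn’t strictly needed for sizing the batches, estimatedDocumentCount() reads the count from collection metadata instead of scanning documents, which is much cheaper on very large collections. A quick sketch, assuming the same collection handle as in the snippet above:)
// Approximate total from collection metadata; much faster than countDocuments()
// on huge collections, at the cost of possible slight inaccuracy.
const approxCount = await restaurant_reviews.estimatedDocumentCount();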
But I’m pretty sure that this won’t exactly solve running 30+ exports at the same time.
I’ll definitely try this out and compare performance with the current setup though.