Hi everyone,
I’m using the MongoDB Spark connector (10.3.0) to copy data from BigQuery to MongoDB: about 100M small records (~250 bytes each). Besides the default index on _id (ObjectId), there is one compound index on two fields.
With an M10 cluster and “insert”-only writes, I sometimes got 10K+ inserted documents per second, but most of the time only ~2K/sec.
With “replace + upsert” writes, I usually got only 0.6–0.7K/sec.
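For context, the write configuration looks roughly like the PySpark sketch below. This is only a sketch: the connection URI, database/collection names, and the BigQuery table are placeholders, and I’m assuming the standard 10.x write options (operationType, upsertDocument) to distinguish the two modes.

```python
# Rough PySpark sketch of the two write modes I tried (Spark connector 10.x).
# URI, database, collection, and table names below are placeholders.
df = (spark.read.format("bigquery")
      .option("table", "my_project.my_dataset.my_table")  # placeholder BigQuery table
      .load())

common_opts = {
    "connection.uri": "mongodb+srv://user:pass@cluster0.example.mongodb.net",
    "database": "mydb",
    "collection": "mycoll",
}

# Mode 1: insert-only writes
(df.write.format("mongodb")
   .mode("append")
   .options(**common_opts)
   .option("operationType", "insert")
   .save())

# Mode 2: replace + upsert (replaces the matched document, inserts if no match)
(df.write.format("mongodb")
   .mode("append")
   .options(**common_opts)
   .option("operationType", "replace")
   .option("upsertDocument", "true")
   .save())
```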
When performance was bad (~2K/sec), CPU IOWAIT was usually high, between 80% and 130%.
I scaled the cluster up to M30. Although CPU IOWAIT dropped to about 50%, throughput did not improve.
My question is: if I want to finish upserting this amount of data in a short period of time, say 1–2 hours, what can I do? And what bulk write throughput should I typically expect from an M10 or M30 cluster?
P.S.: I’m running Spark on serverless GCP Dataproc. There’s hardly any transformation involved, just reading from BigQuery and writing to MongoDB.
Thank you