About insertMany (or bulkWrite insertOne) in a hashed shard environment

I am testing insertMany (or bulkWrite insertOne) in a hashed shard environment.

My environment is a 5-node cluster with 5 shards (3 members each):
"shard1/node1:27021,node2:27021,node3:27021",
"shard2/node2:27022,node3:27022,node4:27022",
"shard3/node3:27023,node4:27023,node5:27023",
"shard4/node1:27024,node4:27024,node5:27024",
"shard5/node1:27025,node2:27025,node5:27025"
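
For context, a topology like this would typically be registered with mongos along these lines (just a sketch, assuming the replica set names match the shard names listed above):

sh.addShard("shard1/node1:27021,node2:27021,node3:27021")
sh.addShard("shard2/node2:27022,node3:27022,node4:27022")
sh.addShard("shard3/node3:27023,node4:27023,node5:27023")
sh.addShard("shard4/node1:27024,node4:27024,node5:27024")
sh.addShard("shard5/node1:27025,node2:27025,node5:27025")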


I am trying to study the difference between range-key and hashed-key sharding.

##chunksize
use config
db.settings.updateOne({ _id: "chunksize" }, { $set: { value: 5 } }, { upsert: true })
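
The value is in megabytes. A quick check that the setting landed (run through mongos, while still on the config database):

db.settings.find({ _id: "chunksize" })   // expect { _id: "chunksize", value: 5 }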

##Range key
sh.shardCollection("test.testloop", { "key": 1 }, true)

##Hashed key
sh.shardCollection("test.testloop", { _id: "hashed" })
sh.shardCollection("test.testloop", { "key": "hashed" })
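
One thing worth checking before comparing the two: depending on the MongoDB version, sharding may need to be enabled explicitly on the database before shardCollection succeeds, and for a hashed key you can pre-split chunks at creation time so that every shard owns chunks before the first insert. A sketch (numInitialChunks is an assumption about how many initial chunks you want; it only applies to hashed keys on empty collections and its behaviour varies by version):

sh.enableSharding("test")
sh.shardCollection("test.testloop", { key: "hashed" }, false, { numInitialChunks: 10 })
// with a hashed key, the initial chunks are created and distributed across the shards immediately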


And I find that insertMany (or bulkWrite insertOne) is much slower in the hashed-key environment than in the range-key environment.
I want to know if this is normal. According to the documentation, the hashed-key environment should have good write performance.

#insertMany
var loop = []
var feed = (min, max) => {
  for (var i = min; i <= max; ++i) {
    loop.push({ key: i })
  }
  db.testloop.insertMany(loop)
}
feed(1,100000)

#bulkWrite
var loop = []
var feed = (min, max) => {
  for (var i = min; i <= max; ++i) {
    // each insertOne operation wraps the document in a "document" field
    loop.push({ insertOne: { document: { key: i } } })
  }
  db.testloop.bulkWrite(loop, { ordered: false })  // ordered is a boolean, not a string
}
feed(1,100000)
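
For reference, the per-shard document and chunk counts after each run can be checked from mongos with:

db.testloop.getShardDistribution()   // prints docs, data size and chunks per shard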

I suspect that the fact that your keys are consecutive is detrimental to the performance of one and beneficial to the performance of the other.

I can only suggest trying with random keys.
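
For example, something along these lines (a sketch; the random values are only there to break the monotonic pattern, not a proper benchmark):

var loop = []
var feed = (n) => {
  for (var i = 0; i < n; ++i) {
    // non-consecutive keys instead of 1..n
    loop.push({ key: Math.floor(Math.random() * 100000000) })
  }
  db.testloop.insertMany(loop)
}
feed(100000)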

Where did you read this? Can you share the link?

I doubt there is any big difference between hashed and range sharding; after all, hashing just produces a value (a string/integer), and that value can still be compared to decide the "range".

If you mean using "_id", I have tried that and the performance is still much slower than with a range key.
Both "_id" and a field named "key" were tested as the shard key in this case.

##Hashed key
sh.shardCollection("test.testloop", { _id: "hashed" })
sh.shardCollection("test.testloop", { "key": "hashed" })


ref: https://www.mongodb.com/docs/manual/core/hashed-sharding/

The field you choose as your hashed shard key should have a good cardinality, or large number of different values. Hashed keys are ideal for shard keys with fields that change monotonically like ObjectId values or timestamps. A good example of this is the default "_id" field, assuming it only contains ObjectId values.

For hashed sharding, the workload should be separated across different shards (I know the trade-off is Targeted Operations vs. Broadcast Operations, but my case is insert-only). For my environment (a 5-node cluster), I expect the insert workload to be spread over 5 different hosts, whereas with a range key the insert workload is focused on one host only.
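
One rough way to confirm the chunks really end up on all five shards (note: on newer versions config.chunks references collections by UUID rather than namespace, so this simply counts chunks of every sharded collection per shard):

use config
db.chunks.aggregate([{ $group: { _id: "$shard", chunks: { $sum: 1 } } }])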

Yes. But the truth is that the hashed shard key takes 3-4 minutes while the range key takes only about 10 seconds for the same-size insert in my test.


ref: https://www.mongodb.com/docs/manual/core/hashed-sharding/

Hashed sharding provides a more even data distribution across the sharded cluster at the cost of reducing Targeted Operations vs. Broadcast Operations. Post-hash, documents with "close" shard key values are unlikely to be on the same chunk or shard - the mongos is more likely to perform Broadcast Operations to fulfill a given ranged query. mongos can target queries with equality matches to a single shard.
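
As a side note, an equality match on the hashed key should still be targeted at a single shard. If you want to verify that on your cluster, the explain output lists which shard(s) were consulted (the exact structure varies by version):

db.testloop.find({ key: 12345 }).explain("queryPlanner")
// a single shard should appear for the equality match, whereas a range
// predicate such as { key: { $gt: 0, $lt: 100 } } fans out to all shards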

There is something strange here, as I am very surprised to see 3-4 minutes for the hashed key yet only about 10 seconds for the range key.

Where is your mongos running?

Where are the config server replica set instances running?

You run multiple mongod instances on the same machine, which is not really realistic. Do you know which instance of which shard was the primary at the time of the test? Anyway, when replication and write-heavy use cases are involved, all mongod instances have to handle the same workload, because every node replicates the write operations. I suspect that in the hashed-key case your configuration is struggling with context switches because every shard is equally involved, while with a range key only one shard is involved at a time, so there are fewer context switches and the workload fits the hardware better.
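
If you want to check where the primaries were, a quick sketch from mongosh (assuming no authentication; the hosts listed are one member of each of your shards, and the hello command reports the current primary of that replica set):

["node1:27021", "node2:27022", "node3:27023", "node1:27024", "node1:27025"].forEach(host => {
  var conn = new Mongo(host)
  var hello = conn.getDB("admin").runCommand({ hello: 1 })
  print(host + " -> primary is " + hello.primary)
})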

I am pretty sure you would be better off without sharding.

In principle, you run only 1 instance of mongod on each machine so that it gets all the RAM and CPU to itself. Running multiple instances on the same hardware is detrimental because the cache of each instance is smaller and context switches increase.

To run a 5-shard cluster, the minimum number of machines would be 5 (shards) * 3 + 3 (config servers) + 1 (mongos) = 19 machines.