I have not tried to replicate the same using different driver. However, I had used pymongo to perform the same operation with 1 client thread and that seems to work okay.
That indicates Mongo server might be overwhelmed which ultimately caused the increased in latency. Although, I’m still puzzled with 20k (ms) latency when handling 256 concurrent clients.