First off: thank you both for your time and effort in helping a MongoDB beginner get his bearings! I really appreciate it.
As for the index, I think you incur processing overhead by making it a hashed index: mongos will rehash your precomputed hash value before distributing documents. You are probably better off having a normal index on your hash field, or forgoing the hash field altogether and creating a hashed index on your userid and teamnumber fields. The former saves processing time, but the latter saves space, and saving space often saves time too, since there is less data to move and to read/write.
It was my understanding that a compound hashed index can contain only one hashed field, so only userID or teamNumber can be the hash. Right? So I'd create a hashed index like this:
db.collection.createIndex( { "userName" : "hashed", "teamNumber" : 1 } )
or, alternatively, I could just create a plain index on the precomputed hash field:
db.collection.createIndex( { "hash" : 1 } )
and then shard on it:
sh.shardCollection( "database.collection", { "hash" : 1 } )
Right?
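And if I go the compound hashed route instead and drop the precomputed hash field, I'm guessing the shard command would look something like this (just a sketch on my part; I believe compound hashed shard keys need MongoDB 4.4 or newer, and the userID field name is my placeholder):
sh.shardCollection( "database.collection", { "userID" : "hashed", "teamNumber" : 1 } )
If I've read the docs correctly, shardCollection will create the supporting index itself when the collection is empty, so I may not even need the separate createIndex call.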
Is there a reason for you to start with a sharded cluster, which is arguably the most complex deployment method and requires deeper knowledge? Could the workload not be served by a replica set with larger machines instead of a sharded cluster with smaller machines?
Because I was interested in using the sharding and replica set functionality of MongoDB. I already had a working setup with SQL Server, so I wanted to investigate the performance characteristics of a multi-shard NoSQL solution. I freely admit to being a NoSQL beginner, but standing up a single server creates an environment much like a MySQL/MSSQL/Postgres deployment, albeit with a NoSQL twist. Teaching myself the complexities of a sharded deployment while trying to get a more performant solution is the goal here.
Since you didn’t mention the sharded collection setup, and you mentioned that turning off the balancer sped things up, my guess is that the chunks were not pre-distributed across the shards before the import. The import was therefore happening while the cluster was constantly trying to rebalance itself, which slowed the import down. For a sharded collection, the recommended approach is to pre-distribute the chunks before importing (see Create Chunks in a Sharded Cluster).
You are correct. I did not pre-distribute the chunks. I’ll be reading up on chunks now, thank you.
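From a first skim of that page, I think the pre-split for my ranged { "hash" : 1 } shard key would look roughly like this (only a sketch; the boundary values and the shard name rs-shard-02 are placeholders I made up):
// pause the balancer so it doesn't fight with the import
sh.stopBalancer()
// split the empty collection at hand-picked boundaries of the hash field
db.adminCommand( { split: "database.collection", middle: { "hash" : NumberLong("1000000000") } } )
db.adminCommand( { split: "database.collection", middle: { "hash" : NumberLong("2000000000") } } )
// move one of the resulting chunks to another shard before loading any data
sh.moveChunk( "database.collection", { "hash" : NumberLong("1500000000") }, "rs-shard-02" )
// turn the balancer back on once the import finishes
sh.startBalancer()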
As @steevej mentioned earlier, the recommended setup is a 3-node replica set, for many reasons (high availability, majority writes, majority reads, sharded transaction support, etc.). Having arbiters in a replica set is not recommended unless you cannot avoid it, or unless you fully understand the implications that using arbiters entails.
I was trying to cut down on the number of instances by having a 2-node replica set instead of three, but I understand now that three nodes is the way to go.
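For my own reference, I think initiating one of those 3-node sets would look roughly like this once the three mongod processes are up (the replica set name, hostnames, and port are placeholders of mine):
rs.initiate( {
  _id : "rs-shard-01",
  members : [
    { _id : 0, host : "mongo-a.internal:27018" },
    { _id : 1, host : "mongo-b.internal:27018" },
    { _id : 2, host : "mongo-c.internal:27018" }
  ]
} )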
I guess my question is twofold: first, have you considered using a single replica set, which will more closely mimic your monolithic SQL Server setup? And second, if you really need the performance of a sharded cluster and the complexity that comes with it, have you considered using Atlas so that all of these operational concerns are managed for you?
Given all of these posts, I think I’m going to cancel my current upload effort and work with a single replica set to see how it performs. Then, if I can get it to perform close to what SQL Server does, I’ll move on to a sharded deployment.
And the reason I’m not using Atlas is that I want to learn about those operational concerns to better understand the platform from an architecture perspective. Also, I get AWS access for free, so I’m using EC2 instances for my build.