FAQ

For any MongoDB deployment, the Spark Connector sets the preferred location for a DataFrame or Dataset to be where the data is.

  • For a nonsharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.

  • For a sharded system, it sets the preferred location to be the hostname(s) of the shards.

To promote data locality, we recommend taking the following actions:

  • Ensure there is a Spark Worker on one of the hosts for a nonsharded system, or one per shard for sharded systems.

  • Use a nearest read preference to read from the local mongod.

  • For a sharded cluster, run a mongos on the same nodes and use the localThreshold configuration setting to connect to the nearest mongos. To partition the data by shard, use the ShardedPartitioner configuration, as illustrated in the sketch after this list.
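
The following is a minimal Scala sketch that combines these recommendations, assuming the 10.x connector; the hosts, namespace, and localThresholdMS value are illustrative placeholders, not prescribed values. In a connection string, the localThreshold setting is expressed as the localThresholdMS option.

import org.apache.spark.sql.SparkSession

// Hypothetical hosts and namespace; adjust to your deployment.
// readPreference=nearest reads from the closest member, and
// localThresholdMS bounds which mongos instances are considered "nearest".
val spark = SparkSession.builder()
  .appName("data-locality-example")
  .config("spark.mongodb.read.connection.uri",
    "mongodb://mongos-host:27017/?readPreference=nearest&localThresholdMS=15")
  .getOrCreate()

val df = spark.read
  .format("mongodb")
  .option("database", "test")
  .option("collection", "coll")
  // Partition the data by shard.
  .option("partitioner",
    "com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner")
  .load()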

In MongoDB deployments with mixed versions of mongod, it is possible to get an Unrecognized pipeline stage name: '$sample' error, because the default partitioner relies on the $sample aggregation stage, which older mongod versions do not support. To mitigate this situation, explicitly specify which partitioner to use and explicitly define the schema when using DataFrames.
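
For example, a hedged Scala sketch of this mitigation, assuming the 10.x connector and an existing SparkSession named spark; the field names and namespace are placeholders, and PaginateBySizePartitioner is one choice of partitioner that does not sample the collection:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Supplying the schema up front avoids schema inference, and the
// PaginateBySizePartitioner pages through the collection rather than
// sampling it, so no $sample stage is issued.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val df = spark.read
  .format("mongodb")
  .schema(schema)
  .option("database", "test")
  .option("collection", "coll")
  .option("partitioner",
    "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner")
  .load()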

To use mTLS, include the following options when you run spark-submit. Note that repeated --driver-java-options flags and duplicate --conf keys override one another, so combine the system properties into a single quoted value for the driver and a single quoted value for the executors:

--driver-java-options "-Djavax.net.ssl.trustStore=<path to your truststore.jks file> \
-Djavax.net.ssl.trustStorePassword=<your truststore password> \
-Djavax.net.ssl.keyStore=<path to your keystore.jks file> \
-Djavax.net.ssl.keyStorePassword=<your keystore password>" \
--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=<path to your truststore.jks file> \
-Djavax.net.ssl.trustStorePassword=<your truststore password> \
-Djavax.net.ssl.keyStore=<path to your keystore.jks file> \
-Djavax.net.ssl.keyStorePassword=<your keystore password>"

The MongoConnector includes a cache that lets workers share a single MongoClient across threads. To specify the length of time to keep a MongoClient available, set the mongodb.keep_alive_ms system property when you run spark-submit:

--driver-java-options -Dmongodb.keep_alive_ms=<number of milliseconds to keep MongoClient available>

By default, this property has a value of 5000 (5 seconds).

Note

Because the cache is set up before the Spark Configuration is available, you must use a system property to configure it.