Configuration Options
Various configuration options are available for the MongoDB Spark Connector.
Specify Configuration
Via SparkConf
You can specify these options via SparkConf using the --conf setting or the $SPARK_HOME/conf/spark-defaults.conf file, and the MongoDB Spark Connector will use the settings in SparkConf as the defaults.
Important
When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.
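For example, the following sketch shows both approaches, using the example connection string from the uri sections later on this page; databaseName and collectionName are placeholders:
./bin/spark-submit --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred" ...
Or programmatically, before creating the SparkSession:
import org.apache.spark.SparkConf

// The connector treats these SparkConf values as its defaults.
val sparkConf = new SparkConf()
  .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred")
  .set("spark.mongodb.output.uri", "mongodb://127.0.0.1/databaseName.collectionName")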
Via ReadConfig and WriteConfig
Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.
For examples, see Using a ReadConfig and Using a WriteConfig. For more details, refer to the source for these methods.
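The following is a minimal sketch of that pattern, assuming an existing SparkContext named sc and a hypothetical collection named spark; the per-operation settings override the corresponding spark.mongodb.input. and spark.mongodb.output. values in SparkConf:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.{ReadConfig, WriteConfig}

// Read with a read preference that overrides the SparkConf default.
val readConfig = ReadConfig(
  Map("collection" -> "spark", "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sc)))
val customRdd = MongoSpark.load(sc, readConfig)

// Write back with a write concern that overrides the SparkConf default.
val writeConfig = WriteConfig(
  Map("collection" -> "spark", "writeConcern.w" -> "majority"),
  Some(WriteConfig(sc)))
MongoSpark.save(customRdd, writeConfig)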
Via Options Map
In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String].
You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.
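For instance, a sketch that passes such settings to a DataFrameReader, assuming a SparkSession named spark and a ReadConfig value named readConfig like the one above:
// asOptions returns the settings as a Map[String, String] that Spark's
// DataFrameReader and DataFrameWriter can consume directly.
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .options(readConfig.asOptions)
  .load()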
Via System Property
The connector provides a cache for MongoClients which can only be configured via a system property. See Cache Configuration.
Input Configuration
You can configure the following properties to read from MongoDB:
Note
If you use SparkConf to set the connector's input configurations, prefix spark.mongodb.input. to each property.
Property name | Description |
---|---|
uri | Required. The connection string in the form mongodb://host:port/. You can append the other remaining input options to the uri setting; see uri Configuration Setting. |
database | Required. The database name from which to read data. |
collection | Required. The collection name from which to read data. |
localThreshold | The time in milliseconds to choose among multiple MongoDB servers to send a request. Default: 15 |
readPreference.name | The name of the Read Preference mode to use. Default: Primary |
readPreference.tagSets | The ReadPreference TagSets to use. |
readConcern.level | The Read Concern level to use. |
sampleSize | The sample size to use when analyzing the schema. Default: 1000 |
samplePoolSize | The size of the pool to sample from when analyzing the schema. Default: 10000 |
partitioner | The name of the partitioner to use to split collection data into partitions. Partitions are based on a range of values of a field (e.g. _id). The connector provides the following partitioners: MongoDefaultPartitioner, MongoSamplePartitioner, MongoShardedPartitioner, MongoSplitVectorPartitioner, MongoPaginateByCountPartitioner, and MongoPaginateBySizePartitioner. You can also specify a custom partitioner implementation; for custom implementations of the MongoPartitioner trait, provide the full class name. To configure options for the various partitioners, see Partitioner Configuration. Default: MongoDefaultPartitioner |
partitionerOptions | The custom options to configure the partitioner. |
registerSQLHelperFunctions | Registers SQL helper functions to allow easy querying of BSON types inside SQL queries. Default: false |
sql.inferschema.mapTypes.enabled | Enables you to analyze MapType and StructType in the data. Default: true |
sql.inferschema.mapTypes.minimumKeys | The minimum number of keys a StructType must have before the connector infers it as a MapType. Default: 250 |
sql.pipeline.includeNullFilters | Includes null filters in the aggregation pipeline. |
sql.pipeline.includeFiltersAndProjections | Includes any filters and projections in the aggregation pipeline. |
pipeline | Enables you to apply custom aggregation pipelines to the collection before sending it to Spark. |
hint | The JSON representation of the hint to use in the aggregation. |
collation | The JSON representation of a collation to use in the aggregation. The connector creates this with Collation.asDocument.toJson. |
allowDiskUse | Enables writing to temporary files during aggregation. |
batchSize | The size of the internal batches within the cursor. |
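As an illustration, a sketch passing a few of these properties directly to a DataFrameReader (assuming a SparkSession named spark; when passed as data source options the keys are used without the spark.mongodb.input. prefix):
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/databaseName.collectionName")
  .option("readPreference.name", "secondaryPreferred")
  .option("sampleSize", "1000")
  .load()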
Partitioner Configuration
MongoSamplePartitioner Configuration
Note
If you use SparkConf to set the connector's input configurations, prefix spark.mongodb.input.partitionerOptions. to each property.
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Default: 64 |
samplesPerPartition | The number of sample documents to take for each partition in order to establish a partitionKey range for each partition. A greater number of samplesPerPartition helps to find partitionKey boundaries more accurately. Note: For sampling to improve performance, samplesPerPartition must be fewer than the number of documents within each of your partitions. You can estimate the number of documents within each of your partitions by dividing partitionSizeMB by the average document size. Default: 10 |
Example
For a collection with 640 documents with an average document size of 0.5 MB, the default MongoSamplePartitioner configuration values create 5 partitions with 128 documents per partition. The MongoDB Spark Connector samples 50 documents (the default 10 per intended partition) and defines 5 partitions by selecting partitionKey ranges from the sampled documents.
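For example, a sketch of selecting this partitioner and tuning it via SparkConf (the values shown are the defaults described above):
spark.mongodb.input.partitioner=MongoSamplePartitioner
spark.mongodb.input.partitionerOptions.partitionKey=_id
spark.mongodb.input.partitionerOptions.partitionSizeMB=64
spark.mongodb.input.partitionerOptions.samplesPerPartition=10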
MongoShardedPartitioner Configuration
Note
If you use SparkConf to set the connector's input configurations, prefix spark.mongodb.input.partitionerOptions. to each property.
Property name | Description |
---|---|
shardkey | The field by which to split the collection data. The field should be indexed. Default: _id. Important: This property is not compatible with hashed shard keys. |
MongoSplitVectorPartitioner Configuration
Note
If you use SparkConf to set the connector's input configurations, prefix spark.mongodb.input.partitionerOptions. to each property.
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Default: 64 |
MongoPaginateByCountPartitioner Configuration
Note
If you use SparkConf to set the connector's input configurations, prefix spark.mongodb.input.partitionerOptions. to each property.
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
numberOfPartitions | The number of partitions to create. A greater number of partitions means fewer documents per partition. Default: 64 |
MongoPaginateBySizePartitioner Configuration
Note
If you use SparkConf to set the connector's input configurations, prefix spark.mongodb.input.partitionerOptions. to each property.
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Default: 64 |
uri Configuration Setting
You can set all Input Configuration via the input uri setting. For example, the following sets the input uri setting via SparkConf:
Note
If you use SparkConf to set the connector's input configurations, prefix spark.mongodb.input. to the setting.
spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
The configuration corresponds to the following separate configuration settings:
spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:
spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar
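Going back to the first example in this section (databaseName and collectionName), one way to inspect the values the connector resolves from SparkConf is a sketch like the following, assuming a SparkContext named sc built with that configuration:
import com.mongodb.spark.config.ReadConfig

// ReadConfig(sc) builds a read configuration from the spark.mongodb.input.* settings.
val readConfig = ReadConfig(sc)
println(readConfig.databaseName)    // databaseName
println(readConfig.collectionName)  // collectionName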
Output Configuration
The following options for writing to MongoDB are available:
Note
If you use SparkConf to set the connector's output configurations, prefix spark.mongodb.output. to each property.
Property name | Description |
---|---|
uri | Required. The connection string in the form mongodb://host:port/. Note: The other remaining options may be appended to the uri setting; see uri Configuration Setting. |
database | Required. The database name to write data to. |
collection | Required. The collection name to write data to. |
extendedBsonTypes | Enables extended BSON types when writing data to MongoDB. Default: true |
localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 |
replaceDocument | Replace the whole document when saving Datasets that contain an _id field. If false, it only updates the fields in the document that match the fields in the Dataset. Default: true |
maxBatchSize | The maximum batch size for bulk operations when saving data. Default: 512 |
writeConcern.w | The write concern w value. Default: w: 1 |
writeConcern.journal | The write concern journal value. |
writeConcern.wTimeoutMS | The write concern wTimeout value in milliseconds. |
shardKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
forceInsert | Forces saves to use inserts, even if a Dataset contains _id. Default: false |
ordered | Sets the bulk operations ordered property. Default: true |
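For example, a sketch of writing a DataFrame while overriding a couple of these properties for a single write (assuming an existing DataFrame named df; the uri value matches the example in the next section):
df.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .option("uri", "mongodb://127.0.0.1/test.myCollection")
  .option("replaceDocument", "false")
  .save()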
uri Configuration Setting
You can set all Output Configuration via the output uri setting. For example, the following sets the output uri setting via SparkConf:
Note
If you use SparkConf to set the connector's output configurations, prefix spark.mongodb.output. to the setting.
spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection
The configuration corresponds to the following separate configuration settings:
spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:
spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar
Cache Configuration
The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.
Important
Because the cache is set up before the Spark configuration is available, the cache can only be configured via a system property.
System Property name | Description |
---|---|
mongodb.keep_alive_ms | The length of time (in milliseconds) to keep a MongoClient available for sharing. Default: 5000 |
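Because mongodb.keep_alive_ms must be set as a JVM system property rather than in SparkConf, one way to set it (a sketch; 10000 is an arbitrary value) is through the driver and executor Java options at submit time:
./bin/spark-submit \
  --driver-java-options "-Dmongodb.keep_alive_ms=10000" \
  --conf "spark.executor.extraJavaOptions=-Dmongodb.keep_alive_ms=10000" \
  ...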