Spark Connector Scala Guide
Note
Source Code
For the source code that contains the examples below, see Introduction.scala.
Prerequisites
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation, Spark documentation, and this MongoDB white paper for more details.
Running MongoDB instance (version 2.6 or later).
Spark 3.0.x.
Scala 2.12.x.
Getting Started
Spark Shell
When starting the Spark shell, specify:

the --packages option to download the MongoDB Spark Connector package. The following package is available: mongo-spark-connector_2.12 for use with Scala 2.12.x

the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.

Note
When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.

For example,

```shell
./bin/spark-shell --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
                  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
                  --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2
```
The spark.mongodb.input.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection (myCollection) from which to read data, and the read preference.

The spark.mongodb.output.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection (myCollection) to which to write data. The connector connects to port 27017 by default.

The --packages option specifies the Spark Connector's Maven coordinates, in the format groupId:artifactId:version.
Import the MongoDB Connector Package
Enable MongoDB Connector-specific functions and implicits for the SparkSession and RDD (Resilient Distributed Dataset) by importing the following package in the Spark shell:

```scala
import com.mongodb.spark._
```
Connect to MongoDB
Connection to MongoDB happens automatically when an RDD action requires a read from MongoDB or a write to MongoDB.
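As an illustrative sketch of this lazy behavior (not taken verbatim from Introduction.scala), the snippet below builds an RDD of documents, then writes and reads via the connector's MongoSpark.save and MongoSpark.load helpers. No connection is made while the RDD is defined; MongoDB is contacted only when the save and count actions run. It assumes a MongoDB instance listening on 127.0.0.1:27017.

```scala
import com.mongodb.spark._
import org.apache.spark.sql.SparkSession
import org.bson.Document

// Assumes the input/output URIs shown earlier; adjust to your deployment.
val spark = SparkSession.builder()
  .master("local")
  .appName("ConnectSketch")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
  .getOrCreate()

val sc = spark.sparkContext

// Defining the RDD is lazy: no connection to MongoDB is opened here.
val docs = sc.parallelize((1 to 10).map(i => Document.parse(s"{ spark: $i }")))

// The write to MongoDB happens here, when the save action executes.
MongoSpark.save(docs)

// Likewise, the read happens only when count() forces evaluation.
val rdd = MongoSpark.load(sc)
println(rdd.count)
```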
Self-Contained Scala Application
Dependency Management
Provide the Spark Core, Spark SQL, and MongoDB Spark Connector dependencies to your dependency management tool.
The following excerpt demonstrates how to include these dependencies in an SBT build.scala file:

```scala
scalaVersion := "2.12.13",
libraryDependencies ++= Seq(
  "org.mongodb.spark" %% "mongo-spark-connector" % "3.0.2",
  "org.apache.spark" %% "spark-core" % "3.0.2",
  "org.apache.spark" %% "spark-sql" % "3.0.2"
)
```
Configuration
When specifying the Connector configuration via SparkSession, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.

```scala
package com.mongodb

object GettingStarted {
  def main(args: Array[String]): Unit = {
    /* Create the SparkSession.
     * If config arguments are passed from the command line using --conf,
     * parse args for the values to set.
     */
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local")
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
      .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
      .getOrCreate()
  }
}
```
MongoSpark Helper
If you require granular control over your configuration, then the MongoSpark companion object provides a builder() method for configuring all aspects of the Mongo Spark Connector. It also provides methods to create an RDD, DataFrame, or Dataset.
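As a sketch of the builder in use (an illustration, not an excerpt from the official examples), the snippet below pairs MongoSpark.builder() with a ReadConfig that overrides the collection and read preference for a single read; the option keys follow the connector's Configuration Options, and "otherCollection" is a placeholder name.

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("BuilderSketch")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
  .getOrCreate()

// Override the collection and read preference for this read only,
// falling back to the session's configuration for everything else.
val readConfig = ReadConfig(
  Map("collection" -> "otherCollection",
      "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(spark)))

// build() yields a MongoSpark instance exposing toRDD(), toDF(), and toDS().
val df = MongoSpark.builder()
  .sparkSession(spark)
  .readConfig(readConfig)
  .build()
  .toDF()

df.printSchema()
```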
Troubleshooting
If you get a java.net.BindException: Can't assign requested address,

Check to ensure that you do not have another Spark shell already running.

Try setting the SPARK_LOCAL_IP environment variable; e.g.

```shell
export SPARK_LOCAL_IP=127.0.0.1
```

Try including the following option when starting the Spark shell:

```shell
--driver-java-options "-Djava.net.preferIPv4Stack=true"
```
If you have errors running the examples in this tutorial, you may need to clear your local Ivy cache (~/.ivy2/cache/org.mongodb.spark and ~/.ivy2/jars).