Spark Connector R Guide
Note
Source Code
For the source code that contains the examples below, see introduction.R.
Prerequisites
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation, Spark documentation, and this MongoDB white paper for more details.
Running MongoDB instance (version 2.6 or later).
Spark 2.4.x.
Scala 2.12.x.
Getting Started
sparkR Shell
This tutorial uses the sparkR shell, but the code examples work just as well with self-contained R applications.
When starting the sparkR shell, you can specify:

- the --packages option to download the MongoDB Spark Connector package. The following package is available: mongo-spark-connector_2.12 for use with Scala 2.12.x
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.

Note
When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.
For example,

./bin/sparkR --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
             --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
             --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2
- The spark.mongodb.input.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection from which to read data (myCollection), and the read preference.
- The spark.mongodb.output.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection to which to write data (myCollection). Connects to port 27017 by default.
- The --packages option specifies the Spark Connector's Maven coordinates, in the format groupId:artifactId:version.
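Both URIs use the standard MongoDB connection string format, so a deployment that requires authentication or listens on a non-default port can be described in the URI itself. A hypothetical example (the user, password, host, and port are placeholders, not values from this guide):

mongodb://exampleUser:examplePass@mongodb0.example.com:27018/test.myCollection?readPreference=primaryPreferred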
Create a SparkSession Object
Note
When you start sparkR you get a SparkSession object called spark by default. In a standalone R application, you need to create your SparkSession object explicitly, as shown below.
If you specified the spark.mongodb.input.uri and spark.mongodb.output.uri configuration options when you started sparkR, the default SparkSession object uses them.
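To confirm which values the default session picked up, you can query its runtime configuration with SparkR's sparkR.conf() function. A quick sketch, assuming the options from the shell example above:

# Inspect the connector settings on the active SparkSession
sparkR.conf("spark.mongodb.input.uri")
sparkR.conf("spark.mongodb.output.uri")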
If you'd rather create your own SparkSession object from within sparkR, you can use sparkR.session() and specify different configuration options.
my_spark <- sparkR.session(
  master = "local[*]",
  sparkConfig = list(),
  appName = "my_app"
)
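The sparkConfig argument takes a named list of Spark properties, so the connector settings can be passed there instead of on the command line. A minimal sketch, assuming the same local deployment as the shell example above (the app name and variable name are arbitrary):

my_spark <- sparkR.session(
  master = "local[*]",
  appName = "my_app",
  sparkConfig = list(
    # Same configuration keys as the --conf flags shown earlier
    "spark.mongodb.input.uri" = "mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred",
    "spark.mongodb.output.uri" = "mongodb://127.0.0.1/test.myCollection"
  ),
  # Downloads the connector at startup, like --packages on the command line
  sparkPackages = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2"
)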
You can use a SparkSession object to write data to MongoDB, read data from MongoDB, create DataFrames, and perform SQL operations.
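As a quick end-to-end sketch, the following writes a small DataFrame to the collection configured above and reads it back. The sample data is illustrative, and the source class (com.mongodb.spark.sql.DefaultSource) is the connector's data source for this connector version:

# Build an example DataFrame and write it to the collection
# named in spark.mongodb.output.uri
charactersRdf <- data.frame(list(name = c("Bilbo Baggins", "Gandalf"),
                                 age = c(50, 1000)))
charactersSparkdf <- createDataFrame(charactersRdf)
write.df(charactersSparkdf, "", source = "com.mongodb.spark.sql.DefaultSource",
         mode = "overwrite")

# Read the collection named in spark.mongodb.input.uri back
# into a DataFrame and inspect it
df <- read.df("", source = "com.mongodb.spark.sql.DefaultSource")
printSchema(df)
head(df)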