Spark Connector R Guide
Note
Source Code
For the source code that contains the examples below, see introduction.R.
Prerequisites
Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation, Spark documentation, and this MongoDB white paper for more details.
Running MongoDB instance (version 2.6 or later).
Spark 2.4.x.
Scala 2.12.x.
Getting Started
sparkR Shell
This tutorial uses the sparkR shell, but the code examples work just as well with self-contained R applications.
When starting the sparkR shell, you can specify:

- the --packages option to download the MongoDB Spark Connector package. The following package is available: mongo-spark-connector_2.12 for use with Scala 2.12.x
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.

Note
When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.
For example,

./bin/sparkR --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
             --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
             --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.2
- The spark.mongodb.input.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection from which to read data (myCollection), and the read preference.
- The spark.mongodb.output.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection to which to write data (myCollection). Connects to port 27017 by default.
- The --packages option specifies the Spark Connector's Maven coordinates, in the format groupId:artifactId:version.
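Both URIs use the standard MongoDB connection string format, so a deployment that requires authentication or listens on a non-default port can be described in the URI itself. A hypothetical example (the user, password, host, and port are placeholders, not values from this guide):

mongodb://exampleUser:examplePass@mongodb0.example.com:27018/test.myCollection?readPreference=primaryPreferred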
Create a SparkSession Object
Note
When you start sparkR you get a SparkSession object called spark by default. In a standalone R application, you need to create your SparkSession object explicitly, as shown below.
If you specified the spark.mongodb.input.uri and spark.mongodb.output.uri configuration options when you started sparkR, the default SparkSession object uses them.
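To confirm which values the default session picked up, you can query its runtime configuration with SparkR's sparkR.conf() function. A quick sketch, assuming the options from the shell example above:

# Inspect the connector settings on the active SparkSession
sparkR.conf("spark.mongodb.input.uri")
sparkR.conf("spark.mongodb.output.uri")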
If you'd rather create your own SparkSession object from within sparkR, you can use sparkR.session() and specify different configuration options.
my_spark <- sparkR.session(
  master = "local[*]",
  sparkConfig = list(),
  appName = "my_app"
)
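The sparkConfig argument takes a named list of Spark properties, so the connector settings can be passed there instead of on the command line. A minimal sketch, assuming the same local deployment as the shell example above (the app name and variable name are arbitrary):

my_spark <- sparkR.session(
  master = "local[*]",
  appName = "my_app",
  sparkConfig = list(
    # Same configuration keys as the --conf flags shown earlier
    "spark.mongodb.input.uri" = "mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred",
    "spark.mongodb.output.uri" = "mongodb://127.0.0.1/test.myCollection"
  ),
  # Downloads the connector at startup, like --packages on the command line
  sparkPackages = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2"
)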
You can use a SparkSession object to write data to MongoDB, read data from MongoDB, create DataFrames, and perform SQL operations.
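As a quick end-to-end sketch, the following writes a small DataFrame to the collection configured above and reads it back. The sample data is illustrative, and the source class (com.mongodb.spark.sql.DefaultSource) is the connector's data source for this connector version:

# Build an example DataFrame and write it to the collection
# named in spark.mongodb.output.uri
charactersRdf <- data.frame(list(name = c("Bilbo Baggins", "Gandalf"),
                                 age = c(50, 1000)))
charactersSparkdf <- createDataFrame(charactersRdf)
write.df(charactersSparkdf, "", source = "com.mongodb.spark.sql.DefaultSource",
         mode = "overwrite")

# Read the collection named in spark.mongodb.input.uri back
# into a DataFrame and inspect it
df <- read.df("", source = "com.mongodb.spark.sql.DefaultSource")
printSchema(df)
head(df)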