For any MongoDB deployment, the Mongo Spark Connector sets the preferred location for an RDD to be where the data is:

  • For a non-sharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.

  • For a sharded system, it sets the preferred location to be the hostname(s) of the shards.
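One way to see which hosts the connector reports is to ask Spark for the preferred locations of each partition. The following is a minimal sketch, assuming a 2.x connector (MongoSpark.load and the spark.mongodb.input.uri setting). The URI, database, and collection names are placeholders, and locationsByPartition is a hypothetical helper, not part of the connector.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper: pairs each partition index with the hosts Spark prefers for it.
def locationsByPartition(rdd: RDD[_]): Seq[(Int, Seq[String])] =
  rdd.partitions.toSeq.map(p => (p.index, rdd.preferredLocations(p)))

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("preferred-locations")
  // Placeholder URI (assumption): point this at your own deployment.
  .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test.myCollection")
  .getOrCreate()

// Load the collection as an RDD and print the preferred hosts per partition.
val rdd = MongoSpark.load(spark.sparkContext)
locationsByPartition(rdd).foreach { case (index, hosts) =>
  println(s"partition $index -> ${hosts.mkString(", ")}")
}
```

Spark treats these hostnames as scheduling hints only, so tasks run next to the data only when an executor is available on one of the listed hosts.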

To promote data locality,

Spark streams can be considered a potentially infinite source of RDDs. Therefore, anything you can do with an RDD, you can also do with the results of a Spark stream.

For an example, see SparkStreams.scala.
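As an illustration of treating each micro-batch as an RDD, the sketch below counts words from a socket stream and saves every batch with MongoSpark.save. It is not the contents of SparkStreams.scala. It assumes a 2.x connector, a text source on localhost:9999, and a placeholder output URI. toDocument is a hypothetical helper.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.bson.Document

// Hypothetical helper: turns one (word, count) pair into a BSON document.
def toDocument(word: String, count: Int): Document =
  new Document("word", word).append("count", Int.box(count))

val spark = SparkSession.builder()
  .master("local[2]") // at least two threads: one receiver, one processor
  .appName("streams-to-mongodb")
  // Placeholder URI (assumption): point this at your own deployment.
  .config("spark.mongodb.output.uri", "mongodb://localhost:27017/test.wordCounts")
  .getOrCreate()

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

// Word counts per 5-second batch, read from an assumed socket source.
val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Each micro-batch arrives as an ordinary RDD, so the usual RDD save applies.
counts.foreachRDD { rdd =>
  MongoSpark.save(rdd.map { case (word, count) => toDocument(word, count) })
}

ssc.start()
ssc.awaitTermination()
```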

In MongoDB deployments with mixed versions of mongod, you may get an Unrecognized pipeline stage name: '$sample' error, because older mongod versions do not support the $sample aggregation stage. To mitigate this situation, explicitly configure the partitioner to use and define the schema when using DataFrames.
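The mitigation above might look like the following sketch, which assumes a 2.x connector and its spark.mongodb.input.* settings. The URI, the field names in the schema, and the choice of MongoSplitVectorPartitioner are illustrative assumptions. Pick whichever partitioner fits your deployment and does not depend on sampling.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical schema: supplying it up front means no documents need to be sampled to infer one.
val schema = StructType(List(
  StructField("name", StringType, nullable = true),
  StructField("qty", IntegerType, nullable = true)
))

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explicit-partitioner-and-schema")
  // Placeholder URI (assumption): point this at your own deployment.
  .config("spark.mongodb.input.uri", "mongodb://localhost:27017/test.myCollection")
  // Assumed choice: a partitioner that is not sample-based.
  .config("spark.mongodb.input.partitioner", "MongoSplitVectorPartitioner")
  .getOrCreate()

// Read with the explicit schema instead of relying on schema inference.
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .schema(schema)
  .load()

df.printSchema()
```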

Some custom MongoDB BSON types, such as ObjectId, are unsupported in Spark.

The MongoDB Spark Connector converts custom MongoDB data types to and from extended JSON-like representations of those data types that are compatible with Spark. See DataTypes for a list of custom MongoDB types and their Spark counterparts.

To create a standard Dataset with custom MongoDB data types, use fieldTypes helpers:

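import org.bson.types.ObjectId
import com.mongodb.spark.sql.fieldTypes
import spark.implicits._ // supplies the Encoder for the case class (already in scope in spark-shell)
case class MyData(id: fieldTypes.ObjectId, a: Int)
val ds = spark.createDataset(Seq(MyData(fieldTypes.ObjectId(new ObjectId()), 99)))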

The preceding example creates a Dataset containing the following fields and data types:

  • The id field is a custom MongoDB BSON type, ObjectId, defined by fieldTypes.ObjectId.

  • The a field is an Int, a data type available in Spark.

To create a DataFrame with custom MongoDB data types, you must supply those types when you create the RDD and schema:

  • Create RDDs using custom MongoDB BSON types (e.g. ObjectId). The Spark Connector handles converting those custom types into Spark-compatible data types.

  • Declare schemas using the StructFields helpers for data types that are not natively supported by Spark (e.g. StructFields.objectId). Refer to DataTypes for the mapping between BSON and custom MongoDB Spark types.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
import org.bson.types.ObjectId
import com.mongodb.spark.sql.helpers.StructFields
// The ObjectId is passed as a nested Row holding its hex string.
val data = Seq(Row(Row(new ObjectId().toHexString()), 99))
val rdd = spark.sparkContext.parallelize(data)
val schema = StructType(List(StructFields.objectId("id", true), StructField("a", IntegerType, true)))
val df = spark.createDataFrame(rdd, schema)

The preceding example creates a DataFrame containing the following fields and data types:

  • The id field is a custom MongoDB BSON type, ObjectId, defined by StructFields.objectId.

  • The a field is an Int, a data type available in Spark.

