Filters and SQL
Filters
Note
When using filters with DataFrames or the R API, the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark.
Use filter() to read a subset of data from your MongoDB collection.
Consider a collection named fruit that contains the following documents:
{ "_id" : 1, "type" : "apple", "qty" : 5 } { "_id" : 2, "type" : "orange", "qty" : 10 } { "_id" : 3, "type" : "banana", "qty" : 15 }
First, set up a DataFrame that connects to your default MongoDB data source:
df <- read.df("", source = "com.mongodb.spark.sql.DefaultSource")
Note
The empty argument ("") refers to the file to use as a data source. In this case, the data source is a MongoDB collection rather than a file, so the argument is left empty.
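The connector also lets you name the database and collection directly instead of relying on the defaults from your Spark session configuration. The following is a minimal sketch, assuming the fruit collection lives in a database named test; the database and collection names here are illustrative, and the option names are passed through to the MongoDB data source:

# Sketch: read a specific database and collection instead of the session
# defaults. The "test" database name is an assumption for illustration.
df <- read.df("", source = "com.mongodb.spark.sql.DefaultSource",
              database = "test", collection = "fruit")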
The following operation filters the data and includes records where the qty field is greater than or equal to 10:
head(filter(df, df$qty >= 10))
The operation prints the following output:
  _id qty   type
1   2  10 orange
2   3  15 banana
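As the note at the top of this page explains, the connector translates this filter into an aggregation pipeline so that MongoDB, not Spark, does the filtering. If you want to check the pushdown, one option is SparkR's explain() function, shown in this sketch; the exact plan output depends on your Spark and connector versions:

# Print the query plan for the filtered DataFrame. The plan lists the
# MongoDB relation with the qty filter pushed down to the data source.
explain(filter(df, df$qty >= 10))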
SQL
Before running SQL queries on your dataset, you must register a temporary view for the dataset.
The following example registers a temporary view named temp, then uses SQL to query for records in which the type field contains the letter e:
createOrReplaceTempView(df, "temp")
some_fruit <- sql("SELECT type, qty FROM temp WHERE type LIKE '%e%'")
head(some_fruit)
In the sparkR shell, the operation prints the following output:
    type qty
1  apple   5
2 orange  10
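Any Spark SQL statement that works against a temporary view works here as well. The following is a minimal sketch, assuming the temp view registered above; it totals the qty field per type and returns the result to the driver as a local R data.frame with collect():

# Aggregate with SQL, then bring the result back to the driver as a
# local R data.frame.
totals <- sql("SELECT type, SUM(qty) AS total_qty FROM temp GROUP BY type")
collect(totals)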