Hi everyone,
I’m trying to launch a Spark job locally that connects to my production Atlas cluster (M20). For testing purposes, I have opened the cluster’s IP access list to the entire internet (0.0.0.0/0).
It seems to connect correctly: when I create a DataFrame from a collection and call df.printSchema(), the collection schema is printed correctly on screen.
However, if I run other commands such as df.show(), I get this error from the MongoDB Spark connector library:
Py4JJavaError: An error occurred while calling o49.showString.
: java.lang.NoSuchMethodError: org.apache.spark.sql.types.StructType.toAttributes()Lscala/collection/immutable/Seq;
at com.mongodb.spark.sql.connector.schema.InternalRowToRowFunction.<init>
...
I’m using:
Spark version: 3.4.1
Scala version: 2.12
Jars passed to the Spark configuration:
jars = [
    "mongo-spark-connector_2.13-10.1.1.jar",
    "mongodb-driver-sync-4.10.0.jar",
    "mongodb-driver-core-4.10.0.jar",
    "bson-4.10.0.jar",
]
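Since the failure is a Scala-level NoSuchMethodError, a quick way to double-check which Spark/Scala build the local session actually runs on would be something like the following (the py4j call into scala.util.Properties is an assumption on my side; spark-submit --version prints the same information):

# Sanity check: which Spark/Scala versions is the session really on?
print(spark.version)  # prints: 3.4.1
# Assumption: reading the JVM-side Scala version through py4j
print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # prints e.g.: version 2.12.17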
For full clarity and transparency, this is the complete code:
from pyspark.sql import SparkSession

# Jars to pass to the Spark configuration through the "spark.driver.extraClassPath" property
jars = [
    "mongo-spark-connector_2.13-10.1.1.jar",
    "mongodb-driver-sync-4.10.0.jar",
    "mongodb-driver-core-4.10.0.jar",
    "bson-4.10.0.jar",
]
jar_path = "/Users/matt/Downloads"

# Build a colon-separated classpath string from the jar list
mongo_jar = ":".join(jar_path + "/" + jar for jar in jars)

# Create a Spark session
uri = "mongodb+srv://<username>:<pwd>@<cluster_network>/<database>"
database = "maps"
collection = "users"

spark = SparkSession.builder \
    .appName("MongoDB Spark Connector") \
    .config("spark.driver.extraClassPath", mongo_jar) \
    .getOrCreate()

# Read data from MongoDB
df = spark.read.format("mongodb") \
    .option("connection.uri", uri) \
    .option("database", database) \
    .option("collection", collection) \
    .load()

# Print schema
df.printSchema()  # This correctly prints the schema

# Show rows
df.show()  # This throws the error above
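For reference, I believe an equivalent setup would let Spark resolve the connector and its driver dependencies from Maven instead of passing local jars; the coordinates below (including the _2.12 suffix matching my Scala version) are my assumption of the right artifact:

# Sketch: same session, resolving the connector via Maven instead of local jars
# (assuming org.mongodb.spark:mongo-spark-connector_2.12:10.1.1 is the artifact
# matching a Scala 2.12 build of Spark; the driver jars come in transitively)
spark = SparkSession.builder \
    .appName("MongoDB Spark Connector") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1") \
    .getOrCreate()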