Is there an ETA on when the Spark Connector will be working with Spark 3.5?

Thomas_Hackman · February 19, 2024, 10:09pm

Our goal is to push rows from our Databricks cluster into Mongo Atlas so we can use mongo as a serving layer. We just want to take one table and upsert any changes into a mongo collection. Fyi: We do use a lot of structs and array.

We’ve tried all of the examples from the mongo documentation (https://www.mongodb.com/docs/spark-connector/current/ ) and none of it has worked due to version compatibility issues. I could post all of the code that we’ve tried but it all came verbatim from the mongo docs.

We are running Databricks 14.1 because we need some of the preview features that they have released in the past year. Our Databricks cluster is setup as Databricks 14.1 (includes Apache Spark 3.5.0, Scala 2.12) . There isn’t any way for us to use an earlier version of Spark.

We have it setup to install this connector library on startup: org.mongodb.spark:mongo-spark-connector_2.12:10.2.1

We have all of the firewall and authentication issues solved. We can successfully connect to mongo from a databricks notebook but we can’t do anything with data. The only call that we’ve been able to get working is df.printSchema(). And that does successfully talk to spark and bring back the document schema.

Any attempt to read or write document content fails.
Read attempt:

df = spark.read.format("mongodb").option("database", "sample_supplies").option("spark.mongodb.read.connection.uri", connectionString).option("collection","sales").load()
display(df) 

java.lang.NoClassDefFoundError: Could not initialize class com.mongodb.spark.sql.connector.read.MongoScanBuilder

Write attempt:

dataFrame = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77),
   ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)], ["name", "age"])

dataFrame.write.format("mongodb").mode("append").option("spark.mongodb.write.connection.uri", connectionString).option("database", "people").option("collection", "contacts").save()

org.apache.spark.SparkException: Writing job failed.
Job aborted due to stage failure: Task 1 in stage 9.0 failed 4 times, most recent failure: Lost task 1.3 in stage 9.0 (TID 90) (10.139.64.11 executor 0): java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/catalyst/encoders/ExpressionEncoder;
at com.mongodb.spark.sql.connector.schema.InternalRowToRowFunction.<init>(InternalRowToRowFunction.java:44)
at com.mongodb.spark.sql.connector.schema.RowToBsonDocumentConverter.<init>(RowToBsonDocumentConverter.java:84)
at com.mongodb.spark.sql.connector.write.MongoDataWriter.<init>(MongoDataWriter.java:74)
at com.mongodb.spark.sql.connector.write.MongoDataWriterFactory.createWriter(MongoDataWriterFactory.java:53)

I saw a jira ticket (SPARK-413) and a PR for the issues with InternalRowToRowFunction but it wasn’t clear if that was a very specific point fix or if all read and write operations will be working after the next connector update.

Is there any ETA on when the Spark Connector will be working with Spark 3.5?

I also found this post on Medium where someone forked the connector and fixed some of the issues. I wasn’t sure if this would be helpful to the devs. Enhanced MongoDB Connector for Spark 3.5 | by Kürşat Kurt | xWorks Technology Blog

I would include more screenshots and links but as a new user the forum is only allowing one link and one image.

Prakul_Agarwal · February 20, 2024, 7:50pm

Hi @Thomas_Hackman , The https://jira.mongodb.org/browse/SPARK-413 is slated for the next release, 10.3.0, which is scheduled for the current quarter. The PR that you have referenced is community-contributed and is not necessarily a complete solution. Our team will be going through the various scenarios for full support of the new versions of Spark.

Thomas_Hackman · February 26, 2024, 4:07pm

Hi @Prakul_Agarwal , thanks for the response. Can you clarify your last sentence. Will all of the work be complete before the next release (10.3) or are you saying that only spark-413 pr (a small patch) will be in release 10.3 and that there is a lot more work to be done and Our team will be going through the various scenarios for full support of the new versions of Spark. for a future release?

Prakul_Agarwal · February 29, 2024, 9:38am

We will come out with full support for a few new versions of Spark in MongoDB Spark Connector v10.3

Yaar_BDBQ · March 5, 2024, 8:59am

Hey @Prakul_Agarwal
We need the connector’s support for Spark 3.5.0, and any delay of the ETA (“…the current quarter.”) will be problematic for us and our clients.

As Spark 3.5.1 just came out (February 23rd according to the Downloads page), will that cause a delay for the new version’s release or does the ETA mentioned is still relevant?
Thank you

Sean_Smith · March 11, 2024, 4:00pm

Hey Thomas,

We ran into this same issue last week and found out that either the Databricks implementation of the driver or the driver itself, or the documentation has a typo. Try ‘mongo’ instead of ‘mongodb’:

dataFrame.write.format("mongo").mode("append")...

Thomas_Hackman · March 13, 2024, 7:22pm

Hi Sean,
I agree that the Databricks docs are outdated and mention “mongo” but I that my team is running into.

dataFrame.write.format(“mongo”).mode(“append”)…
This call is depricated. It is the old syntax that uses the 3.x connector. The docs from Mongo are here: https://www.mongodb.com/docs/spark-connector/v3.0/python/read-from-mongodb/

We are using: dataFrame.write.format(“mongodb”).mode(“append”)…
Because that call uses the new connector that has better support for spark streaming.
The announcement for the change is here.
Introducing the Newest Version of the MongoDB Spark Connector | MongoDB Blog