I am very new to Spark. Could someone please help me understand how to install the missing Bson reference? I couldn’t find this in the sample code or the MongoDB PySpark documentation.
Hi Robert, thank you for your reply. My apologies for not getting back to you earlier; I had forgotten about this post.
Thanks for the link to your Docker image; I’ll take a look. Do you have any instructions on how to set up all the dependencies? I have been through the MongoDB Spark documentation and couldn’t find a workable solution.
I don’t remember exactly what the solution was, but I think it might have been an issue with my environment. I would try a clean installation if you can. If you are still having issues, get back to me and I’ll share some PySpark code with a MongoDB connection, along with the commands I use to submit to the cluster.
Hi Ben
I am using an AWS EMR instance where I installed MongoDB 6.
I am using Spark 3+. I have used the MongoDB Spark connectors as provided by MongoDB.
I tried all the different options available in the documentation, but no luck.
I am trying to connect from a notebook.
Can you please help?
Thanks
Saswata
I’m not familiar with AWS EMR, so I’m probably not much help to you. The only thing I can think of is that when I submit a job to the cluster, I have to specify which packages to load. For example, this is the command I execute:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 --driver-memory 6G --master spark://192.168.1.13:7077 ./some_pyspark.py
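For reference, a minimal version of the kind of script I submit looks something like this (a sketch only; the URI, database, and collection names are placeholders, and the "mongo" format name matches the 3.0.1 connector used above):

from pyspark.sql import SparkSession

# Build a session; the connector package itself is supplied by --packages on spark-submit.
spark = (
    SparkSession.builder
    .appName("mongo-test")
    # Placeholder URI: point this at your own MongoDB instance, database, and collection.
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
    .getOrCreate()
)

# Read the collection into a DataFrame via the v3.x connector's "mongo" source.
df = spark.read.format("mongo").load()
df.printSchema()
df.show(5)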
Is it possible that when you execute the notebook, it isn’t including the MongoDB packages? Are you able to validate your solution outside of AWS (i.e. a locally installed cluster and MongoDB instance)?
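If the notebook spins up its own Spark session, you might be able to pull the connector in at session creation with spark.jars.packages. A rough sketch (note this only takes effect if set before the session starts; it does nothing to an already-running session, and the URI is a placeholder):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Ask Spark to fetch the connector from Maven when the session starts.
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
    # Placeholder URI: replace with your MongoDB connection string.
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
    .getOrCreate()
)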