I am using the Java mongo-spark-connector to query a collection. The collection has a field that is defined as Timestamp. I can read this collection properly using mongo-java-driver. When I try to create an RDD from this collection using Spark, I get an error: BsonTimestamp is not serializable.
Is there a workaround for this problem?
The BsonTimestamp class in the MongoDB Java driver is not marked as serializable (it does not implement java.io.Serializable). When Spark serializes objects to distribute them across the cluster, it requires that all of them be serializable.
By converting the BsonTimestamp values to a serializable type such as java.util.Date before Spark has to serialize them, you can avoid the serialization error and process the collection using Spark.
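To illustrate the requirement, here is a minimal sketch using only the JDK, assuming Spark's default Java serialization is in use. RawTimestamp is a hypothetical stand-in for a type that does not implement java.io.Serializable (it is not the driver's class), and SerializationCheck is an illustrative name as well.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

class SerializationCheck {
    // Hypothetical stand-in for a type that does not implement Serializable.
    static final class RawTimestamp {
        final int seconds;
        RawTimestamp(int seconds) { this.seconds = seconds; }
    }

    // Returns true if Java serialization accepts the object, false if it rejects it.
    static boolean isJavaSerializable(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            // Thrown when the object (or something it references) is not Serializable.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Passing a RawTimestamp to isJavaSerializable is rejected, while a java.util.Date is accepted, which is why converting to Date sidesteps the error.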
Let us know if this helps
The exception is thrown before I have access to the Document. Is there a way to intercept it, or to add my own serializer to the pipeline before it is read?
Try something along these lines:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BsonTimestamp;
import org.bson.Document;
import java.util.Date;

public class MongoSparkConnectorExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MongoSparkConnectorExample").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Your existing code to read the collection using mongo-spark-connector
        // (assumed here to produce a JavaRDD<Document> named mongoRDD)
        // ...
        // Convert the BsonTimestamp field to a serializable java.util.Date
        JavaRDD<Document> transformedRDD = mongoRDD.map(document -> {
            Object value = document.get("timestamp");
            if (value instanceof BsonTimestamp) {
                // getTime() returns seconds since the epoch; Date expects milliseconds
                Date timestamp = new Date(((BsonTimestamp) value).getTime() * 1000L);
                document.put("timestamp", timestamp);
            }
            return document;
        });
        // Continue working with the transformed RDD
        // ...
    }
}
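A note on the conversion in the map step above: a BSON timestamp is a 64-bit value whose high 32 bits are seconds since the Unix epoch and whose low 32 bits are an ordinal increment within that second, so converting to Date keeps the seconds and drops the increment. The following JDK-only sketch shows the arithmetic. TimestampParts is a hypothetical helper, not part of the driver or the connector, and it assumes you start from the raw long that BsonTimestamp.getValue() returns.

```java
import java.io.Serializable;
import java.util.Date;

// Hypothetical serializable holder; not part of the MongoDB driver or Spark connector.
final class TimestampParts implements Serializable {
    private static final long serialVersionUID = 1L;

    final int seconds;   // seconds since the Unix epoch (high 32 bits)
    final int increment; // ordinal within that second (low 32 bits)

    TimestampParts(int seconds, int increment) {
        this.seconds = seconds;
        this.increment = increment;
    }

    // Split the raw 64-bit timestamp value into its two 32-bit halves.
    static TimestampParts fromRaw(long raw) {
        return new TimestampParts((int) (raw >> 32), (int) raw);
    }

    // Date has millisecond precision; the increment cannot be represented and is dropped.
    Date toDate() {
        return new Date(seconds * 1000L);
    }
}
```

If the increment matters for your processing, storing both halves in a serializable holder like this (or as two plain fields in the Document) keeps it, whereas a Date alone does not.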