I am working on a Spark application that reads data from MongoDB, where the collections may contain millions of documents with differing structures.
Recently, I migrated from mongo-spark-connector 2.4.4 to 10.1.1.
Connector 10.1.1 samples 1000 documents (the default) to infer the schema when creating the DataFrame.
Connector 2.4.4 provided MongoSpark.load(), which returns a MongoRDD[Document].
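For context, the 2.4.4 read looked roughly like this (a sketch; the connection URI, database, and collection names are placeholders for my actual setup):

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.rdd.MongoRDD
import org.apache.spark.sql.SparkSession
import org.bson.Document

// Spark session configured with the Mongo input URI (placeholder values)
val spark = SparkSession.builder()
  .appName("mongo-rdd-read")
  .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycoll")
  .getOrCreate()

// 2.4.4 exposed an RDD of raw BSON Documents, so no schema inference was involved
val rdd: MongoRDD[Document] = MongoSpark.load(spark.sparkContext)
```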
When dealing with collections of millions of documents with different structures, 10.1.1 could lead to data loss, because the schema is inferred from only 1000 documents, and setting the sample size to a million is not practical.
Is there a way to overcome this problem? Or is there still a way to get an RDD of documents, as the earlier connector provided?
RDD support is not exposed in v10.x of the Spark Connector. You still get a DataFrame with an inferred schema, as described in https://www.mongodb.com/docs/spark-connector/current/batch-mode/batch-read/#schema-inference
You have two options: (a) explicitly specify the schema instead of relying on auto-inference, or (b) increase the number of documents sampled for schema inference (although this leads to a proportional increase in initial latency). See the sketch after the option description below.
sampleSize
The number of documents to sample from the collection when inferring the schema.
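A rough sketch of both approaches with the 10.x connector (the URI, database, collection, field names, and sample size below are placeholders to adapt to your own data):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("mongo-df-read")
  .config("spark.mongodb.read.connection.uri", "mongodb://host:27017")
  .getOrCreate()

// Option (a): declare the schema up front so no sampling is needed.
// The fields here are placeholders for your own document structure.
val explicitSchema = StructType(Seq(
  StructField("_id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("attributes", MapType(StringType, StringType), nullable = true)
))

val dfWithSchema = spark.read
  .format("mongodb")
  .schema(explicitSchema)
  .option("database", "mydb")
  .option("collection", "mycoll")
  .load()

// Option (b): keep auto-inference but sample more documents
// (larger samples increase the time taken before the read starts).
val dfLargerSample = spark.read
  .format("mongodb")
  .option("database", "mydb")
  .option("collection", "mycoll")
  .option("sampleSize", "100000")
  .load()
```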
Thanks @Prakul_Agarwal for the response!