Problem
When running Apache Spark jobs that process data from MongoDB, you receive a MongoTypeConversionException error message.
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a DoubleType (value: BsonString{value='XX,XXX'})
at com.mongodb.spark.sql.MapFunctions$.convertToDataType(MapFunctions.scala:214)
at com.mongodb.spark.sql.MapFunctions$.$anonfun$documentToRow$1(MapFunctions.scala:37)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
Cause
There is a schema mismatch between the data types the Spark job expects and the actual data types in the MongoDB source.
Specifically, the error message above shows Spark attempting to cast a STRING value from MongoDB into a DoubleType. ANSI compliance settings in Spark SQL enforce strict type checking and cause the job to fail when data types do not match exactly.
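The following minimal sketch, run in a notebook where spark is the active SparkSession, illustrates this ANSI behavior in general. The value '12,345' is only a stand-in for the masked value in the error message.

spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    # With ANSI compliance enabled, casting a non-numeric string raises an error.
    spark.sql("SELECT CAST('12,345' AS DOUBLE) AS amount").show()
except Exception as e:
    print("ANSI mode rejects the cast:", type(e).__name__)

spark.conf.set("spark.sql.ansi.enabled", "false")
# With ANSI compliance disabled, the same cast silently returns NULL.
spark.sql("SELECT CAST('12,345' AS DOUBLE) AS amount").show()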
Solution
Validate your source data and update your code to make sure data types match.
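For example, one way to align the types is to read the mismatched field as a string and cast it explicitly after cleaning it. The sketch below assumes a hypothetical amount field that stores numbers as strings with a thousands separator, and hypothetical database and collection names; the connector format name also depends on your connector version.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Read the mismatched field as a string so the connector does not attempt the cast.
schema = StructType([StructField("amount", StringType())])

df = (spark.read
      .format("mongo")                   # use "mongodb" with connector 10.x and later
      .schema(schema)
      .option("database", "sales")       # hypothetical database name
      .option("collection", "orders")    # hypothetical collection name
      .load())

# Remove the thousands separator, then cast explicitly to double in Spark.
clean_df = df.withColumn("amount", F.regexp_replace("amount", ",", "").cast("double"))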
If validating and updating are not feasible, you can disable ANSI compliance in Spark SQL to allow more lenient type conversions. Add the following configurations within a notebook cell or to your cluster settings.
spark.conf.set("spark.sql.ansi.enabled","false")
spark.conf.set("spark.sql.storeAssignmentPolicy","LEGACY")
Important
With these configurations, Spark may return null results instead of throwing errors when invalid inputs are encountered in SQL operators or functions, reducing error visibility. Do not use these configurations if data integrity is critical.
For more information, refer to the ANSI compliance in Databricks Runtime (AWS | Azure | GCP) documentation.