Job failures when running Apache Spark jobs processing MongoDB data

Validate your source data to make sure data types match, or disable ANSI compliance in Spark SQL.

Written by manikandan.ganesan

Last published at: January 17th, 2025

Problem

When running Apache Spark jobs that involve data processing with MongoDB, you receive the following error message.

 

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a DoubleType (value: BsonString{value='XX,XXX'})
at com.mongodb.spark.sql.MapFunctions$.convertToDataType(MapFunctions.scala:214)
at com.mongodb.spark.sql.MapFunctions$.$anonfun$documentToRow$1(MapFunctions.scala:37)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)

 

Cause

There is a schema mismatch between the data types the Spark job expects and the actual data types in the MongoDB source. 

Specifically, in the previous error message, the job attempts to cast a STRING value from MongoDB (a BsonString such as 'XX,XXX', which contains characters that cannot be parsed as a number) into a DoubleType in Spark. When ANSI compliance is enabled, Spark SQL enforces strict type checking, so the job fails when data types do not match exactly.
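You can see the strict casting behavior that ANSI compliance enforces directly in Spark SQL. The following is a minimal sketch, not the connector's own code path; the literal '12,345' is a placeholder standing in for the masked value in the error message.

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('12,345' AS DOUBLE)").show()
// With ANSI mode enabled, this fails with a cast error because the string is not a valid number.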

 

Solution

Validate your source data and update your code to make sure data types match.
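If you control the read, one way to make the types match is to read the problematic field as a string with an explicit schema, clean it, and cast it yourself. The following is a minimal sketch with hypothetical database, collection, and field names, and it assumes your MongoDB connection URI is already configured. The data source short name depends on your connector version ("mongo" for the 3.x connector shown in the stack trace, "mongodb" for 10.x).

import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema: read the numeric-looking field as a string first.
val schema = StructType(Seq(
  StructField("_id", StringType),
  StructField("amount", StringType)
))

val df = spark.read
  .format("mongo")                    // use "mongodb" for connector 10.x
  .option("database", "mydb")         // hypothetical database name
  .option("collection", "orders")     // hypothetical collection name
  .schema(schema)                     // an explicit schema avoids sampled type inference
  .load()

// Remove thousands separators such as "12,345" before casting to double.
val cleaned = df.withColumn("amount", regexp_replace(col("amount"), ",", "").cast("double"))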

If validating and updating are not feasible, you can disable ANSI compliance in Spark SQL to allow more lenient type conversions. Add the following configurations within a notebook cell or to your cluster settings.  

 

spark.conf.set("spark.sql.ansi.enabled","false")
spark.conf.set("spark.sql.storeAssignmentPolicy","LEGACY")

 

Important

With these configurations, Spark may return null results instead of throwing errors when invalid inputs are encountered in SQL operators or functions, reducing error visibility. Do not use these configurations if data integrity is critical. 
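For example, with ANSI compliance disabled, the same kind of invalid cast silently produces a null instead of failing the job. A minimal sketch, using the masked value from the error message as a placeholder:

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('XX,XXX' AS DOUBLE) AS value").show()
// Returns a single row with value = null instead of raising an error.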


For more information, refer to the ANSI compliance in Databricks Runtime (AWS | Azure | GCP) documentation.