Problem
You try to run an Apache Spark job that performs multiple data transformations. The job fails with the following error message.
java.lang.NumberFormatException: Character N is neither a decimal digit number, decimal point, nor "e" notation exponential mark.
Reviewing the full stack trace shows that the error originates in the StatsEstimation
component of the Spark Catalyst optimizer.
at java.base/java.math.BigDecimal.<init>(BigDecimal.java:586)
at java.base/java.math.BigDecimal.<init>(BigDecimal.java:916)
at scala.math.BigDecimal$.decimal(BigDecimal.scala:53)
at scala.math.BigDecimal$.decimal(BigDecimal.scala:56)
at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:344)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinaryForNumeric(FilterEstimation.scala:989)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinary(FilterEstimation.scala:672)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateSingleCondition(FilterEstimation.scala:509)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateFilterSelectivity(FilterEstimation.scala:437)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateFilterSelectivity(FilterEstimation.scala:374)
at com.databricks.sql.optimizer.OptimizeLimit.com$databricks$sql$optimizer$OptimizeLimit$$prefetchPartitionsWithinLimit(OptimizeLimit.scala:444)
at com.databricks.sql.optimizer.OptimizeLimit$$anonfun$rewrite$2.applyOrElse(OptimizeLimit.scala:404)
at com.databricks.sql.optimizer.OptimizeLimit$$anonfun$rewrite$2.applyOrElse(OptimizeLimit.scala:398)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:521)
Cause
During query planning, the Spark Catalyst optimizer evaluates numeric expressions as part of cost-based optimization (CBO). As part of CBO, the optimizer attempts to estimate selectivity for filter conditions using column-level statistics.
When these statistics include invalid or improperly formatted values, the optimizer attempts to parse non-numeric strings where numeric values are expected. As a result, the optimizer's numeric estimation routines, such as those that use BigDecimal, throw a NumberFormatException.
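The failure mode can be reproduced in miniature outside Spark. The following sketch uses plain Python with decimal.Decimal as a stand-in for Java's BigDecimal; the function name and the statistics values are made up for illustration, and Python raises InvalidOperation where Java raises NumberFormatException.

```python
from decimal import Decimal, InvalidOperation

def estimate_range(stats_min: str, stats_max: str) -> float:
    """Toy stand-in for a stats-estimation routine: parse a column's
    stored min/max statistics as exact decimals before estimating."""
    lo = Decimal(stats_min)   # raises InvalidOperation on non-numeric input,
    hi = Decimal(stats_max)   # much as BigDecimal throws NumberFormatException
    return float(hi - lo)

print(estimate_range("1.5", "9.5"))   # valid statistics parse fine

try:
    estimate_range("N/A", "9.5")      # corrupted statistics value
except InvalidOperation as e:
    print("stats parse failed:", type(e).__name__)
```

The point of the sketch is that the parser is strict by design: a single malformed statistics value is enough to abort the whole estimation, which is why the Spark job fails during planning rather than during execution.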
Additional context
Invalid or improperly formatted statistics metadata values occur when statistics are:
- Automatically collected and incorrectly inferred due to data skew or mixed-type values in a column.
- Manually injected or derived from legacy metadata with incompatible formatting.
- Persisted through schema evolution or partial overwrite of Delta or Parquet partitions without re-analyzing the table.
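Before correcting anything, you can inspect the statistics Spark currently holds for a suspect column. The following sketch assumes a hypothetical table named events with a numeric column named amount; adjust the names to your table.

```sql
-- Hypothetical table and column names; run on your cluster.
DESCRIBE EXTENDED events amount;
-- The output includes min, max, num_nulls, and distinct_count rows.
-- A non-numeric min or max for a numeric column points to corrupted statistics.
```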
Solution
You can either temporarily disable CBO or refresh the column statistics to correct the metadata.
Temporarily disable CBO
Disabling CBO bypasses the optimizer logic that relies on column statistics, effectively preventing Spark from executing the faulty estimation logic. Apply the following Spark setting at the cluster level.
spark.sql.cbo.enabled false
For details on how to apply Spark settings, refer to the “Spark configuration” section of the Compute configuration reference (AWS | Azure | GCP) documentation.
You can also disable the CBO from a notebook. Run the following command in a cell at the top of your notebook before running other queries.
spark.conf.set("spark.sql.cbo.enabled", "false")
Refresh column statistics to correct metadata
To restore valid column statistics and allow safe use of CBO, recompute the statistics for the affected table(s). In a cell at the top of a notebook or in the first lines of the SQL editor, run either of the following SQL commands, depending on whether you want to update specific columns or all columns.
Run the following command to update specific columns.
ANALYZE TABLE <table-name> COMPUTE STATISTICS FOR COLUMNS <column1>, <column2>, ...
To update all columns instead, run the following command.
ANALYZE TABLE <table-name> COMPUTE STATISTICS FOR ALL COLUMNS
Running ANALYZE TABLE triggers Spark to scan the data and compute min and max values, null counts, distinct counts, and histograms (if enabled) for each specified column. The updated metadata ensures that subsequent optimizer runs use valid statistics for plan estimation.
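As a concrete sketch, assuming a hypothetical table named sales with suspect columns amount and region, the repair sequence looks like this:

```sql
-- Recompute statistics for the affected columns (hypothetical names).
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS amount, region;

-- Verify the refreshed metadata before relying on CBO again.
DESCRIBE EXTENDED sales amount;
```

If you disabled CBO earlier as a workaround, re-enable it afterward with spark.conf.set("spark.sql.cbo.enabled", "true") so the optimizer can use the corrected statistics.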