Problem
When running the OPTIMIZE command on a Delta table after performing a MERGE INTO operation, you encounter a divide by zero error.
ArithmeticException: / by zero
JVM stacktrace:
java.lang.ArithmeticException
at com.databricks.sql.io.skipping.KdTreeClustering$.createKdTreeForDf(KdTreeClustering.scala:568)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.buildNewKdTree(KdTreeLoader.scala:241)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.buildNewKdTreeFromClusteredFiles(KdTreeLoader.scala:256)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.$anonfun$load$3(KdTreeLoader.scala:129)
at scala.Option$WithFilter.map(Option.scala:230)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.load(KdTreeLoader.scala:110)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner$.loadKdTree(OptimizeRunner.scala:1567)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.$anonfun$optimize$4(OptimizeRunner.scala:845)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.$anonfun$optimize$4$adapted(OptimizeRunner.scala:770)
at com.databricks.sql.transaction.tahoe.FileMetadataMaterializationTracker$.withTracker(FileMetadataMaterializationTracker.scala:226)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.$anonfun$optimize$1(OptimizeRunner.scala:770)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.withOperationTypeTag(DeltaLogging.scala:199)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.withOperationTypeTag$(DeltaLogging.scala:186)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.withOperationTypeTag(OptimizeRunner.scala:399)
Cause
The OPTIMIZE command includes soft-deleted files in processing.
Context
When a MERGE INTO operation deletes a significant number of rows from a Delta table, Delta Lake marks the affected records as deleted rather than physically removing them. This soft deletion allows for potential recovery or rollback, which is beneficial for data integrity and version control.
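For illustration, a MERGE INTO that soft-deletes matched rows might look like the following sketch. The table and column names (sales_records, staged_deletes, id) are hypothetical.
-- Hypothetical example: rows matching the staged keys are marked as deleted,
-- not physically removed, in the target Delta table.
MERGE INTO sales_records AS target
USING staged_deletes AS source
ON target.id = source.id
WHEN MATCHED THEN DELETE;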
Soft deletes, however, can cause data staleness in Delta tables because the table's metadata no longer reflects its current state. When the OPTIMIZE command processes these soft-deleted records, it encounters calculations that result in a divide by zero error.
Solution
Physically remove the soft-deleted records. Run the REORG TABLE command with the PURGE option on your Delta table before running the OPTIMIZE command.
REORG TABLE <your-table-name> APPLY (PURGE)
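For example, with a hypothetical table named sales_records, the sequence would be the following. The table name is illustrative.
-- Rewrite files to physically drop soft-deleted records, then optimize the table.
REORG TABLE sales_records APPLY (PURGE);
OPTIMIZE sales_records;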
Databricks recommends regularly running the REORG TABLE command with the PURGE option on your Delta tables to keep them clean and up to date.
Databricks also recommends the following additional best practices.
- Monitor the number of deleted rows in your Delta tables and consider archiving or deleting them if they are no longer needed (one way to check this is shown in the sketch after this list).
- Keep your Databricks environment up to date with the latest patches and updates to ensure optimal performance and stability.
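As referenced in the first bullet above, one way to monitor deleted rows is to inspect the Delta transaction history with DESCRIBE HISTORY, whose operationMetrics output includes row counts such as numTargetRowsDeleted for MERGE operations. The following sketch assumes a hypothetical table named sales_records and assumes your runtime supports querying DESCRIBE HISTORY as a subquery.
-- Inspect recent table operations; operationMetrics reports deleted-row counts.
DESCRIBE HISTORY sales_records;
-- Optionally filter to MERGE operations and extract the deleted-row metric.
SELECT version, timestamp, operationMetrics['numTargetRowsDeleted'] AS rows_deleted
FROM (DESCRIBE HISTORY sales_records)
WHERE operation = 'MERGE';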