Divide by zero arithmetic exception when running the OPTIMIZE command on a Delta table

Use the REORG TABLE command to purge soft-deleted records, then run OPTIMIZE.

Written by Vidhi Khaitan

Last published at: November 26th, 2024

Problem

When running the OPTIMIZE command on a Delta table after performing a MERGE INTO operation, you encounter a divide by zero error. 

 

ArithmeticException: / by zero

JVM stacktrace:
java.lang.ArithmeticException
at com.databricks.sql.io.skipping.KdTreeClustering$.createKdTreeForDf(KdTreeClustering.scala:568)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.buildNewKdTree(KdTreeLoader.scala:241)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.buildNewKdTreeFromClusteredFiles(KdTreeLoader.scala:256)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.$anonfun$load$3(KdTreeLoader.scala:129)
at scala.Option$WithFilter.map(Option.scala:230)
at com.databricks.sql.io.skipping.liquid.KdTreeLoader.load(KdTreeLoader.scala:110)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner$.loadKdTree(OptimizeRunner.scala:1567)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.$anonfun$optimize$4(OptimizeRunner.scala:845)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.$anonfun$optimize$4$adapted(OptimizeRunner.scala:770)
at com.databricks.sql.transaction.tahoe.FileMetadataMaterializationTracker$.withTracker(FileMetadataMaterializationTracker.scala:226)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.$anonfun$optimize$1(OptimizeRunner.scala:770)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.withOperationTypeTag(DeltaLogging.scala:199)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.withOperationTypeTag$(DeltaLogging.scala:186)
at com.databricks.sql.transaction.tahoe.commands.optimize.OptimizeRunner.withOperationTypeTag(OptimizeRunner.scala:399)
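
For illustration, a sequence like the following can lead to the error. This is a minimal sketch; the table and column names (sales_orders, cancelled_orders, order_id) are hypothetical and not part of the original report.

-- Merge that deletes a large number of matching rows; Delta Lake soft-deletes
-- the rows rather than physically removing them.
MERGE INTO sales_orders AS t
USING cancelled_orders AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN DELETE;

-- Running OPTIMIZE on the table afterwards can then fail with the
-- ArithmeticException: / by zero error shown above.
OPTIMIZE sales_orders;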

 

Cause

The OPTIMIZE command includes soft-deleted files in processing. 

 

Context

When a MERGE INTO operation deletes a significant number of rows from a Delta table, Delta Lake does not physically remove the deleted records. Instead, it labels them as deleted. This soft-deletion allows for potential recovery or rollback, which is beneficial for data integrity and version control.

 

Soft-deletes, though, can cause data staleness in Delta tables because the metadata associated with a table doesn’t reflect its current state. When the OPTIMIZE command processes these soft-deleted records, it encounters calculations that result in a divide by zero error.
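
On Databricks, deletion vectors are typically the mechanism behind these soft-deletes. As a sketch, assuming a hypothetical table named sales_orders, you can check whether deletion vectors are enabled on a table through its table properties.

-- Returns the value of the delta.enableDeletionVectors table property.
SHOW TBLPROPERTIES sales_orders ('delta.enableDeletionVectors');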
 

Solution

Physically remove soft-deleted records. Run the REORG TABLE command with the PURGE option on your Delta table before running the OPTIMIZE command. 

 

REORG TABLE <your-table-name> APPLY (PURGE)
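
As a minimal sketch of the full sequence, assuming a hypothetical table named sales_orders, purge the soft-deleted records first and then optimize:

-- Rewrite files to physically remove soft-deleted records.
REORG TABLE sales_orders APPLY (PURGE);

-- OPTIMIZE now runs against files without soft-deleted records.
OPTIMIZE sales_orders;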

 

Databricks recommends regularly running the REORG TABLE command with the PURGE option on your Delta tables to keep them clean and up-to-date.

 

Databricks also recommends the following best practices.

  • Monitor the number of deleted rows in your Delta tables and consider archiving or deleting them if they are no longer needed (a monitoring sketch follows this list).
  • Keep your Databricks environment up-to-date with the latest patches and updates to ensure optimal performance and stability.
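
For the first point, one way to monitor deleted rows is to review the table history. This is a sketch with a hypothetical table name; the operationMetrics column reports per-operation counts, and the exact metric names (for example, numTargetRowsDeleted for MERGE) can vary by operation and runtime version.

-- Review recent operations; inspect the operationMetrics column for
-- deletion counts such as numTargetRowsDeleted on MERGE operations.
DESCRIBE HISTORY sales_orders;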