Z-Ordering will be ineffective, not collecting stats
Problem
You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns.
AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns.
Note
Please review Z-Ordering (multi-dimensional clustering) for more information on data skipping and z-ordering.
Cause
Delta Lake collects statistics on the first 32 columns defined in your table schema. If the columns you are attempting to Z-Order are not in the first 32 columns, no statistics are collected for those columns.
Solution
Reorder the columns in your table, so the columns you are attempting to Z-Order are in the first 32 columns in your table.
You can use an ALTER TABLE
statement to reorder the columns.
ALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name]
For example, this statement brings the column with <column-name>
to the first column in the table.
ALTER TABLE <delta-table-name> CHANGE COLUMN <column-name> <column-name> <data-type> FIRST
Recompute the statistics after you have reordered the columns in the table.
import com.databricks.sql.transaction.tahoe._
import org.apache.spark.sql.catalyst.TableIdentifier
import com.databricks.sql.transaction.tahoe.stats.StatisticsCollection
val tableName = "<name of table>"
val deltaLog = DeltaLog.forTable(spark, TableIdentifier(tableName))
StatisticsCollection.recompute(spark, deltaLog)
Rerun the Z-Order on the table and it should complete successfully.