Z-Ordering will be ineffective, not collecting stats

Z-Ordering is ineffective, error about not collecting stats. Reorder table so the columns you want to optimize on are within the first 32 columns.

Written by mathan.pillai

Last published at: May 10th, 2022

Problem

You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns.

AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns.
Delete

Info

Please review Z-Ordering (multi-dimensional clustering) (AWS | Azure | GCP) for more information on data skipping and z-ordering.

Cause

Delta Lake collects statistics on the first 32 columns defined in your table schema. If the columns you are attempting to Z-Order are not in the first 32 columns, no statistics are collected for those columns.

Solution

Reorder the columns in your table, so the columns you are attempting to Z-Order are in the first 32 columns in your table.

You can use an ALTER TABLE statement to reorder the columns.

%sql

ALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name]

For example, this statement brings the column with <column-name> to the first column in the table.

%sql

ALTER TABLE <delta-table-name> CHANGE COLUMN <column-name> <column-name> <data-type> FIRST

Recompute the statistics after you have reordered the columns in the table.

%scala

import com.databricks.sql.transaction.tahoe._
import org.apache.spark.sql.catalyst.TableIdentifier
import com.databricks.sql.transaction.tahoe.stats.StatisticsCollection

val tableName = "<name of table>"
val deltaLog = DeltaLog.forTable(spark, TableIdentifier(tableName))

StatisticsCollection.recompute(spark, deltaLog)

Rerun the Z-Order on the table and it should complete successfully.