Problem
You notice that queries using the MAX() function on Delta tables with string or timestamp columns perform slowly.
Cause
Metadata-based query optimizations are not applied to string or timestamp columns by default, so these queries fall back to full table scans instead of being answered from Delta Lake statistics.
Additional context
Delta Lake uses collected statistics to optimize queries by skipping data files based on their min/max values. However, for string and timestamp columns the collected statistics may not capture exact values; for strings, Delta records only the first 32 characters, which may not represent the full range of values in those columns.
Consequently, the Delta Lake optimizer skips these columns when evaluating whether to perform metadata-based query optimizations (like data skipping or statistics pruning).
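To illustrate the behavior described above, here is a minimal, self-contained Python sketch. The file layout and statistic names are invented for illustration only (they are not Delta internals); it shows why exact min/max stats let MAX() be answered from metadata alone, while 32-character string prefixes cannot safely answer the same question.

```python
# Hypothetical per-file statistics, modeled after the idea of Delta's
# per-file min/max stats. Names like "max_id" are illustrative only.
files = [
    {"path": "part-0", "min_id": 1,   "max_id": 500,
     "max_name": "aardvark"[:32]},
    {"path": "part-1", "min_id": 501, "max_id": 900,
     # The true stored value is 40 chars; the stat keeps only a 32-char prefix.
     "max_name": ("z" * 40)[:32]},
]

def max_from_metadata(files, stat_key):
    """Answer MAX() using only file-level stats -- no data files are read."""
    return max(f[stat_key] for f in files)

# Integer stats are exact, so the metadata-only answer is correct.
print(max_from_metadata(files, "max_id"))  # 900

# String stats are truncated prefixes: the recorded max ("z" * 32) is
# not the true column max ("z" * 40), so the optimizer cannot return
# it as the answer and must scan the data instead.
truncated = max_from_metadata(files, "max_name")
print(truncated == "z" * 40)  # False
```

This is why exact statistics permit a metadata-only answer for numeric columns, while truncated string statistics force a scan unless the optimization discussed below is enabled.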
Solution
Enable metadata query optimizations for these column types by setting the following Apache Spark configuration. Run the following command in a notebook.
SET spark.databricks.delta.optimizeMetadataQuery.clusteredTable.enabled = TRUE;
This setting allows the Delta optimizer to use file-level metadata for string or timestamp columns when the table is clustered, even though full statistics are not available.
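As a hedged usage sketch, the setting can be applied in the same session as the affected query (the table and column names below are hypothetical, and assume a clustered Delta table):

```sql
-- Assumes a hypothetical Delta table `events` clustered by `event_ts`.
SET spark.databricks.delta.optimizeMetadataQuery.clusteredTable.enabled = TRUE;

-- With the setting enabled, this MAX() can use file-level metadata
-- instead of scanning the full table.
SELECT MAX(event_ts) FROM events;
```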