Info
This article applies to Databricks Runtime 15.2 and above.
Problem
When working with Delta tables, you notice that your DESCRIBE HISTORY, DESCRIBE FORMATTED, and DESCRIBE EXTENDED queries execute slowly. You may also see bloated Delta logs or driver out-of-memory (OOM) errors.
Cause
Your Delta tables are over-partitioned: individual partitions hold less than 1 GB of data, whether in a single file or spread across multiple small files, even though the table holds enough data to support larger partitions.
When a Delta table is divided into too many partitions, each containing a small amount of data, performance can degrade because the system must manage the increased number of files and the associated overhead.
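As a rough way to check whether a table matches this description, you can compare its average partition size against the 1 GB guideline. The sketch below is only illustrative: the function names are hypothetical, and it assumes you already know the table's total size in bytes (for example, from the sizeInBytes column returned by DESCRIBE DETAIL) and its number of partitions. An average can hide skew, so treat the result as a hint rather than a diagnosis.

```python
GIB = 1024 ** 3  # bytes in one gibibyte, used here as the "1 GB" guideline


def average_partition_bytes(table_size_bytes: int, partition_count: int) -> float:
    """Return the mean number of bytes stored per partition."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    return table_size_bytes / partition_count


def is_over_partitioned(
    table_size_bytes: int,
    partition_count: int,
    threshold_bytes: int = GIB,
) -> bool:
    """Flag a table whose average partition is smaller than the threshold."""
    return average_partition_bytes(table_size_bytes, partition_count) < threshold_bytes


# Hypothetical figures: a 50 GiB table split into 2,000 partitions
# averages about 25.6 MiB per partition, far below the guideline.
size_bytes = 50 * GIB
partitions = 2_000
avg_mib = average_partition_bytes(size_bytes, partitions) / (1024 ** 2)
print(f"average partition size: {avg_mib:.1f} MiB")
print("over-partitioned:", is_over_partitioned(size_bytes, partitions))
```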
Solution
Implement liquid clustering to simplify data layout decisions and optimize query performance. Liquid clustering helps distribute data more efficiently and reduce the overhead associated with managing a large number of small partitions.
For more information, please review the Use liquid clustering for Delta tables (AWS | Azure | GCP) documentation.
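Because liquid clustering is not compatible with partitioning, moving an over-partitioned table to liquid clustering generally means writing the data into a new clustered table. The sketch below builds a CREATE TABLE ... CLUSTER BY ... AS SELECT statement as a string. The helper name, the table names, and the column names are all hypothetical, and the statement shape reflects our understanding of the liquid clustering documentation, so verify it against the documentation for your runtime before running it.

```python
from typing import Sequence


def clustered_ctas_sql(
    source_table: str,
    target_table: str,
    clustering_columns: Sequence[str],
) -> str:
    """Build a CTAS statement that copies source_table into a new
    liquid-clustered table. Names are interpolated as-is, so pass only
    trusted identifiers."""
    if not clustering_columns:
        raise ValueError("at least one clustering column is required")
    columns = ", ".join(clustering_columns)
    return (
        f"CREATE TABLE {target_table} "
        f"CLUSTER BY ({columns}) "
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical table and column names, for illustration only.
statement = clustered_ctas_sql(
    source_table="main.sales.events_partitioned",
    target_table="main.sales.events_clustered",
    clustering_columns=["event_date", "customer_id"],
)
print(statement)

# In a Databricks notebook, where a SparkSession named `spark` is
# available, the statement could then be run with:
#   spark.sql(statement)
```

Choose clustering columns from the columns your queries filter on most often, rather than simply reusing the old partition column.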
Alternatively, adjust the table's partitioning layout so that each partition contains approximately 1 GB of data or more: reduce the number of partitions and merge smaller files.
For more information on the ideal partition size, please refer to the When to partition tables on Databricks (AWS | Azure | GCP) documentation.
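If you keep partitioning, one way to reason about a coarser layout is to estimate the average partition size each candidate partition column would produce and pick the finest one that still meets the 1 GB guideline. The following is a minimal sketch under stated assumptions: the function name, the candidate granularities, and the partition counts are hypothetical, and the calculation assumes data is spread evenly across partitions.

```python
from typing import Optional, Sequence, Tuple

GIB = 1024 ** 3  # bytes in one gibibyte, used here as the "1 GB" guideline


def finest_viable_granularity(
    table_size_bytes: int,
    candidates: Sequence[Tuple[str, int]],
    min_partition_bytes: int = GIB,
) -> Optional[str]:
    """Return the first candidate whose average partition size meets the
    minimum. `candidates` holds (name, partition_count) pairs ordered from
    finest to coarsest. Returns None if no candidate qualifies, which
    suggests the table may be better left unpartitioned."""
    for name, partition_count in candidates:
        if partition_count > 0 and table_size_bytes / partition_count >= min_partition_bytes:
            return name
    return None


# Hypothetical 2 TiB table covering ten years of data.
table_size = 2 * 1024 * GIB
candidates = [("day", 3650), ("month", 120), ("year", 10)]

# Daily partitions would average about 0.56 GiB (too small);
# monthly partitions would average about 17 GiB.
print(finest_viable_granularity(table_size, candidates))  # prints "month"
```

After rewriting the table with the coarser partition column, running OPTIMIZE on it compacts the remaining small files.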