Addressing performance issues with over-partitioned Delta tables

Implement liquid clustering for improved performance.

Written by raphael.balogo

Last published at: October 16th, 2024

Info

This article applies to Databricks Runtime 15.2 and above.

Problem

When working with Delta tables, you notice that your DESCRIBE HISTORY, DESCRIBE FORMATTED, and DESCRIBE EXTENDED queries execute slowly. You may also see bloated Delta logs or driver out-of-memory (OOM) errors.

Cause

Your Delta tables are over-partitioned: a given partition holds less than 1 GB of data, whether in a single file or spread across multiple small files, even though the table has enough data to support larger partitions.

When a Delta table is divided into too many partitions, each containing a small amount of data, performance can degrade because the system has to manage a larger number of files and the overhead associated with them.
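As a rough illustration of the 1 GB rule of thumb above, the following Python sketch totals file sizes per partition and flags the partitions that fall below the threshold. The function name, the input shape (pairs of partition value and file size), and the sample data are assumptions made for this example; in practice the sizes would come from a listing of the table's data files.

```python
from collections import defaultdict

ONE_GB = 1024 ** 3  # 1 GiB in bytes, used here as the rule-of-thumb threshold


def find_small_partitions(files, threshold_bytes=ONE_GB):
    """Return {partition: total_bytes} for partitions below the threshold.

    `files` is an iterable of (partition_value, file_size_in_bytes) pairs.
    This input shape is hypothetical and chosen only for the sketch.
    """
    totals = defaultdict(int)
    for partition, size in files:
        totals[partition] += size
    return {p: total for p, total in totals.items() if total < threshold_bytes}


# Made-up sample data: two small daily partitions and one healthy one.
sample_files = [
    ("date=2024-10-01", 40 * 1024 ** 2),
    ("date=2024-10-01", 25 * 1024 ** 2),
    ("date=2024-10-02", 300 * 1024 ** 2),
    ("date=2024-10-03", 2 * ONE_GB),
]

small = find_small_partitions(sample_files)
for partition, total in sorted(small.items()):
    print(f"{partition}: {total / 1024 ** 2:.0f} MiB")
```

If most partitions show up in the result, the table is likely over-partitioned in the sense described here.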

Solution

Implement liquid clustering to simplify data layout decisions and optimize query performance. Liquid clustering helps distribute data more efficiently and reduce the overhead associated with managing a large number of small partitions.

For more information, please review the Use liquid clustering for Delta tables (AWS | Azure | GCP) documentation.
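The following Python sketch shows one possible way to move a partitioned table to liquid clustering. It assumes that liquid clustering cannot be combined with partitioning, so it copies the data into a new clustered table with CREATE TABLE ... CLUSTER BY ... AS SELECT and then runs OPTIMIZE to trigger clustering. The helper name, the table names, and the clustering columns are hypothetical, and the limit of four clustering columns is an assumption to confirm against the liquid clustering documentation.

```python
import re

# Accepts plain identifiers with up to three dot-separated parts
# (catalog.schema.table). Quoted identifiers are out of scope for this sketch.
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}$")
_COLUMN_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Assumption: at most four clustering columns are allowed.
MAX_CLUSTERING_COLUMNS = 4


def liquid_clustering_statements(source_table, target_table, clustering_columns):
    """Build SQL statements that copy a table into a liquid-clustered table."""
    for name in (source_table, target_table):
        if not _TABLE_NAME.match(name):
            raise ValueError(f"unsupported table name: {name!r}")
    if not 1 <= len(clustering_columns) <= MAX_CLUSTERING_COLUMNS:
        raise ValueError("expected between 1 and 4 clustering columns")
    for column in clustering_columns:
        if not _COLUMN_NAME.match(column):
            raise ValueError(f"unsupported column name: {column!r}")

    columns = ", ".join(clustering_columns)
    return [
        # Rewrite the data into a new, unpartitioned table with clustering keys.
        f"CREATE TABLE {target_table} CLUSTER BY ({columns}) "
        f"AS SELECT * FROM {source_table}",
        # Trigger clustering of the newly written data.
        f"OPTIMIZE {target_table}",
    ]


# Hypothetical table and column names, for illustration only.
statements = liquid_clustering_statements(
    "main.sales.events",
    "main.sales.events_clustered",
    ["event_date", "customer_id"],
)
for statement in statements:
    print(statement)
```

In a Databricks notebook, each generated statement could be passed to spark.sql(). Validate the new table before redirecting readers and writers to it.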

Alternatively, you can optimize the table partitioning layout so that each partition contains approximately 1 GB of data or more. To do this, reduce the number of partitions and merge smaller files.

For more information on the ideal partition size, please refer to the When to partition tables on Databricks (AWS | Azure | GCP) documentation.
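To make the sizing guideline concrete, this Python sketch estimates which candidate partition columns would keep the average partition at or above 1 GB. It assumes data is spread evenly across partition values, which real tables rarely satisfy exactly, so treat the result as a first approximation. The function name, the column names, and the numbers are made up for the example.

```python
ONE_GB = 1024 ** 3  # 1 GiB in bytes, the minimum partition size targeted here


def viable_partition_columns(table_bytes, distinct_counts, min_partition_bytes=ONE_GB):
    """Return {column: average_partition_bytes} for columns that meet the minimum.

    `distinct_counts` maps a candidate column to its number of distinct values.
    The average assumes an even distribution of data across those values.
    """
    viable = {}
    for column, distinct in distinct_counts.items():
        if distinct <= 0:
            continue  # skip columns with no usable cardinality estimate
        average = table_bytes / distinct
        if average >= min_partition_bytes:
            viable[column] = average
    return viable


# Hypothetical 500 GB table with three candidate partition columns.
table_bytes = 500 * ONE_GB
distinct_counts = {"event_hour": 8760, "event_date": 365, "event_month": 12}

result = viable_partition_columns(table_bytes, distinct_counts)
for column, average in result.items():
    print(f"{column}: about {average / ONE_GB:.1f} GB per partition")
```

In this made-up case, partitioning by hour would leave each partition far below 1 GB, while partitioning by date or month would stay above it.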