Manage the size of Delta tables

Recommendations that can help you manage the size of your Delta tables.

Written by Jose Gonzalez

Last published at: May 23rd, 2022

Delta tables are different from traditional tables. Delta tables provide ACID transactions and time travel, which means they maintain transaction logs and stale data files. These additional features require storage space.

In this article we discuss recommendations that can help you manage the size of your Delta tables.

Enable file system versioning

When you enable file system versioning, your storage bucket keeps multiple variants of the same data. Instead of deleting items, the file system creates new versions of them, which increases the storage space used by your Delta table.

Enable bloom filters

A Bloom filter index (AWS | Azure | GCP) is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. Databricks supports file-level Bloom filters; each data file can have a single Bloom filter index file associated with it. Before reading a file, Databricks checks the index file, and the file is read only if the index indicates that the file might match a data filter.

The size of a Bloom filter depends on the number of elements in the set for which the Bloom filter has been created and on the required false positive probability (FPP). The lower the FPP, the more bits are used per element and the more accurate the filter is, at the cost of more storage space.
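For example, you can create a file-level Bloom filter index with the CREATE BLOOMFILTER INDEX SQL command, tuning the FPP and expected element count to trade accuracy against storage. The table and column names below are placeholders, and the option values are illustrative only:

```sql
-- Create a Bloom filter index on a text-heavy column.
-- my_table and device_id are placeholder names.
-- fpp sets the false positive probability; numItems is the expected
-- number of distinct elements. Lowering fpp or raising numItems
-- increases the size of the index files.
CREATE BLOOMFILTER INDEX
ON TABLE my_table
FOR COLUMNS (device_id OPTIONS (fpp = 0.1, numItems = 50000000));
```

The index applies to data files written after it is created; rewrite existing files (for example with OPTIMIZE) if you want them indexed as well.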

Review your Delta logRetentionDuration policy

Log files are retained for 30 days by default. This value is configurable through the delta.logRetentionDuration table property, which you can set with the ALTER TABLE SET TBLPROPERTIES SQL command. The more days you retain, the more storage space you consume. For example, setting delta.logRetentionDuration = '365 days' keeps the log files for 365 days instead of the default 30.
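The property can be set as follows; the table name is a placeholder:

```sql
-- Retain Delta transaction log files for 365 days instead of the
-- default 30. my_table is a placeholder table name.
ALTER TABLE my_table
SET TBLPROPERTIES ('delta.logRetentionDuration' = '365 days');
```

Shortening this value reduces log storage but also limits how far back time travel can reach.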

VACUUM your Delta table

VACUUM (AWS | Azure | GCP) removes data files that are no longer referenced in the latest state of the table's transaction log and are older than a retention threshold. Files are deleted based on when they were logically removed from the Delta transaction log plus the retention period, not on their modification timestamps in the storage system. The default threshold is 7 days. Databricks does not automatically trigger VACUUM operations on Delta tables; you must run the command manually. VACUUM deletes obsolete files that are no longer needed.
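A minimal invocation looks like this; the table name is a placeholder, and RETAIN 168 HOURS simply spells out the 7-day default:

```sql
-- Remove data files that are no longer referenced by the table
-- and are older than the retention threshold.
-- my_table is a placeholder; 168 hours matches the 7-day default.
VACUUM my_table RETAIN 168 HOURS;
```

You can append DRY RUN to preview the list of files that would be deleted before committing to the operation. Keep the retention period long enough to cover any time travel queries you still need.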

OPTIMIZE your Delta table

The OPTIMIZE (AWS | Azure | GCP) command compacts many small Delta files into larger ones. This improves the overall query speed and performance of your Delta table by helping you avoid having too many small files. By default, OPTIMIZE produces files of approximately 1 GB.
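The command can be run on a whole table or restricted to recent partitions; the table name, partition column, and date below are placeholders:

```sql
-- Compact small data files into larger ones (about 1 GB by default).
-- my_table and event_date are placeholder names; the optional WHERE
-- clause limits compaction to a subset of partitions.
OPTIMIZE my_table WHERE event_date >= '2022-01-01';
```

Running OPTIMIZE before VACUUM is a common pattern: compaction logically removes the small files, and a later VACUUM physically deletes them once they pass the retention threshold.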
