Automatic VACUUM on write does not work with non-Delta tables

Manually run VACUUM to clear uncommitted files from the entire table.

Written by nikhil.jain

Last published at: September 12th, 2024

Problem

You have non-Delta tables and Databricks does not appear to be running VACUUM automatically. Symptoms include uncommitted files older than the retention threshold remaining on storage. You expect tables to be vacuumed on write, but the uncommitted files are not removed.

Cause

For non-Delta tables, VACUUM runs automatically at the end of every write job, but it only cleans the directories that the particular Apache Spark job touched. If an operation writes to a specific partition, VACUUM only affects that partition's directory, rather than the whole table. Uncommitted files in directories the job did not touch are therefore never removed automatically.
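
As an illustration, the following sketch uses a hypothetical non-Delta Parquet table named sales, partitioned by event_date; the table name, column names, and values are illustrative only.

-- Hypothetical non-Delta (Parquet) table partitioned by event_date.
CREATE TABLE IF NOT EXISTS sales (amount DOUBLE, event_date DATE)
USING PARQUET
PARTITIONED BY (event_date);

-- This write touches only the event_date=2024-09-01 directory, so the
-- automatic VACUUM that runs at the end of the job cleans only that
-- directory. Uncommitted files under other partitions are left in place.
INSERT INTO sales PARTITION (event_date = '2024-09-01') VALUES (42.0);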

Solution

You should manually run VACUUM to clear uncommitted files from the entire table.

  1. Identify the table and partitions that contain dirty data (see the sketch after this list for one way to locate them).
  2. Run a manual VACUUM on the entire table to remove uncommitted files older than the retention threshold. The default threshold is 7 days (168 hours), but you can adjust it as needed.
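
For step 1, a minimal sketch for locating the directories to inspect; the sales table name is hypothetical, while the commands themselves are standard Databricks SQL.

DESCRIBE TABLE EXTENDED sales;  -- the Location row shows the table's root directory
SHOW PARTITIONS sales;          -- lists the partition directories to check

You can then list those directories, for example with dbutils.fs.ls from a notebook, and look for uncommitted files older than the retention threshold.

For step 2, the general syntax is: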

VACUUM <table-name> RETAIN <number> HOURS;

For example, to VACUUM a table named <schema-name>.<table-name>, removing uncommitted files older than 1 hour, use:

VACUUM <schema-name>.<table-name> RETAIN 1 HOURS;

For more information, please review the VACUUM (AWS | Azure | GCP) documentation.