Problem
You have non-Delta tables, and VACUUM does not appear to be running automatically. Uncommitted files older than the retention threshold remain in the table directory, even though tables should be vacuumed on every write operation.
Cause
With non-Delta tables, VACUUM runs automatically at the end of every job, but it only cleans the directories that the particular Apache Spark job touches. If an operation writes to a specific partition, VACUUM cleans only that partition directory, not the whole table. As a result, uncommitted files in partitions that a job never touches are not removed.
Solution
You should manually run VACUUM to clear uncommitted files from the entire table.

- Identify the table and partitions that contain dirty data.
- Run a manual VACUUM on the entire table to remove uncommitted files that are older than the retention threshold. The default threshold is 7 days, but it can be adjusted as needed.
VACUUM [table_name] RETAIN [number] HOURS;
For example, to run VACUUM on a table named <schema-name>.<table-name> and retain files for 1 hour, use:
VACUUM <schema-name>.<table-name> RETAIN 1 HOURS;
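If you have several tables to clean up, you can generate the VACUUM statements in a notebook and submit each one through `spark.sql()`. The helper below is a minimal sketch: the function name and table names are illustrative, not part of any Databricks API, and it only builds the SQL strings for you to execute.

```python
def build_vacuum_statements(tables, retain_hours=168):
    """Build VACUUM statements for a list of tables.

    retain_hours defaults to 168 (7 days), matching the default
    retention threshold described above.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    return [f"VACUUM {t} RETAIN {retain_hours} HOURS" for t in tables]

# In a Databricks notebook, you would then run each statement, e.g.:
# for stmt in build_vacuum_statements(["sales.events"], retain_hours=1):
#     spark.sql(stmt)
```

Generating the statements separately from executing them lets you review exactly what will be vacuumed (and with what retention) before touching any data.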
For more information, please review the VACUUM (AWS | Azure | GCP) documentation.