VACUUM best practices on Delta Lake

Learn best practices for using and troubleshooting VACUUM on Delta Lake.

Written by mathan.pillai

Last published at: February 3rd, 2023

Why use VACUUM on Delta Lake?

VACUUM is used to clean up unused and stale data files that are taking up unnecessary storage space. Removing these files can help reduce storage costs. 

When you run VACUUM on a Delta table it removes the following files from the underlying file system:

  • Any data files that are not maintained by Delta Lake
  • Stale data files (files that are no longer referenced by a Delta table) that are older than seven days

Info

VACUUM does NOT remove directories that begin with an underscore, such as _delta_log.

When should you run VACUUM?

When you run VACUUM it removes stale data files. This does not impact regular work, but it can limit your ability to time travel (AWS | Azure | GCP).

The default configuration for a Delta table allows you to time travel 30 days into the past. However, to do this, the underlying data files must be present.

The default configuration for VACUUM deletes stale data files that are older than seven days. As a result, if you run VACUUM with the default settings, you will only be able to time travel seven days into the past, from the time you run VACUUM.

If you do not need to time travel more than seven days into the past, you can run VACUUM on a daily basis.

Running VACUUM daily helps keep storage costs in check, especially for larger tables. You can also run VACUUM on-demand if you notice a sudden surge in the storage costs for a specific Delta table. 
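
For reference, here is a minimal sketch of what an on-demand VACUUM run looks like in a notebook. The <table-path> placeholder and the RETAIN value are illustrative; the default seven day retention applies if no RETAIN clause is given.

    %python
    
    # Remove stale data files older than the default retention threshold (seven days).
    # Replace <table-path> with the actual table path location.
    spark.sql("VACUUM delta.`<table-path>`")
    
    # Or keep more history by specifying an explicit retention window (example value).
    spark.sql("VACUUM delta.`<table-path>` RETAIN 240 HOURS")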

Issues you may face with VACUUM

  • No progress update: You may not know how far VACUUM has progressed, especially when it has been running for a long time. You may not know how many files have been successfully removed and how many files remain.
  • Poor run performance: VACUUM runs for a long time, especially when tables are huge and/or when tables are a source for high-frequency input streams.

Mitigate issues with VACUUM

No progress update

If VACUUM completes within an hour or two, there is no need to troubleshoot. However, if VACUUM runs for longer than two hours (this can happen on large tables when VACUUM hasn’t been run recently), you may want to check its progress. In this case, you can run VACUUM with the DRY RUN option before and after the actual VACUUM to monitor progress and identify the number of files deleted.

  1. Run VACUUM DRY RUN to determine the number of files eligible for deletion. Replace <table-path> with the actual table path location.
    %python
    
    spark.sql("VACUUM delta.`<table-path>` DRY RUN")
    The DRY RUN option tells VACUUM it should not delete any files. Instead, DRY RUN prints the number of files and directories that are safe to be deleted. The intention in this step is not to delete the files, but to know the number of files eligible for deletion.

    The example DRY RUN command returns output that tells you how many files and directories are safe to delete.
    Found x files and directories in a total of y directories that are safe to delete.
    You should record the number of files identified as safe to delete.

  2. Run VACUUM.

  3. Cancel VACUUM after one hour.

  4. Run VACUUM with DRY RUN again.

  5. The second DRY RUN command identifies the number of outstanding files that can be safely deleted.

  6. Subtract the outstanding number of files (second DRY RUN) from the original number of files to get the number of files that were deleted.
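
If it helps to keep track of the numbers, the steps above can be sketched as notebook commands. The commands are run at different points in time, and the file counts are placeholders you read from the printed DRY RUN output.

    %python
    
    # Sketch of the DRY RUN comparison described above. These commands are run at
    # different points in time; they are shown together here only for reference.
    spark.sql("VACUUM delta.`<table-path>` DRY RUN")  # step 1: note the reported file count
    spark.sql("VACUUM delta.`<table-path>`")          # step 2: start the actual VACUUM
    # step 3: cancel VACUUM after one hour, then re-run the DRY RUN (step 4)
    spark.sql("VACUUM delta.`<table-path>` DRY RUN")
    
    # steps 5-6: subtract the outstanding count from the original count (example values)
    files_before = 100000  # from the first DRY RUN output
    files_after = 60000    # from the second DRY RUN output
    print(f"Files deleted so far: {files_before - files_after}")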

Info

You can also review your storage bucket information in your cloud portal to identify the remaining number of files existing in the bucket, or the number of deletion requests issued, to determine how far the deletion has progressed.

Poor run performance 

This can be mitigated by following VACUUM best practices.

Avoid actions that hamper performance

Avoid over-partitioned data folders

  • Over-partitioned data can result in a lot of small files. You should avoid partitioning on a high cardinality column. When you over-partition data, even running OPTIMIZE can have issues compacting small files, as compaction does not happen across partition directories.
  • File deletion speed is directly dependent on the number of files. Over-partitioning data can hamper the performance of VACUUM.

Info

You should partition on a low cardinality column and z-order on a high cardinality column.
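
As a rough illustration of that guidance, the sketch below compacts small files and Z-orders on a high cardinality column. The path and column name are placeholders; Z-ordering is most useful on columns you commonly filter on.

    %python
    
    # Illustrative only: compact small files and Z-order by a high cardinality column.
    # Replace <table-path> and <high-cardinality-column> with real values.
    spark.sql("OPTIMIZE delta.`<table-path>` ZORDER BY (<high-cardinality-column>)")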

Avoid concurrent runs

  • When running VACUUM on a large table, avoid concurrent runs (including dry runs).
  • Avoid running other operations on the same location to avoid file system level throttling. Other operations can compete for the same bandwidth.

Avoid cloud versioning

  • Since Delta Lake maintains version history, you should avoid using cloud version control mechanisms, like S3 versioning on AWS.
  • Using cloud version controls in addition to Delta Lake can result in additional storage costs and performance degradation.

Actions to improve performance

Enable autoOptimize/autoCompaction

  • Run OPTIMIZE to eliminate small files. When you combine OPTIMIZE with regular VACUUM runs you ensure the number of stale data files (and the associated storage cost) is minimized.
  • Review the documentation on autoOptimize and autoCompaction (AWS | Azure | GCP) for more information.
  • Review the documentation on OPTIMIZE (AWS | Azure | GCP) for more information.
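
For reference, one way to turn these features on for an existing table is through table properties. This is a sketch assuming the standard Databricks auto optimize property names; <table-name> is a placeholder, and the caveat about active writes in the note below applies.

    %python
    
    # Sketch: enable optimized writes and auto compaction on an existing Delta table.
    # Ensure there are no active writes on the table before changing these properties.
    spark.sql("""
      ALTER TABLE <table-name> SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
      )
    """)
    
    # Periodic manual compaction also works if you prefer scheduled OPTIMIZE runs.
    spark.sql("OPTIMIZE <table-name>")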

Info

Before you modify table properties, you must ensure there are no active writes happening on the table.

Use Databricks Runtime 10.4 LTS or above and additional driver cores (Azure and GCP only)

  • On Azure and GCP, VACUUM performs the deletion in parallel on the driver when using Databricks Runtime 10.4 LTS or above. The higher the number of driver cores, the more the operation can be parallelized.

Info

On AWS deletes happen in batches and the process is single threaded. AWS uses a bulk delete API and deletes in batches of 1000, but it doesn’t use parallel threads. As a result, using a multi-core driver may not help on AWS.

Use Databricks Runtime 11.1 or above on AWS

  • Databricks Runtime 11.1 and above set the checkpoint creation interval to 100, instead of 10. As a result, fewer checkpoint files are created. With fewer checkpoint files to index, listing the transaction log directory is faster. This reduces the Delta log size and improves the VACUUM listing time. It also decreases the checkpoint storage size.
  • If you are using Databricks Runtime 10.4 LTS on AWS and cannot update to a newer runtime, you can manually set the table property with delta.checkpointInterval=100. This creates checkpoint files for every 100 commits, instead of every 10 commits.
    %sql
    
    ALTER TABLE <delta-table-name> SET TBLPROPERTIES ('delta.checkpointInterval' = 100)

Info

Reducing the number of checkpoints on Databricks Runtime 10.4 LTS may degrade table query/read performance, though in most cases the difference should be negligible. Before you modify table properties, you must ensure there are no active writes happening on the table.

Use compute optimized instances

  • Since VACUUM is compute intensive, you should use compute optimized instances.
    • On AWS use C5 series worker types.
    • On Azure use F series worker types.
    • On GCP use C2 series worker types.

Use auto-scaling clusters

  • Before performing file deletion, the VACUUM command lists the files. File listing happens in parallel by leveraging the workers in the cluster. Having more workers in the cluster can help with the initial listing of files. The higher the number of workers, the faster the file listing process.
  • Additional workers are NOT needed for file deletion. This is why you should use an auto-scaling cluster with multiple workers. Once the file listing completes, the cluster can scale down and use the driver for the file deletion. This saves cluster costs.

Review the documentation on how to enable and configure autoscaling (AWS | Azure | GCP) for more information.

Set a longer trigger interval for streaming jobs

  • Use a trigger interval of 120 seconds or more for streaming jobs that write to Delta tables. You can adjust this based on your needs.
    // ProcessingTime trigger with a 120 second micro-batch interval
    import org.apache.spark.sql.streaming.Trigger
    
    df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("120 seconds"))
      .start()
  • The longer the trigger interval, the bigger the data files. The bigger the data files, the fewer files there are in total. The fewer files there are, the less time it takes to delete them. As a result, future VACUUM attempts run faster.

Reduce log retention

  • If you do not need to time travel far into the past, you can reduce log retention to seven days. This reduces the number of JSON files and thereby reduces the listing time. This also reduces the delta log size.
  • The delta.logRetentionDuration property configures how long you can go back in time. The default value is 30 days. You need to use ALTER TABLE to modify existing property values.
    %sql
    
    ALTER TABLE <table-name>
    SET TBLPROPERTIES ('delta.logRetentionDuration'='7 days')

Info

Before you modify table properties, you must ensure there are no active writes happening on the table.

Run VACUUM daily

  • If you reduce log retention to seven days (thereby limiting time travel to seven days) you can run VACUUM on a daily basis.
  • This deletes stale data files that are older than seven days, every day. This is a good way to avoid accumulating stale data files and reduce your storage costs.

Warning

If the Delta table is the source for a streaming query, and if the streaming query falls behind by more than seven days, then the streaming query will not be able to correctly read the table as it will be looking for data that has already been deleted. You should only run a daily VACUUM if you know that all queries will never ask for data that is more than seven days old.

  • After testing and verification on a small table, you can schedule VACUUM to run every day via a job.
  • Schedule VACUUM to run on a job cluster, instead of running it manually on all-purpose clusters, which may cost more.
  • Use an auto-scaling cluster when configuring the job to save costs.
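
A minimal sketch of a notebook that could be scheduled as that daily job is shown below. The table names are hypothetical; test on a small table first, then schedule the notebook on an auto-scaling job cluster.

    %python
    
    # Hypothetical daily maintenance notebook: VACUUM a fixed list of Delta tables.
    # Schedule this notebook as a Databricks job on an auto-scaling job cluster.
    tables_to_vacuum = ["sales.orders", "sales.order_items"]  # example table names
    
    for table in tables_to_vacuum:
        # The default retention (seven days) applies; add RETAIN ... HOURS for more history.
        spark.sql(f"VACUUM {table}")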

Summary

To improve VACUUM performance:

  • Avoid over-partitioned directories
  • Avoid concurrent runs (during VACUUM)
  • Avoid enabling cloud storage file versioning
  • Run a periodic OPTIMIZE command and/or enable autoCompaction/autoOptimize on the Delta table
  • Use a current Databricks Runtime
  • Use auto-scaling clusters with compute optimized worker types

In addition, if your application allows for it:

  • Increase the trigger interval of any streaming jobs that write to your Delta table
  • Reduce the log retention duration of the Delta table
  • Perform a periodic VACUUM

These additional steps further increase VACUUM performance and can also help reduce storage costs.