Best practices for dropping a managed Delta Lake table

Learn the best practices for dropping a managed Delta Lake table.

Written by Adam Pavlacka

Last published at: May 10th, 2022

Regardless of how you drop a managed table, the operation can take a significant amount of time, depending on the data size. Delta Lake managed tables in particular contain a lot of metadata in the form of transaction logs, and they can contain duplicate data files left over from earlier table versions. If a Delta table has been in use for a long time, it can accumulate a very large amount of data.
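
To gauge how much a table has accumulated before you drop it, DESCRIBE DETAIL reports the table's current file count and on-disk size. A minimal check, assuming a table named events (the same example table used later in this article):

    -- Returns one row of table metadata, including numFiles and sizeInBytes,
    -- which give a rough sense of how long a drop is likely to take.
    DESCRIBE DETAIL events;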

In the Databricks environment, there are two ways to drop tables (AWS | Azure | GCP):

  • Run DROP TABLE in a notebook cell (a minimal example follows this list).
  • Click Delete in the UI.
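
For the notebook route, the drop is a single SQL statement. A minimal sketch, reusing the example table events from later in this article:

    -- Drop the managed table; IF EXISTS avoids an error if it is already gone.
    -- For a managed table, this removes both the table definition and the
    -- underlying data files.
    DROP TABLE IF EXISTS events;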

Even though tables can be deleted in the background without affecting running workloads, you should always run DELETE FROM (AWS | Azure | GCP) and VACUUM (AWS | Azure | GCP) before you start a drop command on any table. This ensures that the table's metadata and underlying data files are cleaned up before you initiate the actual data deletion.

For example, if you are trying to delete the Delta table events, run the following commands before you start the DROP TABLE command:

  1. Run DELETE FROM:

     DELETE FROM events

  2. Run VACUUM with an interval of zero:

     VACUUM events RETAIN 0 HOURS

These two steps reduce the amount of metadata and the number of uncommitted files that would otherwise increase the data deletion time.
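
One caveat: by default, Delta Lake blocks any VACUUM with a retention interval shorter than the configured safety threshold (seven days by default), so the zero-hour VACUUM in step 2 requires temporarily disabling the retention duration check. Putting everything together, a minimal end-to-end sketch for the events example, assuming no concurrent readers or writers still need the table's files:

    -- Temporarily allow VACUUM with a retention interval below the safety
    -- threshold. Only do this when nothing else is reading the table.
    SET spark.databricks.delta.retentionDurationCheck.enabled = false;

    -- Step 1: delete all rows; in Delta Lake this is largely a metadata operation.
    DELETE FROM events;

    -- Step 2: physically remove every data file no longer referenced by the table.
    VACUUM events RETAIN 0 HOURS;

    -- Restore the safety check for the rest of the session.
    SET spark.databricks.delta.retentionDurationCheck.enabled = true;

    -- Finally, drop the now-empty table.
    DROP TABLE events;

Because the table is empty and its stale files have already been vacuumed, the final DROP TABLE has far less data left to remove.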