Vacuuming with zero retention results in data loss

Problem

You add data to a Delta table, but the data disappears without warning. There is no obvious error message.

Cause

This can happen when spark.databricks.delta.retentionDurationCheck.enabled is set to false and VACUUM is configured to retain 0 hours.

VACUUM <name-of-delta-table> RETAIN 0 HOURS

When VACUUM is configured to retain 0 hours, it can delete any file that is not part of the version being vacuumed. This includes committed files, uncommitted files, and temporary files for concurrent transactions.
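
For reference, the following Spark SQL sketch shows the combination that produces this behavior. The SET statement disables the safety check for the current session, and <name-of-delta-table> is the same placeholder used above.

-- Disables the check that normally rejects retention intervals shorter than 7 days.
SET spark.databricks.delta.retentionDurationCheck.enabled = false

-- With the check disabled, this command is accepted and removes unreferenced files immediately.
VACUUM <name-of-delta-table> RETAIN 0 HOURS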

Consider the following example timeline:

  • VACUUM starts running at 01:17 UTC on version 100.
  • A data file named sample-data-part-0-1-2.parquet is added to version 101 at 01:18 UTC.
  • Version 101 is committed at 01:19 UTC.
  • VACUUM is still running against version 100 and deletes sample-data-part-0-1-2.parquet at 01:20 UTC, because the file is not part of version 100.
  • VACUUM completes at 01:22 UTC.

In this example, VACUUM executed against version 100 and deleted the files that were added in version 101.
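
If you suspect this has happened, reviewing the table history can show whether a write overlapped with a VACUUM run. On recent Databricks Runtime versions, VACUUM START and VACUUM END events are typically recorded in the history alongside WRITE operations, each with a timestamp. The command below uses the same placeholder table name as above.

DESCRIBE HISTORY <name-of-delta-table>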

Solution

  • Databricks recommends that you set the VACUUM retention interval to at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers of the table. A safe configuration is sketched in the example after this list.
  • Do not set spark.databricks.delta.retentionDurationCheck.enabled to false in your Spark config.
  • If you do set spark.databricks.delta.retentionDurationCheck.enabled to false in your Spark config, you must choose a retention interval that is longer than both the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
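
The following is a minimal sketch of the recommended configuration, using the same placeholder table name as above: keep the retention check enabled and give VACUUM an explicit retention interval of at least 7 days (168 hours).

-- Keep the safety check enabled (this is the default).
SET spark.databricks.delta.retentionDurationCheck.enabled = true

-- Retain at least 7 days (168 hours) of history before removing files.
VACUUM <name-of-delta-table> RETAIN 168 HOURS

Increase the interval beyond 168 hours if you have concurrent transactions that run longer or streams that can lag further behind the latest update to the table.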

Review the Databricks VACUUM documentation for more information.