Vacuuming with zero retention results in data loss
Problem
You add data to a Delta table, but the data disappears without warning. There is no obvious error message.
Cause
This can happen when spark.databricks.delta.retentionDurationCheck.enabled is set to false and VACUUM is configured to retain 0 hours.
VACUUM <name-of-delta-table> RETAIN 0 HOURS
When VACUUM is configured to retain 0 hours, it can delete any file that is not part of the table version being vacuumed. This includes committed files, uncommitted files, and temporary files for concurrent transactions.
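For reference, the unsafe combination described above looks like the following sketch. The table name my_table is a placeholder, not part of the original example.

-- Disables the safety check that normally blocks short retention intervals
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
-- Deletes every data file not referenced by the current table version
VACUUM my_table RETAIN 0 HOURS;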
Consider the following example timeline:
- VACUUM starts running at 01:17 UTC on version 100.
- A data file named sample-data-part-0-1-2.parquet is added to version 101 at 01:18 UTC.
- Version 101 is committed at 01:19 UTC.
- VACUUM is still running and deletes sample-data-part-0-1-2.parquet at 01:20 UTC.
- VACUUM completes at 01:22 UTC.
In this example, VACUUM executed on version 100 and deleted everything that was added to version 101.
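If you suspect this race occurred on your table, one way to check is to review the table history, which records VACUUM START and VACUUM END operations alongside write commits. This is a sketch; my_table is a placeholder name.

-- Shows recent operations with timestamps, including VACUUM START, VACUUM END, and WRITE commits
DESCRIBE HISTORY my_table;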
Solution
- Databricks recommends that you set the VACUUM retention interval to at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table (see the example after this list).
- Do not set spark.databricks.delta.retentionDurationCheck.enabled to false in your Spark config.
- If you do set spark.databricks.delta.retentionDurationCheck.enabled to false, you must choose a retention interval that is longer than the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
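As a sketch of the safer configuration, keep the retention check enabled and pass an explicit interval of at least 7 days (168 hours, which matches the default retention). The table name my_table is a placeholder.

-- Keep the safety check enabled (this is the default)
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
-- Retain at least 7 days (168 hours) of files so concurrent readers and writers are not affected
VACUUM my_table RETAIN 168 HOURS;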
Review the Databricks VACUUM documentation for more information.