Problem
You add data to a Delta table, but the data disappears without warning. There is no obvious error message.
Cause
This can happen when spark.databricks.delta.retentionDurationCheck.enabled is set to false and VACUUM is configured to retain 0 hours.
%sql VACUUM <name-of-delta-table> RETAIN 0 HOURS
OR
%sql VACUUM delta.`<delta_table_path>` RETAIN 0 HOURS
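For reference, a VACUUM with a zero-hour retention interval only runs if the safety check has been disabled first. The sketch below shows the unsafe combination; the table name events is a placeholder.

```sql
%sql
-- Disabling this check allows retention intervals shorter than the 7-day default.
-- With the check left enabled (the default), VACUUM ... RETAIN 0 HOURS fails with
-- an error instead of deleting files.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- With the check disabled, this removes every file that is not referenced by the
-- current table version, including files still needed by concurrent transactions.
VACUUM events RETAIN 0 HOURS;
```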
When VACUUM is configured to retain 0 hours, it can delete any file that is not part of the table version being vacuumed. This includes files committed in later versions, uncommitted files, and temporary files written by concurrent transactions.
Consider the following example timeline:
- VACUUM starts running at 01:17 UTC on version 100.
- A data file named part-<xxxx-xxxx-xxxx-xxxx-xxxx-xxxx.xxx>.snappy.parquet is added to version 101 at 01:18 UTC.
- Version 101 is committed at 01:19 UTC.
- The still-running VACUUM deletes the data file part-<xxxx-xxxx-xxxx-xxxx-xxxx-xxxx.xxx>.snappy.parquet (added in version 101) at 01:20 UTC.
- VACUUM completes at 01:22 UTC.
In this example, VACUUM ran against version 100, so it deleted the data that was added in version 101.
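If you suspect this has happened, the table history can help you line commit timestamps up against the time the VACUUM ran. A minimal sketch, assuming a table named events; VACUUM START and VACUUM END entries appear in the history only on runtimes that log them.

```sql
%sql
-- Each row shows a table version, its commit timestamp, and the operation that
-- produced it (WRITE, MERGE, and, where logged, VACUUM START / VACUUM END).
DESCRIBE HISTORY events LIMIT 20;
```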
Solution
- Databricks recommends that you set the VACUUM retention interval to at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table. A safe configuration is sketched after this list.
- Do not set spark.databricks.delta.retentionDurationCheck.enabled to false in your Spark config.
- If you do set spark.databricks.delta.retentionDurationCheck.enabled to false in your Spark config, you must choose a retention interval that is longer than both the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
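The sketch below shows a safe configuration: the safety check stays enabled and the retention window is kept at 7 days or longer. The table name events is a placeholder; delta.deletedFileRetentionDuration is the standard Delta table property that controls how long removed files are kept before VACUUM can delete them.

```sql
%sql
-- Leave spark.databricks.delta.retentionDurationCheck.enabled at its default (true).

-- Vacuum with at least the default 7-day (168-hour) retention window.
VACUUM events RETAIN 168 HOURS;

-- Optionally, set the table-level retention so that VACUUM without a RETAIN
-- clause uses a 7-day threshold.
ALTER TABLE events
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days');
```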
Review the Databricks VACUUM documentation (AWS | Azure | GCP) for more information.