You add data to a Delta table, but the data disappears without warning. There is no obvious error message.
This can happen when spark.databricks.delta.retentionDurationCheck.enabled is set to false and VACUUM is configured to retain 0 hours:

VACUUM <name-of-delta-table> RETAIN 0 HOURS

When VACUUM is configured to retain 0 hours, it can delete any file that is not part of the version being vacuumed. This includes committed files, uncommitted files, and temporary files for concurrent transactions.
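The effect of a 0-hour retention window can be sketched as a simple eligibility rule: a file is a deletion candidate if it is not referenced by the version being vacuumed and is older than the retention cutoff. A minimal sketch in plain Python (the function name and file-listing structure are hypothetical illustrations, not Delta Lake's actual implementation):

```python
from datetime import datetime, timedelta

def files_vacuum_would_delete(all_files, referenced_files, now, retain_hours):
    """Hypothetical sketch of VACUUM's candidate selection.

    A file is a deletion candidate if it is NOT referenced by the
    version being vacuumed AND its modification time is older than
    now - retain_hours. With retain_hours=0 the cutoff is 'now', so
    every unreferenced file -- including a file just written by a
    concurrent commit -- becomes a candidate.
    """
    cutoff = now - timedelta(hours=retain_hours)
    return [
        name
        for name, modified_at in all_files.items()
        if name not in referenced_files and modified_at < cutoff
    ]

# Files on disk: one referenced by the vacuumed version, one just
# written by a concurrent transaction (hypothetical timestamps).
all_files = {
    "old-part.parquet": datetime(2024, 1, 1, 0, 30),
    "sample-data-part-0-1-2.parquet": datetime(2024, 1, 1, 1, 18),
}
referenced = {"old-part.parquet"}
now = datetime(2024, 1, 1, 1, 20)

# RETAIN 0 HOURS deletes the freshly committed file...
print(files_vacuum_would_delete(all_files, referenced, now, 0))
# ...while the recommended 7-day (168-hour) window keeps it.
print(files_vacuum_would_delete(all_files, referenced, now, 168))
```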
Consider the following example timeline:
- VACUUM starts running at 01:17 UTC on version 100.
- A data file named sample-data-part-0-1-2.parquet is added to version 101 at 01:18 UTC.
- Version 101 is committed at 01:19 UTC.
- VACUUM, still running against the version 100 snapshot, deletes sample-data-part-0-1-2.parquet at 01:20 UTC.
- VACUUM completes at 01:22 UTC.
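The timeline can be checked with a little date arithmetic: with a 0-hour window, the cutoff at 01:20 UTC is 01:20 itself, so a file written two minutes earlier is already past the cutoff. A hypothetical illustration (dates invented; this is not VACUUM's actual code):

```python
from datetime import datetime, timedelta

# Timestamps from the example timeline (the date itself is hypothetical).
file_added = datetime(2024, 1, 1, 1, 18)     # sample-data-part-0-1-2.parquet written
vacuum_checks = datetime(2024, 1, 1, 1, 20)  # VACUUM evaluates the file

# RETAIN 0 HOURS: the cutoff is the evaluation time itself.
cutoff_0h = vacuum_checks - timedelta(hours=0)
print(file_added < cutoff_0h)   # True  -> eligible for deletion

# With the recommended 7-day (168-hour) retention, the cutoff is a week back.
cutoff_7d = vacuum_checks - timedelta(hours=168)
print(file_added < cutoff_7d)   # False -> retained
```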
In this example, VACUUM executed on version 100 and deleted everything that was added in version 101.
- Databricks recommends that you set the VACUUM retention interval to at least 7 days, because old snapshots and uncommitted files can still be in use by concurrent readers or writers to the table.
- Do not set spark.databricks.delta.retentionDurationCheck.enabled to false in your Spark config.
- If you do set spark.databricks.delta.retentionDurationCheck.enabled to false in your Spark config, you must choose a retention interval that is longer than both the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table.
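That lower bound can be expressed as simple arithmetic. A hypothetical helper (the function name and the safety margin are illustrative assumptions, not a Databricks formula):

```python
def minimum_safe_retention_hours(longest_txn_hours, longest_stream_lag_hours,
                                 margin_hours=1.0):
    """Hypothetical helper: the retention interval must exceed both the
    longest-running concurrent transaction and the longest period any
    stream can lag behind the table head; a margin is added so that
    'longer than' holds strictly."""
    return max(longest_txn_hours, longest_stream_lag_hours) + margin_hours

# Example: a 36-hour batch job and a stream that can lag 48 hours
# still fit comfortably inside the recommended 7-day (168-hour) window.
needed = minimum_safe_retention_hours(36, 48)
print(needed)         # 49.0
print(needed <= 168)  # True
```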
Review the Databricks VACUUM documentation for more information.