Problem
You are trying to migrate to serverless compute, but the Apache Spark delta.retentionDurationCheck property is not taking effect.
For example, the following code snippet does not work on serverless compute:
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled=false")
spark.sql("VACUUM <catalog>.<schema>.<table-name> RETAIN 24 HOURS")
You want to maintain the retention rules that you have defined in your notebook and use serverless compute.
Cause
Databricks serverless compute does not support certain Spark properties, including spark.databricks.delta.retentionDurationCheck.enabled.
Serverless architecture is designed to optimize resource usage and scalability, but it restricts certain configurations that are available with standard compute.
You cannot set these properties directly in a serverless environment, which is a challenge when you need to maintain specific retention rules for Delta tables.
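If you want to confirm this behavior in your own workspace, the following is a minimal sketch, assuming that setting an unsupported property on serverless compute raises an exception (the exact error may differ):

# Sketch: check whether a Spark property can be set on the current compute
# before a job relies on it. Assumes an unsupported property raises an
# exception when set on serverless compute.
property_name = "spark.databricks.delta.retentionDurationCheck.enabled"
try:
    spark.conf.set(property_name, "false")
    print(f"{property_name} was set successfully (classic compute behavior).")
except Exception as e:
    print(f"Could not set {property_name} on this compute: {e}")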
Solution
Use VACUUM with a retention period at or above the default instead of disabling the retention check
For example, if you want to use the recommended retention period of 7 days, use the following command in your notebook:
spark.sql("VACUUM <catalog>.<schema>.<table-name> RETAIN 168 HOURS") # 168 hours = 7 days
For more information, review the Remove unused data files with vacuum (AWS | Azure) documentation.
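If you maintain the same retention rule for several tables, a small notebook cell such as the following keeps the retention explicit and at or above the default so the check never triggers. This is a sketch; the table names are placeholders.

# Sketch: apply the same retention period to several tables (placeholder names).
tables = [
    "<catalog>.<schema>.<table-one>",
    "<catalog>.<schema>.<table-two>",
]
retention_hours = 168  # 7 days, at or above the default retention
for table in tables:
    spark.sql(f"VACUUM {table} RETAIN {retention_hours} HOURS")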
Use table properties to set a specific retention period, especially if you require a short duration
Use the following command in your notebook:
ALTER TABLE <catalog>.<schema>.<table-name> SET TBLPROPERTIES ('delta.deletedFileRetentionDuration'='interval 24 hours')
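Once the table property is set, a VACUUM at that shorter retention passes the retention check without any session-level configuration, because the check compares the RETAIN value against the table's delta.deletedFileRetentionDuration. The following is a minimal sketch combining both steps from a notebook; the table name is a placeholder.

# Sketch: set a 24-hour retention on the table, then VACUUM at that retention.
# The retention check validates RETAIN against the table property, so no
# session-level Spark configuration is needed. Table name is a placeholder.
spark.sql("""
  ALTER TABLE <catalog>.<schema>.<table-name>
  SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 24 hours')
""")
spark.sql("VACUUM <catalog>.<schema>.<table-name> RETAIN 24 HOURS")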
For a list of supported Spark configuration parameters in serverless clusters, review the Serverless compute release notes (AWS | Azure).
You can effectively manage retention rules in serverless compute without relying on an unsupported Spark configuration.