The delta.retentionDurationCheck property is not recognized when using serverless compute

Use VACUUM or table properties to handle retention instead.

Written by Rajeev kannan Thangaiah

Last published at: September 12th, 2024

Problem

You are trying to migrate to serverless compute, but you encounter an issue where the Apache Spark spark.databricks.delta.retentionDurationCheck.enabled property is not recognized.

For example, the following code snippet does not work on serverless compute:

spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled=false")
spark.sql("VACUUM <catalog>.<schema>.<table_name> RETAIN 24 HOURS")

You want to maintain the retention rules that you have defined in your notebook and use serverless compute.

Cause

Databricks serverless compute does not support certain Spark properties, including spark.databricks.delta.retentionDurationCheck.enabled.

Serverless architecture is designed to optimize resource usage and scalability, but it also restricts certain configurations that are available with standard compute.

You cannot set these properties directly in serverless environments, which can be a challenge when you need to maintain specific retention rules for Delta tables.
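If you want to confirm this behavior in your own notebook, you can attempt to set the property and handle the failure instead of letting it halt the run. The following is a minimal sketch; the exact exception raised by serverless compute can vary, so it catches a generic Exception:

# Attempt to set the unsupported property; report the failure instead of halting
try:
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
except Exception as e:
    print(f"Property cannot be set on this compute: {e}")

# Read back the effective value to confirm whether the setting took effect
print(spark.conf.get("spark.databricks.delta.retentionDurationCheck.enabled", "true"))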

Solution

Use VACUUM instead of disabling the retention check

For example, if you want to use the recommended retention period of 7 days, use the following command in your notebook:

spark.sql("VACUUM <catalog>.<schema>.<table_name> RETAIN 168 HOURS")  # 168 hours = 7 days

For more information, review the Remove unused data files with vacuum (AWS | Azure) documentation.

Use table properties to set a specific retention period, especially if you require a short duration

Use the following command in your notebook:

ALTER TABLE <catalog>.<schema>.<table_name> SET TBLPROPERTIES ('delta.deletedFileRetentionDuration'='interval 24 hours')
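Once the property is set, a VACUUM with a matching RETAIN duration passes the retention check, so no Spark configuration change is needed. As a minimal sketch using the same placeholder table name, you can confirm the property and then run the shorter vacuum:

# Confirm the retention property is set on the table
props = spark.sql("SHOW TBLPROPERTIES <catalog>.<schema>.<table_name>")
props.filter("key = 'delta.deletedFileRetentionDuration'").show(truncate=False)

# The 24-hour vacuum now passes the retention duration check
spark.sql("VACUUM <catalog>.<schema>.<table_name> RETAIN 24 HOURS")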

For a list of supported Spark configuration parameters in serverless compute, review the Serverless compute release notes (AWS | Azure).

With these approaches, you can effectively manage retention rules on serverless compute without relying on an unsupported Spark configuration.
