Problem
Delta Live Tables supports automatic VACUUM by default. You set up a Delta Live Tables pipeline, but notice that VACUUM is not running automatically.
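You can confirm the symptom by checking the Delta history of one of the pipeline's tables for VACUUM operations. The following is a minimal sketch for a Databricks notebook, where spark is predefined; <catalog>.<schema>.<table> is a placeholder for one of your pipeline's tables.

# Minimal sketch: look for VACUUM operations in a table's Delta history.
# <catalog>.<schema>.<table> is a placeholder; replace it with one of the
# tables your pipeline publishes.
history = spark.sql("DESCRIBE HISTORY <catalog>.<schema>.<table>")
history.filter("operation LIKE 'VACUUM%'").select("timestamp", "operation").show(truncate=False)

If no VACUUM START or VACUUM END entries appear, maintenance is not running for that table.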
Cause
A Delta Live Tables pipeline needs a separate maintenance cluster configuration (AWS | Azure | GCP) in its pipeline settings for VACUUM to run automatically. If the maintenance cluster is not specified in the pipeline JSON file, or if the maintenance cluster does not have access to your storage location, VACUUM does not run.
Example configuration
This example Delta Live Tables pipeline JSON file contains a default label, which identifies the configuration for the default cluster. It should also contain a maintenance label identifying the configuration for the maintenance cluster.
Because the maintenance cluster configuration is not present, VACUUM does not run automatically.
AWS
{ "clusters": [ { "label": "default", "node_type_id": "c5.4xlarge", "driver_node_type_id": "c5.4xlarge", "num_workers": 20, "spark_conf": { "spark.databricks.io.parquet.nativeReader.enabled": "false" }, "aws_attributes": { "instance_profile_arn": "arn:aws:..." } } ] }Delete
Azure
{ "clusters": [ { "label": "default", "node_type_id": "Standard_D3_v2", "driver_node_type_id": "Standard_D3_v2", "num_workers": 20, "spark_conf": { "spark.databricks.io.parquet.nativeReader.enabled": "false" } } ] }Delete
GCP
{ "clusters": [ { "label": "default", "node_type_id": "n1-standard-4", "driver_node_type_id": "n1-standard-4", "num_workers": 20, "spark_conf": { "spark.databricks.io.parquet.nativeReader.enabled": "false" } } ] }Delete
Solution
Configure a maintenance cluster in the Delta Live Tables pipeline JSON file.
You must specify configurations for two different cluster types:
- A default cluster where all processing is performed.
- A maintenance cluster where daily maintenance tasks are run.
Each cluster is identified using the label field.
The maintenance cluster is responsible for performing VACUUM and other maintenance tasks.
AWS
{ "clusters": [ { "label": "default", "node_type_id": "<instance-type>", "driver_node_type_id": "<instance-type>", "num_workers": 20, "spark_conf": { "spark.databricks.io.parquet.nativeReader.enabled": "false" }, "aws_attributes": { "instance_profile_arn": "arn:aws:..." } }, { "label": "maintenance", "aws_attributes": { "instance_profile_arn": "arn:aws:..." } } ] }Delete
Azure
{ "clusters": [ { "label": "default", "node_type_id": "Standard_D3_v2", "driver_node_type_id": "Standard_D3_v2", "num_workers": 20, "spark_conf": { "spark.databricks.io.parquet.nativeReader.enabled": "false" } }, { "label": "maintenance" } ] }Delete
GCP
{ "clusters": [ { "label": "default", "node_type_id": "n1-standard-4", "driver_node_type_id": "n1-standard-4", "num_workers": 20, "spark_conf": { "spark.databricks.io.parquet.nativeReader.enabled": "false" } }, { "label": "maintenance" } ] }Delete