Delta Live Tables pipelines are not running VACUUM automatically

You must have a maintenance cluster defined for VACUUM to run automatically.

Written by priyanka.biswas

Last published at: February 2nd, 2023

Problem

Delta Live Tables supports automatic VACUUM by default. You set up a Delta Live Tables pipeline, but notice that VACUUM is not running automatically.

Cause

A Delta Live Tables pipeline needs a separate maintenance cluster configuration (AWS | Azure | GCP) in the pipeline settings for VACUUM to run automatically. If the maintenance cluster is not specified in the pipeline JSON file, or if the maintenance cluster does not have access to your storage location, VACUUM does not run.

Example configuration

In this example Delta Live Tables pipeline JSON file, the default label identifies the configuration for the default cluster. The file should also contain a maintenance label that identifies the configuration for the maintenance cluster.

Since the maintenance cluster configuration is not present, VACUUM does not automatically run.

AWS

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "c5.4xlarge",
      "driver_node_type_id": "c5.4xlarge",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.io.parquet.nativeReader.enabled": "false"
      },
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:..."
      }
    }
  ]
}

Azure

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "Standard_D3_v2",
      "driver_node_type_id": "Standard_D3_v2",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.io.parquet.nativeReader.enabled": "false"
      }
    }
  ]
}

GCP

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "n1-standard-4",
      "driver_node_type_id": "n1-standard-4",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.io.parquet.nativeReader.enabled": "false"
      }
    }
  ]
}

Solution

Configure a maintenance cluster in the Delta Live Tables pipeline JSON file.

You have to specify configurations for two different cluster types:

  • A default cluster where all processing is performed.
  • A maintenance cluster where daily maintenance tasks are run. 

Each cluster is identified using the label field.

The maintenance cluster is responsible for performing VACUUM and other maintenance tasks.

AWS

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "<instance-type>",
      "driver_node_type_id": "<instance-type>",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.io.parquet.nativeReader.enabled": "false"
      },
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:..."
      }
    },
    {
      "label": "maintenance",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:..."
      }
    }
  ]
}

Info

If the maintenance cluster requires access to storage via an instance profile, you need to specify it with instance_profile_arn.

Azure

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "Standard_D3_v2",
      "driver_node_type_id": "Standard_D3_v2",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.io.parquet.nativeReader.enabled": "false"
      }
    },
    {
      "label": "maintenance"
    }
  ]
}

Info

If you need to use Azure Data Lake Storage credential passthrough, or another configuration to access your storage location, specify it for both the default cluster and the maintenance cluster.
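For example, a storage access setting can be repeated in the spark_conf block of both cluster definitions. The following is a minimal sketch only; the spark.databricks.passthrough.enabled setting is shown as an assumed example of a credential passthrough configuration, and your pipeline may require different settings for your storage location.

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "Standard_D3_v2",
      "driver_node_type_id": "Standard_D3_v2",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.passthrough.enabled": "true"
      }
    },
    {
      "label": "maintenance",
      "spark_conf": {
        "spark.databricks.passthrough.enabled": "true"
      }
    }
  ]
}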

GCP

{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "n1-standard-4",
      "driver_node_type_id": "n1-standard-4",
      "num_workers": 20,
      "spark_conf": {
        "spark.databricks.io.parquet.nativeReader.enabled": "false"
      }
    },
    {
      "label": "maintenance"
    }
  ]
}

Info

When using cluster policies to configure Delta Live Tables clusters, you should apply a single policy to both the default and maintenance clusters.
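For example, the same policy can be referenced from both cluster definitions with the policy_id field. The following is a minimal sketch; <policy-id> is a placeholder for your cluster policy ID, and the remaining cluster settings are omitted for brevity.

{
  "clusters": [
    {
      "label": "default",
      "policy_id": "<policy-id>"
    },
    {
      "label": "maintenance",
      "policy_id": "<policy-id>"
    }
  ]
}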
