Long-running jobs are terminated with a WORKSPACE_UPDATE error

Enable continuous jobs for long-running workloads.

Last published at: January 14th, 2025

Important

This article applies to compute on GKE. Compute on GKE is moving to GCE in 2025. You should update your permissions for GCE compute deployment as soon as possible.

Problem

You have long-running streaming jobs that fail with a "Cluster terminated by WORKSPACE_UPDATE" error message.

[RunExecutionError] Cluster 0824-184221-zlg0oon6 was terminated during the run (cluster state message: Cluster terminated by WORKSPACE_UPDATE)
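
If you collect run errors in your own monitoring, a small helper can separate these terminations from other failures. The following is a minimal sketch: the pattern is derived only from the example message above, and the function names are hypothetical. Other message layouts are not covered.

```python
import re

# Pattern based solely on the example message in this article (an assumption);
# adjust it if your messages are laid out differently.
_TERMINATION_PATTERN = re.compile(
    r"Cluster (?P<cluster_id>\S+) was terminated during the run "
    r"\(cluster state message: Cluster terminated by (?P<reason>[A-Z_]+)\)"
)


def parse_termination(message):
    """Return (cluster_id, reason) if the message matches the pattern, else None."""
    match = _TERMINATION_PATTERN.search(message)
    if match is None:
        return None
    return match.group("cluster_id"), match.group("reason")


def is_workspace_update(message):
    """True only when the termination reason is WORKSPACE_UPDATE."""
    parsed = parse_termination(message)
    return parsed is not None and parsed[1] == "WORKSPACE_UPDATE"


example = (
    "[RunExecutionError] Cluster 0824-184221-zlg0oon6 was terminated during the run "
    "(cluster state message: Cluster terminated by WORKSPACE_UPDATE)"
)
print(parse_termination(example))
```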

Cause

Any cluster that runs for longer than 25 days is considered a long-running cluster. These clusters may be restarted to apply container image updates. This applies to all customers on GCP.
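
To see which of your clusters fall under this definition, you can compare each cluster's start time against the 25-day threshold. This is a minimal sketch: the function name is hypothetical, and obtaining the start time (for example, from your own inventory or an API response) is left to you.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from this article: longer than 25 days counts as long-running.
LONG_RUNNING_THRESHOLD = timedelta(days=25)


def is_long_running(start_time, now=None):
    """Return True if the cluster has been up for longer than 25 days.

    start_time and now must both be timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - start_time > LONG_RUNNING_THRESHOLD
```

A cluster that passes this check is a candidate for restart, but as noted in this article, restarts occur only when the cluster is running an unsupported image.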

These restarts happen when the GKE node pool is upgraded. The entire GKE cluster is not rebuilt during an upgrade. New node pools are created with the desired OS version. These node pool upgrades include critical security updates and CVE patches.

All new clusters launch on the new node pools, but clusters that are already running cannot be moved to them. Those clusters must be restarted to pick up the new images.

Info

Long-running clusters are restarted only if they are running an unsupported image.

Solution

Enable the continuous job setting if you plan to use long-running clusters.

The continuous job setting configures the job with an unlimited retry policy, so if the cluster is restarted, the job restarts and continues where it left off. Only one instance of the job runs at a time. If the job fails many times in a row, the continuous job setting uses exponential backoff to restart the job.
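
If you manage job definitions as JSON rather than through the UI, the setting can be added to the job settings payload. This is a minimal sketch that assumes the `continuous` block and its `pause_status` field as used by the Jobs API 2.1. Confirm the field names against the Run jobs continuously documentation before relying on it. The function name and the sample settings are hypothetical.

```python
import copy
import json


def enable_continuous(job_settings):
    """Return a copy of the job settings with continuous mode turned on.

    Assumption: the Jobs API accepts a "continuous" object whose
    "pause_status" is "UNPAUSED" (active) or "PAUSED".
    """
    updated = copy.deepcopy(job_settings)
    updated["continuous"] = {"pause_status": "UNPAUSED"}
    return updated


# Hypothetical job settings, for illustration only.
settings = {
    "name": "example-streaming-job",
    "tasks": [
        {
            "task_key": "stream",
            "notebook_task": {"notebook_path": "/Workspace/example/stream"},
        }
    ],
}

print(json.dumps(enable_continuous(settings), indent=2))
```

The helper returns a copy, so the original definition stays unchanged and the two versions can be compared before you submit the update.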

For more information, review the Run jobs continuously documentation.
