Important
This article applies to compute on GKE. Compute on GKE is moving to GCE in 2025. You should update your permissions for GCE compute deployment as soon as possible.
Problem
You have long-running streaming jobs that fail with a Cluster terminated by WORKSPACE_UPDATE error message.
[RunExecutionError] Cluster 0824-184221-zlg0oon6 was terminated during the run (cluster state message: Cluster terminated by WORKSPACE_UPDATE)
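If you collect run failure messages in your own monitoring, it can help to separate these restarts from genuine job errors. The following is a minimal sketch, assuming failure messages follow the same layout as the sample above. The function names are hypothetical and the pattern is derived only from that one sample, so other message layouts are not covered.

```python
import re
from typing import Optional, Tuple

# Pattern modeled on the sample message in this article (assumption:
# other failure messages may use a different layout and will not match).
_TERMINATION_RE = re.compile(
    r"Cluster (?P<cluster_id>\S+) was terminated during the run "
    r"\(cluster state message: Cluster terminated by (?P<reason>\w+)\)"
)


def parse_termination(message: str) -> Optional[Tuple[str, str]]:
    """Return (cluster_id, reason) if the message matches, else None."""
    match = _TERMINATION_RE.search(message)
    if match is None:
        return None
    return match.group("cluster_id"), match.group("reason")


def is_workspace_update(message: str) -> bool:
    """True when the run failed because of a WORKSPACE_UPDATE termination."""
    parsed = parse_termination(message)
    return parsed is not None and parsed[1] == "WORKSPACE_UPDATE"
```

A match on WORKSPACE_UPDATE indicates the restart described in this article rather than a failure in the job code itself.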
Cause
Any cluster that runs for longer than 25 days is considered a long-running cluster. Long-running clusters may be restarted to apply container image updates. This applies to all customers on GCP.
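To see which clusters fall under this definition, you can compare each cluster's start time against the 25-day threshold. The sketch below assumes the start time is available as a Unix timestamp in milliseconds. The function and constant names are hypothetical, and how you obtain the start time is left to your own tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from this article: clusters running longer than 25 days.
LONG_RUNNING_THRESHOLD = timedelta(days=25)


def is_long_running(start_time_ms: int, now: Optional[datetime] = None) -> bool:
    """Return True if the cluster has been running for longer than 25 days.

    start_time_ms is assumed to be a Unix timestamp in milliseconds (UTC).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    started = datetime.fromtimestamp(start_time_ms / 1000, tz=timezone.utc)
    return now - started > LONG_RUNNING_THRESHOLD
```

A cluster that has run for exactly 25 days is not flagged, because the definition says "longer than 25 days".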
These restarts happen when the GKE node pool is upgraded. The entire GKE cluster is not rebuilt during an upgrade. Instead, new node pools are created with the desired OS version. These node pool upgrades include critical security updates and CVE patches.
All new clusters launch on the new node pools, but clusters that are already running cannot be moved to them. Those clusters must be restarted to pick up the new images.
Info
Long-running clusters are only restarted if they are running an unsupported image.
Solution
Enable the continuous job setting if you plan on using long-running clusters.
The continuous job setting configures the job with an unlimited retry policy, so if the cluster needs to be restarted, the job continues where it left off. This setting allows only a single running instance of the job at a time. If the job fails many times in a row, the continuous job setting uses exponential backoff to restart the job.
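If you define jobs programmatically rather than in the UI, the setting can be expressed in the job definition. The following is a minimal sketch, not a definitive implementation. It assumes the Jobs API accepts a `continuous` block with a `pause_status` field in the job settings, so verify this against the current API reference before use. The task contents in the usage note are hypothetical. The `backoff_delays` helper only illustrates what exponential backoff means. The base delay and cap are made-up values, because this article does not state the schedule the platform actually uses.

```python
from typing import Any, Dict, List


def continuous_job_settings(name: str, tasks: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build job settings with the continuous option enabled.

    Assumption: the job settings accept a "continuous" block whose
    "pause_status" is "UNPAUSED" when the job should run continuously.
    """
    return {
        "name": name,
        "tasks": tasks,
        "continuous": {"pause_status": "UNPAUSED"},
    }


def backoff_delays(
    failures: int, base_seconds: float = 60.0, cap_seconds: float = 3600.0
) -> List[float]:
    """Illustrate exponential backoff: the delay doubles after each failure.

    base_seconds and cap_seconds are hypothetical values for illustration,
    not the schedule used by the continuous job setting.
    """
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(failures)]
```

For example, `continuous_job_settings("stream-ingest", [{"task_key": "ingest"}])` returns a dictionary you could send as the body of a job create or update request, where the job name and task key are placeholders. Because a continuous job keeps one run active, it does not need a separate schedule or retry configuration.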
For more information, review the Run jobs continuously documentation.