Problem
Your all-purpose and job clusters intermittently fail with a DRIVER_EVICTION error message in your GCP workspaces.
Cluster <cluster-id> was terminated. Reason: DRIVER_EVICTION (CLIENT_ERROR). Parameters: databricks_error_message:driver-xxxxxxxxxx-xxxxx for cluster <cluster-id> was killed by underlying node
You may also see cluster failures when attempting to install libraries, or cluster timeouts.
Library installation failed for library due to infra fault. Error messages: Failed due to cluster termination, cluster was in state: Terminating
Could not reach driver of cluster <cluster-id> for 120 seconds.
Cause
When you create a cluster with the Preemptible instances option selected in the Worker type section, the cluster configuration includes the PREEMPTIBLE_WITH_FALLBACK_GCP option.
This means that the driver can be stopped at any time if Compute Engine requires those resources for other VMs. For more information, review the Compute Engine Preemptible VM instances documentation.
If your cluster is preempted, it can result in a DRIVER_EVICTION error.
This error can also be caused by infrastructure issues, such as a problem with Google Kubernetes Engine.
Solution
To avoid termination due to preemption, use an on-demand instance, at least for the driver.
Although preemptible instances can be an economical choice for workloads when completion time is not a factor, they always run the risk of being terminated by the cloud provider.
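As a rough illustration, the following Python sketch checks whether a cluster specification requests preemptible instances and returns a copy that requests on-demand instances instead. It assumes the specification is a dictionary shaped like a Clusters API payload, with the availability setting stored under gcp_attributes.availability. The helper names and the example cluster name are hypothetical, and the sketch switches the whole cluster to on-demand rather than the driver alone.

```python
import copy

# Availability values that let Compute Engine reclaim the underlying VMs.
PREEMPTIBLE_AVAILABILITY = {"PREEMPTIBLE_GCP", "PREEMPTIBLE_WITH_FALLBACK_GCP"}


def uses_preemptible(cluster_spec: dict) -> bool:
    """Return True if the spec requests preemptible GCP instances."""
    gcp_attributes = cluster_spec.get("gcp_attributes") or {}
    return gcp_attributes.get("availability") in PREEMPTIBLE_AVAILABILITY


def with_on_demand(cluster_spec: dict) -> dict:
    """Return a copy of the spec that requests on-demand instances only."""
    updated = copy.deepcopy(cluster_spec)
    gcp_attributes = updated.get("gcp_attributes") or {}
    gcp_attributes["availability"] = "ON_DEMAND_GCP"
    updated["gcp_attributes"] = gcp_attributes
    return updated


# Hypothetical spec, for illustration only.
spec = {
    "cluster_name": "example-cluster",
    "gcp_attributes": {"availability": "PREEMPTIBLE_WITH_FALLBACK_GCP"},
}

if uses_preemptible(spec):
    fixed = with_on_demand(spec)
    print(fixed["gcp_attributes"]["availability"])
```

The original dictionary is left unchanged, so the updated copy can be reviewed before it is submitted through whichever tool you use to edit cluster configuration.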