Cluster fails with a DRIVER_EVICTION error

Ensure that driver instances are not running on preemptible nodes.

Written by saikumar.divvela

Last published at: October 1st, 2024

Problem 

Your all-purpose and job clusters intermittently fail with a DRIVER_EVICTION error message in your GCP workspaces.


Cluster <cluster-id> was terminated. Reason: DRIVER_EVICTION (CLIENT_ERROR). Parameters: databricks_error_message:driver-xxxxxxxxxx-xxxxx for cluster <cluster-id> was killed by underlying node


You may also see library installation failures or cluster timeouts, with error messages such as the following.


Library installation failed for library due to infra fault. Error messages: Failed due to cluster termination, cluster was in state: Terminating
Could not reach driver of cluster <cluster-id> for 120 seconds.
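
To confirm that a termination was caused by driver eviction, review the affected cluster's event log in the cluster UI. As an illustration only, the following minimal sketch lists termination events with the Databricks SDK for Python (databricks-sdk); the SDK, your authentication setup, and the cluster ID shown are assumptions, not part of this article.

# Minimal sketch, assuming the databricks-sdk package is installed and
# authentication is configured (for example DATABRICKS_HOST and DATABRICKS_TOKEN).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

# "0123-456789-abcdefgh" is a placeholder; use the affected cluster's ID.
for event in w.clusters.events(cluster_id="0123-456789-abcdefgh"):
    if event.type == compute.EventType.TERMINATING and event.details:
        # For the failures described above, the termination reason
        # code is DRIVER_EVICTION.
        print(event.timestamp, event.details.reason)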

Cause 

When you create a cluster with the Preemptible instances option selected in the Worker type section, the cluster configuration includes the PREEMPTIBLE_WITH_FALLBACK_GCP option. 


This means that the driver can be stopped at any time if Compute Engine requires those resources for other VMs. For more information, review the Compute Engine Preemptible VM instances documentation.


If the driver node is preempted, the cluster can terminate with a DRIVER_EVICTION error.


This error can also be caused by infrastructure issues, such as a problem with Google Kubernetes Engine.
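
To check whether an affected cluster was created with preemptible capacity, you can inspect its gcp_attributes. The sketch below uses the Databricks SDK for Python as an assumed tooling choice; the cluster ID is a placeholder.

# Minimal sketch, assuming the databricks-sdk package and configured authentication.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# "0123-456789-abcdefgh" is a placeholder; use the affected cluster's ID.
cluster = w.clusters.get(cluster_id="0123-456789-abcdefgh")
gcp = cluster.gcp_attributes
availability = gcp.availability if gcp else None

# PREEMPTIBLE_WITH_FALLBACK_GCP (or PREEMPTIBLE_GCP) means Compute Engine
# can reclaim the underlying nodes, including the driver, at any time.
print(f"Cluster availability: {availability}")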

Solution

To avoid preemption, use on-demand instances, at least for the driver node.
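
One way to apply this is to set on-demand availability in the cluster's gcp_attributes when creating the cluster through the API. The sketch below again assumes the Databricks SDK for Python; the cluster name, node type, Databricks Runtime version, and sizing are placeholders chosen for illustration, not recommendations.

# Minimal sketch, assuming the databricks-sdk package and configured authentication.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

cluster = w.clusters.create_and_wait(
    cluster_name="on-demand-driver-example",   # placeholder name
    spark_version="14.3.x-scala2.12",          # placeholder Databricks Runtime version
    node_type_id="n2-highmem-4",               # placeholder GCP node type
    num_workers=2,
    gcp_attributes=compute.GcpAttributes(
        # ON_DEMAND_GCP keeps the cluster off preemptible capacity,
        # so the driver is not reclaimed by Compute Engine.
        availability=compute.GcpAvailability.ON_DEMAND_GCP,
    ),
)
print(f"Created cluster {cluster.cluster_id}")

In the UI, clearing the Preemptible instances option when you configure the cluster has the same effect of placing nodes on on-demand capacity.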


Although preemptible instances can be an economical choice for workloads when completion time is not a factor, they always run the risk of being terminated by the cloud provider.