Problem: Cluster Failed to Launch

This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs.

Cluster timeout

Error messages:

Driver failed to start in time

INTERNAL_ERROR: The Spark driver failed to start within 300 seconds

Cluster failed to be healthy within 200 seconds

Cause

The cluster can fail to launch if it connects to an external Hive metastore and tries to download all the Hive metastore libraries from a Maven repository. The cluster downloads almost 200 JAR files, including dependencies. If the Databricks cluster manager cannot confirm that the driver is ready within 5 minutes, the cluster launch fails. This can happen when the JAR downloads take too long.

Solution

Store the Hive libraries in DBFS and access them locally from the DBFS location. See Spark Options.
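As a rough illustration of the solution above, the following sketch builds the Spark configuration entries that point the metastore client at JARs already copied to DBFS. `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` are standard Spark SQL options (the latter accepts `builtin`, `maven`, or a classpath). The directory `/dbfs/hive_metastore_jars/` and the version `2.3.9` are assumptions chosen for the example, and the helper name `hive_metastore_conf` is hypothetical.

```python
def hive_metastore_conf(jar_dir: str, version: str) -> dict:
    """Build Spark conf entries that load Hive metastore JARs from a local path."""
    return {
        "spark.sql.hive.metastore.version": version,
        # Classpath-style value: every JAR in the directory is picked up,
        # so nothing is fetched from Maven at cluster start.
        "spark.sql.hive.metastore.jars": jar_dir.rstrip("/") + "/*",
    }


# Assumed location and version; adjust both to match your metastore.
conf = hive_metastore_conf("/dbfs/hive_metastore_jars/", "2.3.9")
for key, value in conf.items():
    print(key, value)
```

The printed key/value pairs could be pasted into the cluster's Spark configuration. Setting the JARs option to `maven` instead is what triggers the download described in the cause.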

Global or cluster-specific init scripts

Error message:

The cluster could not be started in 50 minutes. Cause: Timed out with exception after <xxx> attempts

Cause

Init scripts that run during the cluster spin-up stage send an RPC (remote procedure call) to each worker machine to run the scripts locally. All RPCs must return their status before the process continues. If any RPC hits an issue and does not respond (because of a transient networking issue, for example), the 1-hour timeout can be reached, which causes the cluster setup job to fail.

Solution

Use a cluster-scoped init script instead of global or cluster-named init scripts. With cluster-scoped init scripts, Databricks does not use synchronous blocking of RPCs to fetch init script execution status.
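A minimal sketch of how a cluster-scoped init script might be attached to a cluster specification, assuming the script has already been uploaded to DBFS. The `init_scripts` list with a `dbfs.destination` entry follows the shape used in Clusters API requests; the script path, the cluster name, and the helper name `with_init_script` are hypothetical.

```python
import json


def with_init_script(base_spec: dict, script_path: str) -> dict:
    """Return a copy of a cluster spec with one more cluster-scoped init script."""
    spec = dict(base_spec)
    # Copy the list so the caller's spec is not modified.
    scripts = list(spec.get("init_scripts", []))
    scripts.append({"dbfs": {"destination": script_path}})
    spec["init_scripts"] = scripts
    return spec


# Hypothetical cluster name and script location, for illustration only.
base = {"cluster_name": "example-cluster"}
spec = with_init_script(base, "dbfs:/databricks/init-scripts/install-deps.sh")
print(json.dumps(spec, indent=2))
```

Because the script is referenced from the cluster's own specification, it applies only to that cluster rather than to every cluster in the workspace.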

Too many libraries installed in cluster UI

Error message:

Library installation timed out after 1800 seconds. Libraries that are not yet installed:

Cause

This is usually an intermittent failure caused by network problems.

Solution

Usually you can fix this problem by re-running the job or restarting the cluster.

The library installer is configured to time out after 30 minutes (the 1800 seconds reported in the error message). While it fetches and installs JARs, a timeout can occur due to network problems. To mitigate this issue, you can download the libraries from Maven to a DBFS location and install them from there.
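The mitigation above can be sketched as a small download helper. This is an illustration under stated assumptions, not a Databricks-provided tool: it assumes the library is hosted on Maven Central, which uses the standard `group/artifact/version/artifact-version.jar` repository layout, and the function names and the target directory in the usage note are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(coordinate: str, repo: str = MAVEN_CENTRAL) -> str:
    """Turn a 'group:artifact:version' coordinate into the JAR download URL."""
    group, artifact, version = coordinate.split(":")
    group_path = group.replace(".", "/")
    return f"{repo}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"


def download_jar(coordinate: str, target_dir: str) -> Path:
    """Download one JAR into target_dir and return the path of the saved file."""
    url = maven_jar_url(coordinate)
    target = Path(target_dir) / url.rsplit("/", 1)[1]
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding the JAR in memory.
    with urllib.request.urlopen(url, timeout=60) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


print(maven_jar_url("org.apache.commons:commons-lang3:3.12.0"))
```

On a cluster, a call such as `download_jar("org.apache.commons:commons-lang3:3.12.0", "/dbfs/FileStore/jars")` would place the JAR under an assumed DBFS directory, from which it can be installed as a library. Transitive dependencies are not resolved by this sketch and would need to be downloaded separately.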

Cloud provider limit

Error message:

Cluster terminated. Reason: Cloud Provider Limit

Cause

This error is usually returned by the cloud provider.

Solution

See the cloud provider error information in cluster unexpected termination.

Cloud provider shutdown

Error message:

Cluster terminated. Reason: Cloud Provider Shutdown

Cause

This error is usually returned by the cloud provider.

Solution

See the cloud provider error information in cluster unexpected termination.
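Since each scenario in this article is identified by an error message in the logs, a small lookup can help triage a log line. The sketch below uses only the message fragments quoted in this article; the `classify` helper and the idea of matching by substring are illustrative assumptions, not part of any Databricks tooling.

```python
# (message fragment, scenario name) pairs taken from the sections above.
SCENARIOS = [
    ("Driver failed to start in time", "Cluster timeout"),
    ("The Spark driver failed to start within", "Cluster timeout"),
    ("Cluster failed to be healthy within", "Cluster timeout"),
    ("The cluster could not be started in", "Global or cluster-specific init scripts"),
    ("Library installation timed out after", "Too many libraries installed in cluster UI"),
    ("Cloud Provider Limit", "Cloud provider limit"),
    ("Cloud Provider Shutdown", "Cloud provider shutdown"),
]


def classify(log_line: str):
    """Return the matching scenario name, or None if the line is unrecognized."""
    for fragment, scenario in SCENARIOS:
        if fragment in log_line:
            return scenario
    return None


print(classify("Cluster terminated. Reason: Cloud Provider Limit"))
```

A line that matches none of the fragments returns `None`, which signals that the failure is outside the scenarios covered here.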