Cluster fails to initialize after a Databricks Runtime upgrade

Check your init scripts and then your Apache Spark configurations.

Written by walter.camacho

Last published at: January 30th, 2025

Problem

After running a Databricks Runtime upgrade on a cluster, you receive an initialization failure with the following error message.

 

Spark error: Spark encountered an error on startup. This issue can be caused by invalid Spark configurations or malfunctioning init scripts. Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists. Internal error message: Spark error: Driver down cause: driver state change (exit code: 134)

 

Cause

The upgraded Databricks Runtime may support different Apache Spark configurations than the version you were using before. For example, spark.shuffle.spill is deprecated in Databricks Runtime 16.1.
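As a quick check, the following is a minimal sketch you can run in a notebook attached to a cluster that does start (for example, the cluster on its previous runtime version). It lists the Spark configurations set on the cluster so you can compare them against the release notes for the new version. The spark.shuffle.spill key is used only as the example named above; the variable names are illustrative.

```python
# Minimal sketch: list the Spark configurations set on this cluster so they
# can be compared against the new runtime's release notes. The `spark`
# object is the SparkSession predefined in Databricks notebooks.
explicitly_set = dict(spark.sparkContext.getConf().getAll())

for key in sorted(explicitly_set):
    print(key, "=", explicitly_set[key])

# Check a specific key; spark.shuffle.spill is just the example from this
# article. Replace it with any configuration you set on the cluster.
key_to_check = "spark.shuffle.spill"
print(key_to_check, "->", explicitly_set.get(key_to_check, "<not set on this cluster>"))
```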

There may also be issues with the init scripts configured on the cluster under the updated Databricks Runtime, such as corrupted files, unsupported library dependencies, or inaccessible file paths.

 

Solution

Start by checking the init script. 

  1. Remove the init script from your cluster configuration.
  2. Restart the cluster. 
  3. If the cluster starts without the init script, the init script is the source of the problem and needs further investigation.
    1. First, determine whether the file is reachable. Whether the init script is stored as a workspace file, in a Unity Catalog volume, or in the Databricks File System (DBFS), make sure the file path still exists and the permissions are properly set. (A sketch of this check follows the list.)
    2. If the file is reachable, review the init script content to check for dependency conflicts among the libraries it installs. You can either attempt to install the same libraries from a notebook on a cluster with the same configuration, or look at your driver logs for error messages indicating library issues.
  4. If the init script worked previously with an older Databricks Runtime version, test whether reverting to that previous Databricks Runtime version works as expected.
    1. If the init script works on the older version, there may be configurations or dependencies in the script that are not compatible with the current Databricks Runtime.
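The following is a minimal sketch of the reachability check in step 3.1, run from a notebook on the cluster after the init script has been removed. The volume path shown is purely illustrative; substitute your own init script location (for example a dbfs:/ or /Workspace/ path).

```python
# Minimal sketch: verify that the init script path still exists and is
# readable, then inspect its first bytes for obvious problems such as
# hard-coded library versions. The path below is illustrative only.
init_script_path = "/Volumes/main/default/scripts/my-init.sh"

try:
    dbutils.fs.ls(init_script_path)
    print("Init script is reachable:", init_script_path)
    # Print the beginning of the script to review its contents.
    print(dbutils.fs.head(init_script_path, 1024))
except Exception as e:
    print("Init script is not reachable:", e)
```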

If the cluster does not start after removing the init script, the issue is related to Spark configuration.

To check your Spark configuration: 

  1. Review the driver logs for errors in the log4j output.
    1. Entries that reference SparkExecuteStatementOperation in particular typically indicate which Spark configuration module is failing.
  2. Remove the failing Spark configuration.
  3. Restart the cluster.
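If the failing cluster has cluster log delivery configured, you can also search the delivered driver logs from a notebook on another running cluster. The sketch below assumes the delivered logs follow a <destination>/<cluster-id>/driver layout; the destination path and cluster ID shown are placeholders, not real values.

```python
# Minimal sketch: scan delivered driver log4j files for errors. Replace the
# destination with your configured log delivery path and the cluster ID with
# the ID of the failing cluster.
log_destination = "dbfs:/cluster-logs"
cluster_id = "0130-123456-abcdefgh"
driver_log_dir = f"{log_destination}/{cluster_id}/driver"

for f in dbutils.fs.ls(driver_log_dir):
    # Skip compressed, rotated logs; only read the plain-text log4j files.
    if "log4j" in f.name and not f.name.endswith(".gz"):
        text = dbutils.fs.head(f.path, 200 * 1024)
        for line in text.splitlines():
            if "ERROR" in line or "SparkExecuteStatementOperation" in line:
                print(line)
```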

You can also remove all of your Spark configurations and add them back individually to identify which one causes the failure.

 

For more information about configurations, dependencies, and changes in a given Databricks Runtime, review the Databricks Runtime release notes versions and compatibility (AWS, Azure, GCP) documentation.