Job run fails with error message “Could not reach driver of cluster”

Increase the REPL launch timeout.

Written by rushali.kumari

Last published at: July 17th, 2025

Problem

When you run a job, it fails intermittently with the following error message.

Error: Run failed with error message: Could not reach driver of cluster <cluster-id>.

 

Cause

A high load on the driver node (CPU utilization approaching or at 100%) causes CPU thrashing. Under this load, Python REPLs (ipykernel) become sluggish or unresponsive during startup and cannot start within the expected timeout of 80 seconds. When a REPL does not start in time, the job run fails.

 

Troubleshoot your case 

The error "Could not reach driver of cluster <cluster-id>" can occur for several different reasons. Use the following troubleshooting steps to verify that the cause of your error matches the cause described in this KB article.

  1. Check whether the job runs multiple tasks concurrently, which can increase the load on the driver.
  2. Check whether the driver’s CPU and memory utilization were unusually high (approaching or at 100%) at the time of the failure.
  3. Look for the following error trace in the driver logs. This error indicates a REPL (Read-Eval-Print Loop) startup failure due to timeout, often caused by too many REPLs being created simultaneously.
Failed to start repl ReplId-<id>
com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: 
Unable to start python kernel for ReplId-<id>, kernel did not start within 80 seconds.
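
The log check in step 3 can be scripted. The following is a minimal sketch, assuming you have downloaded the driver log as plain text. The function name `find_repl_timeouts` and the sample log content are illustrative and not part of any Databricks API. The pattern is based only on the message wording shown in the trace above.

```python
import re

# Pattern based on the message shown in the trace above.
# Assumption: your driver log keeps this wording; adjust the pattern if it differs.
KERNEL_TIMEOUT = re.compile(
    r"Unable to start python kernel for (ReplId-\S+), "
    r"kernel did not start within (\d+) seconds"
)


def find_repl_timeouts(log_text):
    """Return (repl_id, timeout_seconds) for each kernel startup timeout found."""
    return [(m.group(1), int(m.group(2))) for m in KERNEL_TIMEOUT.finditer(log_text)]


if __name__ == "__main__":
    # Hypothetical sample log content, for illustration only.
    sample = (
        "Failed to start repl ReplId-abc123\n"
        "com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: \n"
        "Unable to start python kernel for ReplId-abc123, "
        "kernel did not start within 80 seconds.\n"
    )
    for repl_id, seconds in find_repl_timeouts(sample):
        print(f"{repl_id} timed out after {seconds} seconds")
```

If the function returns several matches with timestamps close together in your log, that is consistent with many REPLs being created at the same time.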

 

Solution

Increase the REPL launch timeout by setting the following Apache Spark configuration. This gives the REPL process more time to initialize, which helps prevent failures under high load.

spark.databricks.driver.ipykernel.launchTimeoutSeconds 300

 

For details on how to apply Spark configs, refer to the “Spark configuration” section of the Compute configuration reference (AWS | Azure | GCP) documentation.
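
If you manage compute through a JSON cluster specification rather than the UI, the same setting can be added to the specification's `spark_conf` map. The following is a minimal sketch under that assumption. The helper `with_launch_timeout` and the sample specification are hypothetical; only the configuration key and the value 300 come from this article.

```python
import copy

LAUNCH_TIMEOUT_KEY = "spark.databricks.driver.ipykernel.launchTimeoutSeconds"


def with_launch_timeout(cluster_spec, seconds=300):
    """Return a copy of cluster_spec with the REPL launch timeout set.

    Assumption: the spec stores Spark configs as string values under "spark_conf".
    """
    updated = copy.deepcopy(cluster_spec)
    updated.setdefault("spark_conf", {})[LAUNCH_TIMEOUT_KEY] = str(seconds)
    return updated


if __name__ == "__main__":
    spec = {"cluster_name": "example-job-cluster"}  # hypothetical spec
    print(with_launch_timeout(spec)["spark_conf"])
```

The helper copies the specification instead of modifying it in place, so the original can be kept for comparison or rollback.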

 

Preventative measures

A persistently overloaded driver is not a sustainable setup. For long-term stability and performance: 

  • Lower the number of simultaneous job submissions to prevent overwhelming the driver.
  • Consider switching to a driver with more CPU cores to better handle parallel REPL creation and workload.
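
If you trigger runs from a client-side script, one way to lower simultaneous submissions is to cap concurrency with a bounded worker pool. The sketch below assumes that setup. `submit_run` is a hypothetical placeholder for whatever call starts a run in your environment, and the limit of 2 is an arbitrary example value.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(submit_run, job_ids, max_concurrent=2):
    """Call submit_run(job_id) for each job, at most max_concurrent at a time.

    Results are returned in the same order as job_ids.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(submit_run, job_ids))


if __name__ == "__main__":
    # Hypothetical stand-in for a real submission call.
    def submit_run(job_id):
        return f"submitted {job_id}"

    print(run_with_limit(submit_run, ["job-a", "job-b", "job-c"]))
```

This limits concurrency only if `submit_run` blocks until the run finishes. If it returns as soon as the run is queued, the pool does not limit how many runs execute on the driver at once.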