Notebooks or jobs on a cluster usually run successfully, but occasionally the driver stops working and one of the following error messages is displayed:
Driver is temporarily unavailable.
The spark driver has stopped unexpectedly and is restarting.
Lost connection to cluster. The notebook may have been detached.
If you check the cluster event logs, you'll find the Driver_Not_Responding event with a message related to garbage collection (GC):
Driver is up but is not responsive, likely due to GC.
If you check the Ganglia metrics when the issue occurs, you will notice that the driver node is experiencing a high load (for example, it is shown in orange or red).
To get the driver’s IP so that you can filter on it in the Ganglia metrics dashboard, you can navigate to the cluster’s Spark cluster UI > Master tab and get the IP of the driver (the Spark Master) from the first line: `Spark Master at spark://x.x.x.x:port`.
One common cause of this error is a memory bottleneck on the driver. When this happens, the driver either crashes with an out-of-memory (OOM) condition and is restarted, or becomes unresponsive because of frequent full garbage collection. The memory bottleneck can have any of the following causes:
- The driver instance type is not optimal for the load executed on the driver.
- There are memory-intensive operations executed on the driver.
- There are many notebooks or jobs running in parallel on the same cluster.
The solution varies from case to case. The easiest way to resolve the issue in the absence of specific details is to increase the driver memory. You can increase driver memory simply by upgrading the driver node type on the cluster edit page in your Databricks workspace.
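If you manage clusters through the Clusters REST API rather than the cluster edit page, the same change amounts to setting `driver_node_type_id` in the cluster specification that is sent to the edit endpoint. The sketch below only builds the updated specification and sends nothing. It assumes you have already fetched the current specification; the cluster ID, runtime version, and node type names are hypothetical placeholders.

```python
import copy
import json


def with_larger_driver(cluster_spec: dict, driver_node_type_id: str) -> dict:
    """Return a copy of a cluster spec with a different driver node type.

    Worker settings are left untouched, so only the driver changes.
    """
    updated = copy.deepcopy(cluster_spec)
    updated["driver_node_type_id"] = driver_node_type_id
    return updated


# Hypothetical spec, for example as returned by the clusters "get" endpoint.
current_spec = {
    "cluster_id": "1234-567890-abcde123",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
}

# "i3.2xlarge" stands in for any node type with more memory than the current driver.
new_spec = with_larger_driver(current_spec, "i3.2xlarge")
print(json.dumps(new_spec, indent=2))
```

The resulting dictionary would be the JSON body of a clusters edit request. Applying an edit to a running cluster typically restarts it, so schedule the change for a quiet period.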
Other points to consider:
- Avoid memory-intensive operations such as:
  - The `collect()` operator, which brings a large amount of data to the driver.
  - Conversion of a large DataFrame to a Pandas DataFrame using the `toPandas()` function.
- If these operations are essential, ensure that enough driver memory is available; otherwise, look for alternatives that can parallelize the execution of your code. For example, use Spark instead of Pandas for data processing, and Spark ML instead of regular Python machine-learning libraries (for example, scikit-learn).
- Avoid running batch jobs on a shared interactive cluster.
- Distribute the workloads across different clusters. No matter how big the cluster is, the work of the Spark driver cannot be distributed within a cluster.
- Restart the interactive cluster periodically (for example, daily) during periods of low load to clear out any objects left in memory from previous runs. You can use the cluster's restart REST API endpoint with your preferred automation tool to automate this.
- Run the specific notebook in isolation on a cluster to evaluate exactly how much memory is required to execute the notebook successfully.
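The advice above on `collect()` and `toPandas()` can be sketched as follows. This is a minimal illustration rather than a definitive pattern: the helper names, the row cap, and the `category` column are assumptions made for the example. The idea is to truncate or aggregate on the executors so that only a small result reaches the driver.

```python
def preview_as_pandas(df, max_rows=1000):
    """Bring at most max_rows rows to the driver instead of the whole DataFrame."""
    # limit() is applied before toPandas(), so the driver receives at most max_rows rows.
    return df.limit(max_rows).toPandas()


def counts_by_key(df, key):
    """Aggregate on the executors and collect only the per-key summary."""
    # The collected result has one row per distinct key, not one per input row.
    return {row[key]: row["count"] for row in df.groupBy(key).count().collect()}


def demo(spark):
    """Exercise the helpers on hypothetical data; pass in an existing SparkSession."""
    # One million rows spread over three categories.
    df = spark.range(1_000_000).selectExpr("id", "id % 3 AS category")
    print(preview_as_pandas(df, max_rows=5))
    print(counts_by_key(df, "category"))
```

In a Databricks notebook, where a session named `spark` already exists, you could call `demo(spark)`. Note that `counts_by_key` is only safe when the key has a modest number of distinct values; for a high-cardinality key, write the aggregated result to storage instead of collecting it.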
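The periodic restart mentioned above can be automated with a small script run by your scheduler. The following sketch builds the request for the clusters restart endpoint using only the Python standard library. The workspace URL, token, and cluster ID are placeholders, and the function name is hypothetical; the request is built but not sent.

```python
import json
import urllib.request


def build_restart_request(workspace_url: str, token: str, cluster_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that restarts a cluster."""
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/clusters/restart",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; substitute your own workspace URL, token, and cluster ID.
request = build_restart_request(
    "https://example.cloud.databricks.com",
    "<personal-access-token>",
    "1234-567890-abcde123",
)
print(request.get_method(), request.full_url)

# To actually trigger the restart, send the request from your scheduler:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes the script easy to dry-run before you wire it into a nightly schedule.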