NODES_LOST error during cluster upsizing when Apache Spark dynamic allocation is enabled

Remove the spark.dynamicAllocation.enabled Spark config from the compute configuration.

Written by Gihyeon Lee

Last published at: February 28th, 2025

Problem

After enabling the Apache Spark dynamic allocation configuration on a cluster, you encounter a NODES_LOST error when upsizing the cluster. The error message typically appears as follows.

Message: Compute lost at least one node.
Reason: Communication lost
Help: Communication with at least one worker node was unexpectedly lost. This issue can occur because of instance malfunction or network unavailability. Please retry and contact Databricks if the problem persists.


Cause

The NODES_LOST error indicates communication loss with one or more worker nodes. 

Enabling the Spark dynamic allocation configuration conflicts with the Databricks autoscaling mechanism, which can lead to node loss. Because Databricks clusters are scaled by Databricks autoscaling, Spark dynamic allocation should not be configured on them.
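
If you are not sure whether dynamic allocation is active on a cluster, you can check its effective Spark configuration from a notebook attached to that cluster. The following is a minimal sketch that assumes a Databricks notebook, where a `spark` session is already defined; it only reads the configuration and does not change it.

```python
# Minimal check from a Databricks notebook: read the cluster's Spark config.
# `spark` (a SparkSession) is already available in Databricks notebooks.
conf = spark.sparkContext.getConf()

# Returns "false" if the key is not set on this cluster.
dynamic_allocation = conf.get("spark.dynamicAllocation.enabled", "false")

if dynamic_allocation.lower() == "true":
    print("spark.dynamicAllocation.enabled is set; remove it from the cluster's Spark config.")
else:
    print("Spark dynamic allocation is not set on this cluster.")
```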


Solution

  1. Navigate to your cluster and click to open its settings.
  2. Scroll down to Advanced options and click to expand them.
  3. Under the Spark tab, locate the spark.dynamicAllocation.enabled true line in the Spark config field and remove it. For a programmatic way to check whether a cluster sets this value, see the sketch after this list.
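
If you manage many clusters, you can also audit Spark configs programmatically. The sketch below is a minimal example that uses the Databricks Clusters REST API (`/api/2.0/clusters/get`) with the Python `requests` library; the workspace URL, access token, and cluster ID are placeholders you would supply. Removing the setting itself still happens through the cluster settings UI (or the clusters/edit API), as described above.

```python
import requests

# Placeholders - replace with your workspace URL, a personal access token,
# and the ID of the cluster you want to inspect.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

# Fetch the cluster specification, which includes its spark_conf map.
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()

spark_conf = resp.json().get("spark_conf", {})
if "spark.dynamicAllocation.enabled" in spark_conf:
    print("Cluster sets spark.dynamicAllocation.enabled; remove it from the Spark config.")
else:
    print("Cluster does not set spark.dynamicAllocation.enabled.")
```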

For more information regarding Spark configuration, review the Compute configuration reference (AWS | Azure | GCP) documentation.