Problem
After enabling the Apache Spark dynamic allocation configuration on a cluster, you encounter a NODES_LOST error when upsizing the cluster. The error message typically appears as follows.
Message: Compute lost at least one node. Reason: Communication lost Help: Communication with at least one worker node was unexpectedly lost. This issue can occur because of instance malfunction or network unavailability. Please retry and contact Databricks if the problem persists.
Cause
The NODES_LOST error indicates communication loss with one or more worker nodes.
Enabling the Spark dynamic allocation configuration can conflict with Databricks' autoscaling mechanism, leading to node loss. Because Databricks autoscaling already manages the size of Databricks clusters, Spark dynamic allocation should not be configured.
Solution
- Navigate to your cluster and click to open the settings.
- Scroll down to Advanced options and click to expand.
- Under the Spark tab, find the spark.dynamicAllocation.enabled true configuration and remove it.
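If you keep a copy of the cluster's Spark config text outside the UI, the removal step above can be sketched in Python. This is a minimal illustration, not a Databricks API: the function name remove_dynamic_allocation and the sample config are hypothetical, and it assumes the config uses one space-separated key/value pair per line, as in the example setting above.

```python
def remove_dynamic_allocation(spark_conf_text: str) -> str:
    """Return the config text without the spark.dynamicAllocation.enabled line.

    Hypothetical helper: assumes one "key value" pair per line.
    """
    kept = []
    for line in spark_conf_text.splitlines():
        parts = line.split(None, 1)  # split the key from the rest of the line
        key = parts[0] if parts else ""
        if key == "spark.dynamicAllocation.enabled":
            continue  # drop the setting that conflicts with autoscaling
        kept.append(line)
    return "\n".join(kept)


# Sample config text (illustrative values only).
example = """spark.dynamicAllocation.enabled true
spark.sql.shuffle.partitions 200"""

cleaned = remove_dynamic_allocation(example)
print(cleaned)
```

The helper only edits text. The cleaned config still has to be saved in the cluster settings for the change to take effect.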
For more information regarding Spark configuration, review the Compute configuration reference (AWS | Azure | GCP) documentation.