Enabling Dynamic Allocation leads to NODES_LOST scenario

Enable autoscaling when you create a Databricks cluster.

Written by MuthuLakshmi.AN

Last published at: January 29th, 2025

Problem

When you enable dynamic allocation by setting spark.dynamicAllocation.enabled to true, you experience unexpected NODES_LOST scenarios.
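
To confirm the setting is active on your cluster, you can read it back from a notebook. The following is a minimal sketch, assuming a Databricks Python notebook where the spark session is predefined:

# Check whether dynamic allocation is enabled on the current cluster.
# On Databricks, this property should not be set; a value of "true"
# indicates the configuration that triggers the NODES_LOST scenario.
enabled = spark.conf.get("spark.dynamicAllocation.enabled", "false")
print(f"spark.dynamicAllocation.enabled = {enabled}")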

In your event logs, you see the following message.

EVENT TYPE : NODES LOST
MESSAGE : Compute lost at least one node. Reason: Communication lost

In your backend cluster logs, you see the following error message.

TerminateInstances worker_env_id: "workerenv-XXXXXXXXXXX"
instance_ids: "i-XXXXXX"
instance_termination_reason_code: LOST_EXECUTOR_DETECTED

Cause

Dynamic allocation is an Apache Spark feature designed for YARN-managed clusters and is not supported in Databricks environments. Databricks clusters are instead managed by Databricks autoscaling.

Solution

Enable autoscaling when you create a Databricks cluster. For more information, review the “Enable autoscaling” section of the Compute configuration reference (AWS | Azure | GCP) documentation.
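
If you manage compute programmatically, you can also enable autoscaling by specifying an autoscale range instead of a fixed worker count when creating the cluster. The following Python sketch uses the Databricks Clusters API; the workspace URL, access token, spark_version, and node_type_id values are placeholders you must replace:

import requests

# Create a cluster with Databricks autoscaling enabled. Instead of a
# fixed num_workers, the autoscale block sets a min/max worker range.
resp = requests.post(
    "https://<workspace-url>/api/2.1/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "autoscaling-cluster",
        "spark_version": "<spark-version>",  # placeholder
        "node_type_id": "<node-type-id>",    # placeholder
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])

With an autoscale range, Databricks adds and removes workers based on load, which replaces the executor scaling that spark.dynamicAllocation.enabled provides on YARN.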