Apache Spark jobs failing due to stage failure when using spot instances in a cluster

Use on-demand nodes instead of spot instances.

Written by Vidhi Khaitan

Last published at: November 26th, 2024

Problem

When using spot instances in your cluster, your Apache Spark jobs fail due to stage failures, with an error message similar to the following:

"org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage 2923 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again."


Cause

Spot instances can be preempted, leading to the loss of nodes in the cluster. When a node is lost, the shuffle map stage that produced its data fails and is retried. Because that stage's output is indeterminate, Spark cannot roll back the ResultStage to re-process the input data, so the job fails.


Solution

Use on-demand nodes instead of spot instances. In the cluster configuration, navigate to the Advanced tab and move the slider all the way to the right to select on-demand nodes for the workers.
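
If you manage clusters programmatically rather than through the UI, the same composition can be applied with the Clusters API. The following is a minimal Python sketch rather than a definitive recipe: it assumes an AWS workspace, the Clusters API 2.0 clusters/edit endpoint, and placeholder values for the cluster ID, Spark version, and node type. Adjust the attribute names for your cloud provider and API version.

import os

import requests

# Sketch: pin all worker nodes to on-demand capacity (placeholder values below).
cluster_spec = {
    "cluster_id": "<cluster-id>",          # placeholder
    "spark_version": "<spark-version>",    # placeholder
    "node_type_id": "<node-type>",         # placeholder
    "num_workers": 8,
    "aws_attributes": {
        "availability": "ON_DEMAND",       # run all workers on on-demand capacity
    },
}

response = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=cluster_spec,
)
response.raise_for_status()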


Alternatively, if you want to continue using spot instances, you can reduce the chance of data loss by enabling Spark decommissioning. Decommissioning migrates shuffle and RDD data off a spot node before it is preempted.


Important

Decommissioning is best effort and does not guarantee that all data can be migrated before final preemption. It also cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data from an executor that is being decommissioned.


To enable decommissioning, add the following configurations to the cluster configuration under Advanced options > Spark:

spark.decommission.enabled true
spark.storage.decommission.enabled true
spark.storage.decommission.shuffleBlocks.enabled true
spark.storage.decommission.rddBlocks.enabled true
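
Once the cluster is running with these values, you can confirm they are in effect from a notebook. The check below is a small sketch that assumes a Databricks notebook where the SparkSession is already available as spark; it only reads the configuration back.

decommission_confs = [
    "spark.decommission.enabled",
    "spark.storage.decommission.enabled",
    "spark.storage.decommission.shuffleBlocks.enabled",
    "spark.storage.decommission.rddBlocks.enabled",
]

for conf_key in decommission_confs:
    # spark.conf.get returns the supplied default when a key was never set.
    print(conf_key, "=", spark.conf.get(conf_key, "false"))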

Additionally, under Advanced options > Environment variables, add:
SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"
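
Separately, the error message quoted above suggests a code-level mitigation: checkpoint the data before the repartition so that the shuffle map stage's output is no longer indeterminate. The PySpark sketch below assumes a notebook where spark and a DataFrame named df already exist; the checkpoint path is a placeholder and should point to storage that all executors can reach.

# Placeholder checkpoint location; use a path accessible to every executor.
spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")

# checkpoint(eager=True) materializes df and truncates its lineage, so a retried
# shuffle map stage re-reads the checkpointed data instead of recomputing
# indeterminate output.
df_stable = df.checkpoint(eager=True)
result = df_stable.repartition(200)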