Apache Spark jobs failing due to stage failure when using spot instances in a cluster

Use on-demand nodes instead of spot instances.

Written by Vidhi Khaitan

Last published at: August 11th, 2025

Problem

When using spot instances in your cluster, your Apache Spark jobs fail due to stage failures, with an error message similar to the following.

"org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage 2923 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again."


Cause

Spot instances can be preempted, leading to the loss of worker nodes in the cluster. When nodes are lost, any shuffle output stored on them is lost as well. The shuffle map stage then fails, and because its output is indeterminate, Spark cannot roll back the ResultStage to re-process the input data.


Solution

Use on-demand nodes instead of spot instances. In the cluster configuration, navigate to the Advanced tab and move the slider to the far right so that all worker nodes use on-demand instances.


For more information, review the “Decommission spot instances” section of the Manage compute (AWS | Azure | GCP) documentation.