Problem
When you use spot instances in your cluster, Apache Spark jobs can fail with a stage failure error message similar to the following.
"org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage 2923 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again."
Cause
Spot instances can be preempted at any time, which removes worker nodes from the cluster. When a node holding shuffle output is lost, the shuffle map stage is retried; because the stage produced indeterminate output, Spark cannot roll back the downstream ResultStage to re-process the input data, so the job fails.
Solution
Use on-demand instances instead of spot instances for worker nodes. In the cluster configuration, open the Advanced tab and move the slider all the way to the right to select on-demand nodes for workers.
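If you create clusters programmatically, you can request on-demand workers in the cluster specification instead of using the UI. The following is a minimal sketch using the Clusters API on AWS; the workspace URL, token, cluster name, runtime version, and node type are placeholders, not values from this article.

```python
import requests

# Placeholders -- substitute values from your own workspace.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "on-demand-workers",   # hypothetical name
    "spark_version": "<runtime-version>",  # pick a runtime from your workspace
    "node_type_id": "<node-type>",         # pick an instance type
    "num_workers": 4,
    "aws_attributes": {
        # ON_DEMAND requests on-demand instances only, so workers are not
        # subject to spot preemption.
        "availability": "ON_DEMAND",
    },
}

# Create the cluster through the Clusters API.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```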
For more information, review the “Decommission spot instances” section of the Manage compute (AWS | Azure | GCP) documentation.