Problem
When your cluster uses spot instances, your Apache Spark jobs fail due to stage failures with an error similar to the following:
"org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage 2923 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again."
Cause
Spot instances can be preempted, causing the cluster to lose nodes. When a node is lost, the shuffle map stage that wrote data to it fails. Because that stage's output is indeterminate, Spark cannot roll back the ResultStage to re-process the input data, so it must fail the job.
Solution
Use on-demand nodes instead of spot instances. In the cluster configuration, open the Advanced tab and move the slider all the way to the right so that all workers run on on-demand nodes.
Alternatively, if you want to keep using spot instances, you can reduce the chance of data loss by enabling Spark decommissioning, which migrates shuffle and RDD data off a node before the spot instance is preempted.
Important
Decommissioning is best effort. It does not guarantee that all data is migrated before final preemption, and it cannot prevent shuffle fetch failures when running tasks are fetching shuffle data from an executor that is being decommissioned.
To enable decommissioning, add the following Spark configurations to the cluster under Advanced options > Spark:
spark.decommission.enabled true
spark.storage.decommission.enabled true
spark.storage.decommission.shuffleBlocks.enabled true
spark.storage.decommission.rddBlocks.enabled true
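These properties must be in place before executors start, so the cluster-level configuration above is the reliable route on a managed cluster. If you build the session yourself, for example in a standalone PySpark application, the same properties can be set on the SparkConf, as in this sketch:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.decommission.enabled", "true")
    .set("spark.storage.decommission.enabled", "true")
    .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .set("spark.storage.decommission.rddBlocks.enabled", "true")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()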
Additionally, under Advanced options > Environment, add:
SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"
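If you create clusters through the Databricks Clusters API rather than the UI, the equivalent settings go in the spark_conf and spark_env_vars fields of the create request. A sketch assuming API 2.0; the workspace URL, token, cluster name, runtime version, node type, and worker count are placeholders:

import requests

payload = {
    "cluster_name": "spot-with-decommissioning",  # illustrative name
    "spark_version": "<runtime-version>",         # placeholder
    "node_type_id": "<node-type>",                # placeholder
    "num_workers": 2,                             # placeholder
    "spark_conf": {
        "spark.decommission.enabled": "true",
        "spark.storage.decommission.enabled": "true",
        "spark.storage.decommission.shuffleBlocks.enabled": "true",
        "spark.storage.decommission.rddBlocks.enabled": "true",
    },
    "spark_env_vars": {
        "SPARK_WORKER_OPTS": "-Dspark.decommission.enabled=true",
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()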