Problem
You notice your DLT (Delta Live Tables) pipeline fails with the following error.
org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = <id>, runId = <run-id>] terminated with exception: [DELTA_MERGE_MATERIALIZE_SOURCE_FAILED_REPEATEDLY] Keeping the source of the MERGE statement materialized has failed repeatedly. SQLSTATE: XXKST
Cause
Your pipeline is using spot instances, which the cloud provider can reclaim and terminate at any time, resulting in lost nodes and pipeline failure.
To confirm this cause, navigate to your cluster, click View details > Event log, and filter for NODES_LOST events.
Then click the Spark UI tab to check for the following error message.
[CHECKPOINT_RDD_BLOCK_ID_NOT_FOUND] Checkpoint block rdd_XXXX not found!
Either the executor that originally checkpointed this partition is no longer alive, or the original RDD is unpersisted.
If this problem persists, you may consider using `rdd.checkpoint()` instead, which is slower than local checkpointing but more fault-tolerant. SQLSTATE: 56000
Solution
Switch to on-demand instances. In DLT, add the following snippet under Settings > JSON settings in your pipeline.
{
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "availability": "ON_DEMAND"
      },
      "...": "..."
    }
  ]
}
If you want to continue using spot instances, enable Apache Spark decommissioning to reduce the chance of data loss when a node is reclaimed.
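Decommissioning is controlled through Spark configuration. As a minimal sketch, assuming the open-source Apache Spark 3.1+ decommissioning properties are available on your runtime, you could add a `spark_conf` block to the same cluster definition so that RDD and shuffle blocks are migrated off an instance before it is reclaimed:

```json
{
  "clusters": [
    {
      "label": "default",
      "spark_conf": {
        "spark.decommission.enabled": "true",
        "spark.storage.decommission.enabled": "true",
        "spark.storage.decommission.rddBlocks.enabled": "true",
        "spark.storage.decommission.shuffleBlocks.enabled": "true"
      }
    }
  ]
}
```

Confirm the supported property names for your Databricks Runtime version in the compute documentation referenced below before relying on this configuration.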
For more information, review the “Decommission spot instances” section of the Manage compute (AWS | Azure | GCP) documentation.