Streaming job gets stuck writing to checkpoint

Streaming job appears to be stuck even though no error is thrown. You are using DBFS for checkpoint storage, but it has filled up.

Written by Jose Gonzalez

Last published at: May 19th, 2022

Problem

You are monitoring a streaming job and notice that it appears to get stuck while processing data.

When you review the logs, you discover the job gets stuck when writing data to a checkpoint.

INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/state/0/89]:
INFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef@56a4cb80
INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=37),dir = dbfs:/FileStore/R_CHECKPOINT5/state/0/37]:
INFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef@56a4cb80
INFO HDFSBackedStateStoreProvider: Deleted files older than 313920 for HDFSStateStoreProvider[id = (op=0,part=25),dir = dbfs:/FileStore/PYTHON_CHECKPOINT5/state/0/25]:

Cause

You are using a checkpoint location on the local DBFS path.

%scala

val query = streamingInput.writeStream.option("checkpointLocation", "/FileStore/checkpoint").start()

Solution

Do not use DBFS for streaming checkpoint storage. Instead, use persistent storage, such as a path on cloud object storage, for streaming checkpoints.
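For example, here is a minimal sketch of the same streaming write with the checkpoint directed to an external cloud object storage path instead of DBFS. The S3 bucket and output paths are hypothetical placeholders; substitute a durable storage location that your workspace can access.

%scala

// Sketch: same streaming write as in the Cause section, but with the
// checkpoint on durable cloud object storage instead of DBFS.
// The S3 paths below are hypothetical placeholders; streamingInput is
// the streaming DataFrame from the earlier example.
val query = streamingInput.writeStream
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my-stream") // persistent, external checkpoint path
  .start("s3://my-bucket/output/my-stream")                             // illustrative output path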