Problem
You are monitoring a streaming job and notice that it appears to get stuck when processing data.
When you review the logs, you discover the job gets stuck when writing data to a checkpoint.
INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/state/0/89]:
INFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef@56a4cb80
INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=37),dir = dbfs:/FileStore/R_CHECKPOINT5/state/0/37]:
INFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef@56a4cb80
INFO HDFSBackedStateStoreProvider: Deleted files older than 313920 for HDFSStateStoreProvider[id = (op=0,part=25),dir = dbfs:/FileStore/PYTHON_CHECKPOINT5/state/0/25]:
Cause
You are trying to use a checkpoint location in your local DBFS path.
%scala
val query = streamingInput.writeStream
  .option("checkpointLocation", "/FileStore/checkpoint")
  .start()
Solution
Use persistent storage, such as cloud object storage, for streaming checkpoint locations. Do not use DBFS for streaming checkpoint storage.
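For example, point the checkpointLocation option at a durable cloud object storage path instead of a DBFS path. The sketch below reuses the streamingInput streaming DataFrame from the example above; the s3:// path is a placeholder, so substitute a persistent storage location that your cluster can access.

%scala
// Minimal sketch: write the streaming checkpoint to durable cloud object storage
// instead of the DBFS path used in the failing example.
// "s3://my-bucket/streaming/checkpoints/my-query" is a placeholder path, not a real location.
val query = streamingInput.writeStream
  .option("checkpointLocation", "s3://my-bucket/streaming/checkpoints/my-query") // persistent storage, not dbfs:/FileStore
  .start()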