Checkpoint files not being deleted when using display()

Problem

You have a streaming job using display() to display DataFrames.

val streamingDF = spark.readStream.schema(schema).parquet(<input_path>)
display(streamingDF)

Checkpoint files are being created, but are not being deleted.

You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder. Checkpoint files remain in the folder.

Cause

The command display(streamingDF) is a memory sink implementation that can display the data from the streaming DataFrame for every micro-batch. A checkpoint directory is required to track the streaming updates.

If you have not specified a custom checkpoint location, a default checkpoint directory is created at /local_disk0/tmp/.

Databricks uses the checkpoint directory to ensure correct and consistent progress information. When a stream is shut down, either purposely or accidentally, the checkpoint directory allows Databricks to restart and pick up exactly where it left off.

If a stream is shut down by cancelling the stream from the notebook, the Databricks job attempts to clean up the checkpoint directory on a best-effort basis. If the stream is terminated in any other way, or if the job is terminated, the checkpoint directory is not cleaned up.

This is as designed.

Solution

You can prevent unwanted checkpoint files with the following guidelines.

  • You should not use display(streamingDF) in production jobs.
  • If display(streamingDF) is mandatory for your use case, you should manually specify the checkpoint directory by using the Apache Spark config option spark.sql.streaming.checkpointLocation.
  • If you manually specify the checkpoint directory, you should periodically delete any remaining files in this directory. This can be done on a weekly basis.