Job failing with DELTA_CHANGE_DATA_FILE_NOT_FOUND error

Use the ignoreMissingFiles config or a new checkpoint.

Written by sidhant.sahu

Last published at: April 9th, 2025

Problem

When working with the Delta Lake change data feed (CDF) feature, you encounter a file not found error message. 

pyspark.errors.exceptions.connect.SparkException: [FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file <file-path>.snappy.parquet. [DELTA_CHANGE_DATA_FILE_NOT_FOUND] File <file-path>.snappy.parquet referenced in the transaction log cannot be found. 

 

Cause

The error message typically includes the following explanation.

This can occur when data has been manually deleted from the file system rather than using the table `DELETE` statement. This request appears to be targeting Change Data Feed, if that is the case, this error can occur when the change data file is out of the retention period and has been deleted by the `VACUUM` statement. 
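To confirm whether retention is the likely cause, you can inspect the table's retention-related properties. The following is a minimal sketch, assuming a placeholder table name my_table; if delta.deletedFileRetentionDuration is not set, VACUUM uses the default retention of seven days.

# Minimal sketch: inspect retention-related table properties.
# "my_table" is a placeholder; replace it with your table name.
spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)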

 

Solution

Start your stream with a new checkpoint location. Starting fresh lets the stream resume from the beginning instead of getting stuck on missing files that the old checkpoint still references. Make a backup of your older checkpoint location first.
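For reference, a restart of a CDF read stream against a fresh checkpoint location might look like the following minimal sketch. The table names and the checkpoint path are placeholders, not your job's actual values.

# Minimal sketch: restart the CDF stream with a new checkpoint location.
# "source_table", "target_table", and <new-checkpoint-path> are placeholders.
(
    spark.readStream
        .format("delta")
        .option("readChangeFeed", "true")
        # Optionally set "startingVersion" to a version whose change data files still exist.
        .table("source_table")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "<new-checkpoint-path>")
        .toTable("target_table")
)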

 

If you don't want to use a new checkpoint location, add the following configuration to your cluster (a session-level alternative is sketched after the steps).

  1. Navigate to your cluster and click on it to open the settings.
  2. Scroll down to Advanced options and click to expand. 
  3. Under the Spark tab, in the Spark config box, enter the following code. 
spark.sql.files.ignoreMissingFiles true
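If you prefer to scope the setting to a single notebook or job session rather than the whole cluster, you can also set it at runtime. This is a minimal sketch; the effect is the same, but the setting only lasts for the current Spark session.

# Session-level alternative to the cluster-wide Spark config above.
# Remove it once the stream has recovered (see the note below).
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")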

 

Important

Databricks recommends removing the spark.sql.files.ignoreMissingFiles flag once the stream has recovered, so that any new missing file errors are not silently skipped.

 

For more information on ignoreMissingFiles, refer to the Apache Spark Generic File Source Options documentation.