Problem
When working with the Delta Lake change data feed (CDF) feature, you encounter a file not found error message.
pyspark.errors.exceptions.connect.SparkException: [FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file <file-path>.snappy.parquet. [DELTA_CHANGE_DATA_FILE_NOT_FOUND] File <file-path>.snappy.parquet referenced in the transaction log cannot be found.
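For context, this error typically surfaces on a change data feed read such as the following sketch. The table name is a placeholder; in a Databricks notebook the `spark` session is already defined.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a streaming CDF read that can raise this error once the
# referenced change data files have been removed. The table name is a placeholder.
spark = SparkSession.builder.getOrCreate()

cdf_stream = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")        # request the change data feed
    .table("my_catalog.my_schema.my_table")  # placeholder table name
)
```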
Cause
The error message typically includes the following explanation.
This can occur when data has been manually deleted from the file system rather than using the table `DELETE` statement. This request appears to be targeting Change Data Feed, if that is the case, this error can occur when the change data file is out of the retention period and has been deleted by the `VACUUM` statement.
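As a sketch, you can inspect the retention-related table properties that determine when `VACUUM` is allowed to remove older files, including change data files. The table name is a placeholder; if a property is not listed, the table uses the default retention period.

```python
# Inspect Delta retention properties such as delta.deletedFileRetentionDuration.
# If no value is listed, the table falls back to the default retention period.
spark.sql(
    "SHOW TBLPROPERTIES my_catalog.my_schema.my_table"
).filter("key LIKE 'delta.%RetentionDuration'").show(truncate=False)
```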
Solution
Start your stream with a new checkpoint location. A new checkpoint location lets the stream restart from the beginning instead of getting stuck on the missing files. Make a backup of your old checkpoint location before switching.
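A minimal sketch of that restart, assuming a Databricks notebook where `dbutils` and `spark` are predefined. All paths and table names are placeholders.

```python
old_checkpoint = "/mnt/checkpoints/cdf_stream"      # existing checkpoint (placeholder)
new_checkpoint = "/mnt/checkpoints/cdf_stream_v2"   # fresh, empty location (placeholder)

# Back up the old checkpoint directory before switching.
dbutils.fs.cp(old_checkpoint, old_checkpoint + "_backup", recurse=True)

# Restart the CDF stream, writing its checkpoint to the new location.
(
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("my_catalog.my_schema.my_table")          # placeholder source table
    .writeStream.format("delta")
    .option("checkpointLocation", new_checkpoint)
    .toTable("my_catalog.my_schema.my_target")       # placeholder target table
)
```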
If you don't want to use a new checkpoint location, add the following configuration to your cluster.
- Navigate to your cluster and click it to open the cluster settings.
- Scroll down to Advanced options and click to expand.
- Under the Spark tab, in the Spark config box, enter the following code.
spark.sql.files.ignoreMissingFiles true
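Alternatively, if you prefer to set the flag from a notebook instead of the cluster UI, a minimal sketch of the equivalent session-level setting:

```python
# Session-level equivalent of the cluster Spark config above. Remove it once
# the stream has recovered so new missing-file errors are not silently skipped.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
```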
Important
Databricks recommends that once the stream has recovered, you remove the `spark.sql.files.ignoreMissingFiles` flag to ensure that any new missing file errors are not skipped without notification.
For more information on `ignoreMissingFiles`, refer to the Apache Spark Generic File Source Options documentation.