A file referenced in the transaction log cannot be found

A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement.

Written by Adam Pavlacka

Last published at: May 10th, 2022

Problem

Your job fails with an error message: A file referenced in the transaction log cannot be found.

Example stack trace:

Error in SQL statement: SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 106, XXX.XXX.XXX.XXX, executor 0): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/<path>/part-00000-da504c51-3bb4-4406-bb99-3566c0e2f743-c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions ... Caused by: java.io.FileNotFoundException: dbfs:/mnt/<path>/part-00000-da504c51-3bb4-4406-bb99-3566c0e2f743-c000.snappy.parquet ...

Cause

There are three common causes for this error message.

  • Cause 1: You start the Delta streaming job, but before the streaming job starts processing, the underlying data is deleted.
  • Cause 2: You perform updates to the Delta table, but the transaction files are not updated with the latest details.
  • Cause 3: You attempt multi-cluster read or update operations on the same Delta table, resulting in one cluster referring to files from a table that was deleted and recreated by another cluster.

Solution

  • Cause 1: You should use a new checkpoint directory, or set the Spark property spark.sql.files.ignoreMissingFiles to true in the cluster’s Spark Config.
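For Cause 1, a minimal Python sketch of both options follows. The stream, paths, and placeholder names are for illustration only and are not taken from this article.

%python

# Option 1 (placeholder paths): restart the stream with a new, empty checkpoint directory.
(spark.readStream
  .format("delta")
  .load("dbfs:/mnt/<source-table-path>")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "dbfs:/mnt/<new-checkpoint-dir>")
  .start("dbfs:/mnt/<target-table-path>"))

# Option 2: ignore files that are missing from the file system for this session.
# The equivalent cluster-level setting in the Spark Config is:
#   spark.sql.files.ignoreMissingFiles true
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")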
  • Cause 2: Wait for the data to load, then refresh the table. You can also run fsck to update the transaction files with the latest details; see the sketch after the note below.

Info

fsck removes any file entries that cannot be found in the underlying file system from the transaction log of a Delta table.
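A minimal sketch of that workflow in a Python notebook cell is shown below; <table-name> is a placeholder for your Delta table, and the DRY RUN step is an optional preview.

%python

# Preview the file entries that fsck would remove (placeholder table name).
display(spark.sql("FSCK REPAIR TABLE <table-name> DRY RUN"))

# Remove the missing file entries from the transaction log, then refresh the table.
spark.sql("FSCK REPAIR TABLE <table-name>")
spark.sql("REFRESH TABLE <table-name>")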

  • Cause 3: When a table has been deleted and recreated, the metadata cache in the driver becomes incorrect. You should not delete a table; always overwrite it instead. If you do delete a table, clear the metadata cache to mitigate the issue. You can use a Python or Scala notebook command to clear the cache.
%python

# Clear the Delta table metadata (DeltaLog) cache on the driver.
spark._jvm.com.databricks.sql.transaction.tahoe.DeltaLog.clearCache()

%scala

// Clear the Delta table metadata (DeltaLog) cache on the driver.
com.databricks.sql.transaction.tahoe.DeltaLog.clearCache()