Problem
When running a streaming job that reads from system tables using Delta Sharing, you encounter a FileReadException error.
Example error message
com.databricks.sql.io.FileReadException: Error while reading file delta-sharing:/XXXXXXXXXXXXXXXX.uc-deltasharing%253A%252F%252Fsystem.access.audit%2wer3system.access.audit_XXXXXXXXXXXXXX. Caused by: org.apache.spark.SparkIOException: [HDFS_HTTP_ERROR.KEY_NOT_EXIST] When attempting to read from HDFS, HTTP request failed. Status 404 Not Found. Could not find key: HTTP request failed with status: HTTP/1.1 404 Not Found
NoSuchKey The specified key does not exist.
Cause
Your scheduled job runs don’t occur frequently enough to keep up with the current source table version. The stream is still on an outdated source table version, and it is trying to read data files that no longer exist on the source system table.
Additionally, when a job runs, if maxVersionsPerRpc is set at the default (100), the streaming query only processes 100 versions of the source table per remote procedure call (RPC) request. If the current source table version is more than 100 versions ahead, a single job run can’t catch your table up to the current version.
Solution
Increase the job frequency so the source table version you’re working with doesn't fall behind the current version.
Alternatively, keep your job frequency at one batch per day but increase maxVersionsPerRpc to 500, or to an upper limit that allows for processing a day’s worth of data.
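For example, a daily job with the raised option might look like the following minimal sketch. The shared table, checkpoint location, and target table names are placeholders, not values from this article.

# Minimal sketch (Databricks notebook): raise maxVersionsPerRpc on the
# streaming read so each RPC fetches more source table versions.
df = (
    spark.readStream
    .format("deltaSharing")
    .option("maxVersionsPerRpc", 500)  # default is 100
    .table("shared_catalog.access.audit")  # placeholder shared system table
)

(
    df.writeStream
    .option("checkpointLocation", "/checkpoints/system_audit_stream")  # placeholder
    .trigger(availableNow=True)  # process all data available at run time, then stop
    .toTable("main.default.audit_replica")  # placeholder target table
)

With Trigger.AvailableNow, each scheduled run processes everything available when the run starts and then stops, which fits the one-batch-per-day schedule.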
Important
Increasing maxVersionsPerRpc means more files are processed per request, which increases the likelihood of hitting Delta Sharing server limits (either 1000 files or five minutes).
For more information on streaming, review the Read Delta Sharing Tables documentation.
For more information on system tables, review the Monitor account activity with system tables (AWS | Azure | GCP) documentation.