FileReadException error when trying to run streaming job reading from system tables

Increase the job frequency or raise maxVersionsPerRpc.

Written by lucas.rocha

Last published at: January 10th, 2025

Problem

When running a streaming job that reads from system tables using Delta Sharing, you encounter a FileReadException error. 


Example error message

com.databricks.sql.io.FileReadException: Error while reading file delta-sharing:/XXXXXXXXXXXXXXXX.uc-deltasharing%253A%252F%252Fsystem.access.audit%2wer3system.access.audit_XXXXXXXXXXXXXX. Caused by: org.apache.spark.SparkIOException: [HDFS_HTTP_ERROR.KEY_NOT_EXIST] When attempting to read from HDFS, HTTP request failed. Status 404 Not Found. Could not find key: HTTP request failed with status: HTTP/1.1 404 Not Found 
NoSuchKey The specified key does not exist.


Cause

Your scheduled job runs don’t occur frequently enough to keep up with the current version of the source table. The stream is working from an outdated source table version, so it tries to read data files that no longer exist on the source system table.

Additionally, when maxVersionsPerRpc is left at its default (100), the streaming query processes at most 100 versions of the source table per remote procedure call (RPC) request. If the current source table version is more than 100 versions ahead, a single job run can’t catch your table version up to the current version.
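To reason about the required cadence, you can compare the source table's daily version growth against the per-run limit. A minimal sketch, assuming (as the explanation above describes) that each run advances the stream by at most maxVersionsPerRpc versions; the daily-growth figure is something you would measure yourself and is illustrative here:

```python
import math


def runs_needed_per_day(versions_per_day: int, max_versions_per_rpc: int = 100) -> int:
    """Minimum daily job runs needed so the stream keeps pace with the source,
    assuming each run processes at most max_versions_per_rpc table versions."""
    return math.ceil(versions_per_day / max_versions_per_rpc)


# A source advancing 450 versions/day at the default limit of 100 needs
# at least 5 runs per day:
# runs_needed_per_day(450)      -> 5
# ...or one daily run with a higher limit:
# runs_needed_per_day(450, 500) -> 1
```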


Solution

Increase the job frequency so the source table version you’re working with doesn't fall behind the current version. 

Alternatively, keep your job frequency at one batch per day but increase maxVersionsPerRpc to 500, or to a value high enough to process a day’s worth of data.
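As a sketch, the higher limit is passed as a reader option on the Delta Sharing streaming source. The option name maxVersionsPerRpc comes from this article; the table name and the surrounding Spark calls are placeholders shown commented out, since they require a Databricks or Spark runtime:

```python
def delta_sharing_stream_options(max_versions_per_rpc: int = 500) -> dict:
    """Reader options for a Delta Sharing streaming source.

    Raising maxVersionsPerRpc above the default of 100 lets a single
    daily run request more source table versions per RPC."""
    return {"maxVersionsPerRpc": str(max_versions_per_rpc)}


# Usage on a Databricks/Spark runtime (placeholder table name):
# df = (spark.readStream
#         .format("deltaSharing")
#         .options(**delta_sharing_stream_options(500))
#         .load("<share>.<schema>.<table>"))
```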


Important

Increasing maxVersionsPerRpc means more files are processed per request, increasing the likelihood of hitting Delta Sharing server limits (either 1000 files or five minutes).


For more information on streaming, review the Read Delta Sharing Tables documentation.

For more information on system tables, review the Monitor account activity with system tables (AWS, Azure, GCP) documentation.