Problem
You may encounter repeated DatabricksS3LoggingException errors while attempting to read a binary file in an Apache Spark Structured Streaming job. Despite these errors, the job itself does not fail, indicating that the IAM role has all the necessary permissions. The error message typically includes a 403 Forbidden status code, suggesting an access issue with the S3 bucket.
Example
You try to read a binary file.
spark.read.format("binaryFile").load(file_path)
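For context, a slightly fuller sketch of the same read, assuming a hypothetical S3 path; the binaryFile source returns one row per file with path, modificationTime, length, and content columns.

# Hypothetical path for illustration only.
file_path = "s3://<s3-bucket>/landing/images/"

# Each file becomes one row: path, modificationTime, length, content.
df = spark.read.format("binaryFile").load(file_path)
df.select("path", "length").show(truncate=False)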
Spark configuration used in the example job:
spark.hadoop.fs.s3a.credentialsType AssumeRole
spark.hadoop.fs.s3a.stsAssumeRole.arn arn:aws:iam::xxxx:role/instance-profile-role
spark.hadoop.fs.s3a.canned.acl BucketOwnerFullControl
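These settings are normally entered in the cluster's Spark config UI. As a rough programmatic equivalent, a sketch of setting them when building a session; keys prefixed with spark.hadoop. are copied into the Hadoop configuration, and the ARN is the placeholder from above.

from pyspark.sql import SparkSession

# Sketch only: on Databricks these settings usually belong in the
# cluster's Spark config rather than in application code.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.credentialsType", "AssumeRole")
    .config("spark.hadoop.fs.s3a.stsAssumeRole.arn",
            "arn:aws:iam::xxxx:role/instance-profile-role")
    .config("spark.hadoop.fs.s3a.canned.acl", "BucketOwnerFullControl")
    .getOrCreate()
)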
After trying to read the binary file, you get a DatabricksS3LoggingUtils error in the driver log4j2 logs.
ERROR DatabricksS3LoggingUtils$:V3: S3 request failed with com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD https://<s3-bucket>.<s3-region>.amazonaws.com _delta_log {}
Cause
The DatabricksS3LoggingException error is related to the way Spark handles binary file reads in Structured Streaming jobs. When reading binary format files, Spark first checks whether the specified path (file_path) contains Delta files by probing the _delta_log directory. Because the files are not in Delta format, Spark does not have permission to access that location, and the request fails with a 403 Forbidden error. This behavior comes from Spark's default format check, which expects Delta files and therefore attempts to access the _delta_log directory.
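To confirm that the path is not a Delta table, and therefore that the _delta_log probe is expected, you can check explicitly. A minimal sketch, assuming the Delta Lake Python API is available on the cluster and reusing the hypothetical file_path from the example above:

from delta.tables import DeltaTable

# False for plain binary files: there is no _delta_log under this path,
# which is exactly what Spark's format check goes looking for.
print(DeltaTable.isDeltaTable(spark, file_path))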
Solution
- Disable the Delta format check by setting spark.databricks.delta.formatCheck.enabled to false in the compute cluster's Spark config (or per session, as shown in the sketch after this list).
- Ensure the IAM role used in the Spark job has the necessary permissions to access the S3 bucket.
- Verify the policies attached to the instance profile IAM role, arn:aws:iam::xxxx:role/instance-profile-role.
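If you only need to disable the check for a single notebook session rather than cluster-wide, the same flag can be set at runtime; a sketch:

# Session-level alternative to the cluster Spark config entry above.
spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")

# The binaryFile read should now proceed without probing _delta_log.
df = spark.read.format("binaryFile").load(file_path)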
For more information, review the Configure S3 access with an instance profile tutorial.