Cannot access objects written by Databricks from outside Databricks

Learn how to resolve a HeadObject operation error and access objects written by Databricks from outside Databricks.

Written by Adam Pavlacka

Last published at: March 8th, 2022

Problem

When you attempt to access an object in an S3 location written by Databricks using the AWS CLI, the following error occurs:

ubuntu@0213-174944-clean111-10-93-15-150:~$ aws s3 cp s3://<bucket>/<location>/0/delta/sandbox/deileringDemo__m2/_delta_log/00000000000000000000.json .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Cause

S3 access fails because the bucket ACL allows access only to the bucket owner ("DisplayName": "bigdata_dataservices") or your account ("DisplayName": "infra").

This is expected behavior if you are trying to access Databricks objects stored in the Databricks File System (DBFS) root directory. The DBFS root bucket is assigned to Databricks for storing metadata, libraries, and so on. Therefore, the object owner (within the Databricks AWS account) is the canonical user ID assigned to the customer.

Objects written from a Databricks notebook into the DBFS root bucket receive the following object permissions:

{
  "Owner": {
    "DisplayName": "infra",
    "ID": "f65635fc2d277e71b19495a2a74d8170dd035d3e8aa6fc7187696eb42c6c276c"
  }
}

The "ID" value identifies a Databricks customer, and by extension the customer's objects in the Databricks account.

Solution

To access objects in DBFS, use the Databricks CLI, DBFS API, Databricks Utilities, or Apache Spark APIs from within a Databricks notebook.
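
For example, a minimal sketch using the Databricks CLI, assuming the CLI is already configured for your workspace; the paths are placeholders:

# List the DBFS location and copy a file to the local machine
databricks fs ls dbfs:/<path>/
databricks fs cp dbfs:/<path>/<file> .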

If you need to access the data from outside Databricks, migrate it from the DBFS root bucket to another bucket where the bucket owner has full control.

Databricks does not recommend using the DBFS root directory to store user files or objects. As a best practice, create a separate S3 location and mount it to DBFS.

There are two migration scenarios:

Scenario 1: The destination Databricks data plane and S3 bucket are in the same AWS account

Attach the IAM role (as an instance profile) to the cluster where the data currently resides; the cluster needs this role in order to write to the destination bucket.

Set the Amazon S3 ACL to BucketOwnerFullControl in the cluster's Spark configuration:

spark.hadoop.fs.s3a.acl.default BucketOwnerFullControl

With BucketOwnerFullControl set, each object write also applies the corresponding PutObjectAcl, granting the bucket owner full control of the new files. You can then use standard S3 commands to back them up.
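
For example, once the destination objects grant the bucket owner full control, a backup from outside Databricks could be as simple as the following; the bucket and prefix are placeholders:

# Copy the migrated data out of the destination bucket
aws s3 sync s3://<destination-bucket>/<prefix> ./backup/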

Scenario 2: The destination Databricks data plane and S3 bucket are in different AWS accounts

The objects are still owned by Databricks because it is a cross-account write.

To avoid this, have the cluster assume a role in the destination account by using an instance profile with an AssumeRole policy.
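
For reference, the underlying STS call looks like this when issued from the AWS CLI; within Databricks, the role assumption is configured on the cluster through the instance profile rather than run by hand. The role ARN and session name are placeholders.

# Assume a role in the destination account and receive temporary credentials
aws sts assume-role --role-arn arn:aws:iam::<destination-account-id>:role/<cross-account-role> --role-session-name <session-name>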

Tips for migrating across accounts using the AWS API or CLI

If you are using an IAM role and writing to a cross-account bucket (that is, the Databricks data plane and the S3 bucket are in different accounts), upload the object with PutObject and then grant the bucket owner full control with PutObjectAcl, using the aws s3api commands:

aws s3api put-object-acl --bucket <bucket> --key <key> --acl bucket-owner-full-control
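
Alternatively, if you perform the upload yourself with the higher-level aws s3 commands, you can grant the bucket owner full control in the same step; the local file, bucket, and key are placeholders:

# Upload the object and set the ACL in a single step
aws s3 cp <local-file> s3://<bucket>/<key> --acl bucket-owner-full-control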