Error when reading data from ADLS Gen1 with Sparklyr

Learn how to resolve errors that occur when reading data from Azure Data Lake Storage Gen1 with Sparklyr in Databricks.

Written by Adam Pavlacka

Last published at: December 9th, 2022

Problem

When using a cluster with Azure AD Credential Passthrough enabled, commands that you run on that cluster are able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.

For example, you can directly access data using:

%python

spark.read.csv("adl://myadlsfolder.azuredatalakestore.net/MyData.csv").collect()

However, when you try to access the same data directly using sparklyr:

%r

spark_read_csv(sc, name = "air", path = "adl://myadlsfolder.azuredatalakestore.net/MyData.csv")

It fails with the error:

com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen1 Token

Cause

The spark_read_csv() function in sparklyr cannot retrieve the Azure AD passthrough token, so it cannot authenticate to ADLS Gen1 and read the data.

Solution

A workaround is to use an Azure application (client) ID, application key, and directory (tenant) ID to mount the ADLS Gen1 location in DBFS:

%python

# Get credentials and ADLS URI from Azure
applicationId = "<application-id>"
applicationKey = "<application-key>"
directoryId = "<directory-id>"
adlURI = "<adl-uri>"
assert adlURI.startswith("adl:"), "Verify the adlURI variable is set and starts with adl:"

# Mount ADLS location to DBFS
dbfsMountPoint = "<mount-point-location>"
dbutils.fs.mount(
  mount_point = dbfsMountPoint,
  source = adlURI,
  extra_configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": applicationId,
    "dfs.adls.oauth2.credential": applicationKey,
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/{}/oauth2/token".format(directoryId)
  })
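
If the mount succeeds, you can optionally confirm it from the same Python notebook before switching to R. The sketch below simply lists the mounted location with dbutils.fs.ls, reusing the dbfsMountPoint variable defined above; the files shown depend on what is stored in your ADLS Gen1 folder.

%python

# Optional check: list the mounted location to confirm the files are visible.
# dbfsMountPoint is the variable defined in the mount step above.
display(dbutils.fs.ls(dbfsMountPoint))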

Then, in your R code, read data using the mount point:

%r

# Install and load sparklyr
install.packages("sparklyr")
library(sparklyr)

# Create a sparklyr connection
sc <- spark_connect(method = "databricks")

# Read the data through the DBFS mount point
myData <- spark_read_csv(sc, name = "air", path = "dbfs:/<mount-point-location>/MyData.csv")
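
When the mount is no longer needed, you can remove it from a Python cell. This is an optional cleanup step using dbutils.fs.unmount with the same mount point placeholder used above.

%python

# Optional cleanup: remove the DBFS mount when you are done with it
dbutils.fs.unmount("<mount-point-location>")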