Problem
When using a cluster with Azure AD Credential Passthrough enabled, commands that you run on that cluster are able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.
For example, you can access data directly with:
%python
spark.read.csv("adl://myadlsfolder.azuredatalakestore.net/MyData.csv").collect()
However, when you try to access the same data directly with sparklyr:
%r
spark_read_csv(sc, name = "air", path = "adl://myadlsfolder.azuredatalakestore.net/MyData.csv")
The command fails with the error:
com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen1 Token
Cause
The spark_read_csv function in sparklyr is not able to extract the ADLS passthrough token, so it cannot authenticate to the storage account and read the data.
Solution
As a workaround, use an Azure service principal's application ID, application key, and directory ID to mount the ADLS location in DBFS:
%python
# Get credentials and ADLS URI from Azure
applicationId = <application-id>
applicationKey = <application-key>
directoryId = <directory-id>
adlURI = <adl-uri>
assert adlURI.startswith("adl:"), "Verify the adlURI variable is set and starts with adl:"

# Mount ADLS location to DBFS
dbfsMountPoint = <mount-point-location>
dbutils.fs.mount(
  mount_point = dbfsMountPoint,
  source = adlURI,
  extra_configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": applicationId,
    "dfs.adls.oauth2.credential": applicationKey,
    "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/{}/oauth2/token".format(directoryId)
  }
)
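Before switching to R, you can optionally confirm that the mount succeeded by listing the mounted directory. This is a minimal check, assuming the dbfsMountPoint variable from the cell above:

%python
# Optional: list the contents of the new mount point to confirm the
# ADLS location is reachable through DBFS (uses dbfsMountPoint from above)
display(dbutils.fs.ls(dbfsMountPoint))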
Then, in your R code, read data using the mount point:
%r
# Install sparklyr
install.packages("sparklyr")
library(sparklyr)

# Create a sparklyr connection
sc <- spark_connect(method = "databricks")

# Read data from the DBFS mount point
myData <- spark_read_csv(sc, name = "air", path = "dbfs:/<mount-point-location>/myData.csv")
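If you later want to remove the mount (for example, after rotating the application key), you can unmount it from a Python cell. This is a minimal sketch, assuming the same mount point location used above:

%python
# Remove the DBFS mount when it is no longer needed
# (assumes <mount-point-location> is the same path passed to dbutils.fs.mount above)
dbutils.fs.unmount("<mount-point-location>")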