Problem
When using pandas DataFrames to work with files in an S3 Unity Catalog storage location, you encounter the following error.
ImportError: Missing optional dependency 'fsspec'. Use pip or conda to install fsspec.
The error arises even after you install the fsspec library. Additionally, no results are displayed in the notebook cell.
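For example, a direct read from an S3 URI, shown here with hypothetical placeholders, is the kind of call that can surface this error when pandas tries to load fsspec to resolve the path.
%python
import pandas as pd
# Hypothetical example of the pattern that triggers the ImportError
df = pd.read_csv("s3://<bucket-name>/<path>/<file-name>.csv")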
Cause
Managing and manipulating files located in cloud object storage using pandas is not supported.
Solution
To work with files located in cloud object storage governed by Unity Catalog using pandas, use either a volume or service credentials with Boto3.
Use a volume
- Create a Unity Catalog volume on the same cloud location (a minimal sketch follows this list). For details, refer to the Create and manage volumes documentation.
- Then, reference the volume as a path in your pandas code. For details, refer to the “Work with files in Unity Catalog volumes” section of the Work with files on Databricks documentation.
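The following is a minimal sketch of creating an external volume from a Python cell, assuming an external location already covers the S3 path. The catalog, schema, volume name, and S3 path are placeholders to replace with your own values.
%python
# Hypothetical names - adjust the catalog, schema, volume, and S3 path to your environment
spark.sql("""
  CREATE EXTERNAL VOLUME IF NOT EXISTS <catalog>.<schema>.<volume-name>
  LOCATION 's3://<bucket-name>/<path>'
""")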
Code example
The following code imports pandas, defines your external volume location, creates a test DataFrame, and then saves the DataFrame using pandas with the volume path.
%python
import pandas as pd
# Define the external volume location
external_location_as_volume = "/Volumes/<your-volume-path>"
# Create a test DataFrame
data = {'Column1': [1, 2, 3], 'Column2': ['A', 'B', 'C']}
test_df = pd.DataFrame(data)
# Save using pandas with the volume path
test_df.to_csv(f"{external_location_as_volume}/<file-name>.csv")
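To verify the write, you can read the file back from the same volume path. The file name below is the same placeholder used above.
%python
# Read the CSV back from the volume path to confirm the file was written
df_check = pd.read_csv(f"{external_location_as_volume}/<file-name>.csv")
display(df_check)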
Use service credentials with Boto3
1. Create a service credential with an IAM role associated with the S3 bucket. (Do not use a storage credential.) For more information, refer to the Create service credentials documentation.
2. Create a new notebook and import Boto3 and pandas.
import boto3
import pandas as pd
3. Call the Databricks function dbutils.credentials.getServiceCredentialsProvider with the credential name from step 1 to initiate the Boto3 session. The following code shows where to include the credential name.
boto3_session = boto3.Session(botocore_session=dbutils.credentials.getServiceCredentialsProvider('<your-service-credential-name>'), region_name='<your-region>')
s3 = boto3_session.client('s3')
4. Use this Boto3 session along with your pandas DataFrame code to reference the cloud path in your notebook. The following is a full example of how to do this.
Full code example
Run the following code to create a pandas DataFrame, convert it to CSV format, and upload it to a specified path in an AWS S3 bucket using credentials managed through Databricks service credentials.
# Step 2 of the KB
%python
import io
import boto3
import pandas as pd
# Step 3 of the KB - Create an S3 client using the credentials created in step 1
boto3_session = boto3.Session(botocore_session=dbutils.credentials.getServiceCredentialsProvider('<your-service-credential-name>'), region_name='<your-region>')
s3 = boto3_session.client('s3')
# From here on is an example of how to proceed from step 4
# S3 references
AWS_S3_BUCKET = "<bucket-name>"
subpath = '<target-bucket-subpath>'
object_name = '<your-file-name>.csv'
# Construct the full key, which is used as the S3 object path
key = subpath + object_name
# Create a test DataFrame as an example for this code
# You can substitute a DataFrame you have already created
data = {'Column1': [1, 2, 3], 'Column2': ['A', 'B', 'C']}
test_df = pd.DataFrame(data)
display(test_df)
# Write the DataFrame to an in-memory CSV buffer, which is required for this approach
with io.StringIO() as csv_buffer:
    test_df.to_csv(csv_buffer, index=False)
    # Upload the CSV buffer to S3 using put_object
    response = s3.put_object(
        Bucket=AWS_S3_BUCKET,
        Key=key,
        Body=csv_buffer.getvalue()
    )
# Check the status of the upload. If all references were adjusted properly, a 200 status is returned here
# and the file is created in your S3 bucket path
status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
if status == 200:
    print(f"Successful S3 put_object response. Status - {status}")
else:
    print(f"Unsuccessful S3 put_object response. Status - {status}")
For more information about Boto3, refer to the AWS Boto3 documentation.