Problem
When attempting to read or write PyArrow-created feather files from an S3 bucket using a Unity Catalog (UC) cluster, the operation fails with the following error.
FileNotFoundError: [Errno 2] No such file or directory: 'file_path'
The failure persists when using the spark.read.csv and spark.read.format("arrow") methods. You notice the operation works in a non-UC cluster.
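For example, read attempts like the following fail on a UC cluster. (The S3 path shown is a placeholder.)
# Both attempts raise the error above on a UC cluster
df = spark.read.csv("s3://your-bucket/path/file.feather")
df = spark.read.format("arrow").load("s3://your-bucket/path/file.feather")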
Cause
The spark.read.csv method is designed to read CSV files, and the spark.read.format("arrow") method is designed to read Arrow files. Neither method is compatible with feather files.
Additionally, your UC cluster configuration may lack the necessary PyArrow library required for reading feather files.
Solution
- Ensure that the PyArrow library is installed in your Databricks Runtime cluster.
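If the library is not already installed, you can add it as a notebook-scoped library. The following command provides an example.
%pip install pyarrow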
- Use the pandas library's read_feather() function to read the feather file. The following code snippet provides an example.
import pandas as pd

# Read the feather file from its DBFS path ('/dbfs/path/to/your/file.feather' is a placeholder)
df = pd.read_feather('/dbfs/path/to/your/file.feather')
# Convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df)
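To write a feather file back out, the reverse conversion follows the same pattern. This is a minimal sketch; the output path is a placeholder.
# Convert the Spark DataFrame back to pandas, then write it as a feather file
out_df = spark_df.toPandas()
out_df.to_feather('/dbfs/path/to/your/output.feather')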
- Enable Arrow-based columnar data transfers. In your cluster settings, under Advanced options > Spark tab, enter the following configuration in the Spark config field.
spark.sql.execution.arrow.pyspark.enabled true
Alternatively, set the property for the current session from a notebook.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
For more information, refer to the Convert between PySpark and pandas DataFrames documentation.