Error java.io.FileNotFoundException when job attempts to read or write intermediary files

Cache the DataFrame before performing write operations and ensure you are using a compatible version of the com.crealytics.spark.excel library.

Written by John Benninghoff

Last published at: November 14th, 2024

Problem

When your job attempts to read or write intermediary files, such as Excel files, in the Databricks File System (DBFS), you encounter a java.io.FileNotFoundException error. 

The error message may include paths with the workspace ID, such as /00000000000000/<file-store>/<your-project>/<your-spreadsheet>.xlsx


Cause

The job is attempting to access a file that has been deleted or is in the process of being deleted. 

Incorrect or expired storage account authentication can also cause a similar file access issue. In such cases, the error message includes an unauthorized error.


Solution

Cache the DataFrame before performing write operations. Because cache() is lazy, follow it with an action such as show() so that the data is materialized in memory before any file operations are attempted. Add the following line before the write operation.

df.cache().show()
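In context, the pattern looks like the following sketch. It assumes a Databricks notebook where `spark` is the provided session and the com.crealytics spark-excel library is attached to the cluster; the paths and options are placeholders, not values from the original job.

```python
# Read the Excel file using the spark-excel library (assumes it is
# installed on the cluster).
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .load("dbfs:/<file-store>/<your-project>/<your-spreadsheet>.xlsx")
)

# Cache, then trigger an action so the data is materialized in memory
# before the source file can be deleted out from under the job.
df.cache().show()

# The write now draws on the cached data rather than re-reading the
# original file.
df.write.mode("overwrite").parquet("dbfs:/<file-store>/<your-project>/output")
```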


Also, ensure that you are using a version of the com.crealytics.spark.excel library compatible with your Databricks Runtime version. For Databricks Runtime 13.3, use the following Maven coordinate. 

com.crealytics:spark-excel_2.12:3.4.1_0.20.4
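The coordinate encodes the versions it must match, which is why it changes with the Databricks Runtime. The breakdown below is an annotation of the coordinate above, not additional configuration to apply.

```
com.crealytics : spark-excel_2.12 : 3.4.1_0.20.4
  groupId        artifactId with     Spark version (3.4.1)
                 Scala version       _ plugin version (0.20.4)
                 suffix (2.12)
```

Pick a coordinate whose Spark and Scala versions match those bundled with your Databricks Runtime.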


For Maven coordinates matching other Databricks Runtime versions, refer to the com.crealytics listings on Maven Central.


Note

If possible, use CSV format for intermediary storage instead of Excel because Apache Spark has native support for CSV files.
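If the intermediary data does not need Excel formatting, Spark's built-in CSV reader and writer remove the third-party dependency entirely. A minimal sketch, assuming a Databricks notebook with a `spark` session and a placeholder DBFS path:

```python
# Write the intermediary result as CSV using Spark's native support;
# no external library is required.
df.write.mode("overwrite").option("header", "true").csv(
    "dbfs:/<file-store>/<your-project>/intermediate"
)

# Later stages read it back the same way.
df2 = spark.read.option("header", "true").csv(
    "dbfs:/<file-store>/<your-project>/intermediate"
)
```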