Problem
An Apache Spark streaming job that uses Auto Loader fails with an error stating:
Schema inference for the 'parquet' format from the existing files in the input path <Root Folder> has failed
Cause
One possible cause is that the child directories of the input path contain more than one file format. In the example below, the root folder contains Folder A, whose subfolders Folder B and Folder C hold Avro files and Parquet files respectively.
Root Folder -> Folder A -> Folder B -> Avro files (*.avro)
Root Folder -> Folder A -> Folder C -> Parquet files (*.parquet)
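To confirm that mixed formats are the cause, it can help to count the file extensions under the input path. The following is a minimal sketch that assumes the path is reachable through the local file system (for example, a mounted path); the helper name count_extensions is hypothetical and not part of Auto Loader.

```python
import os
from collections import Counter


def count_extensions(root):
    """Count file extensions under root, recursing into all subdirectories.

    Hypothetical helper for diagnosis only; assumes `root` is a path the
    local file system API can walk.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Files without an extension (such as marker files) are grouped
            # under "<none>".
            ext = os.path.splitext(name)[1].lower() or "<none>"
            counts[ext] += 1
    return counts
```

If the result contains more than one extension, for example both .avro and .parquet, the input path holds mixed formats.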
Solution
To read only one file type with Auto Loader from a directory that contains several file formats, use the pathGlobFilter option.
For example, .option("pathGlobFilter", "*.parquet") sets a suffix pattern so that only Parquet files are processed.
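A fuller reader definition might look like the following sketch. It assumes an existing SparkSession on Databricks; the function name read_parquet_only and its input_path and schema_location parameters are hypothetical placeholders to replace with your own values.

```python
def read_parquet_only(spark, input_path, schema_location):
    """Build an Auto Loader stream that only picks up files ending in .parquet.

    `spark` is assumed to be an active SparkSession on Databricks;
    `input_path` and `schema_location` are placeholder arguments.
    """
    return (
        spark.readStream.format("cloudFiles")
        # Format Auto Loader uses to parse the files and infer the schema.
        .option("cloudFiles.format", "parquet")
        # Location where Auto Loader stores the inferred schema.
        .option("cloudFiles.schemaLocation", schema_location)
        # Skip any file whose name does not match the glob, such as *.avro.
        .option("pathGlobFilter", "*.parquet")
        .load(input_path)
    )
```

Calling read_parquet_only(spark, "<Root Folder>", "<schema location>") returns a streaming DataFrame that you can then write out with writeStream as usual.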
For more information, review the Filtering directories or files using glob patterns (AWS | Azure | GCP) documentation.