Auto Loader streaming job failure with schema inference error

To selectively read a specific type of file using Auto Loader, use the pathGlobFilter option.

Written by harikrishnan.kunhumveettil

Last published at: February 29th, 2024

Problem

Your Apache Spark streaming job using Auto Loader fails with an error stating:

Schema inference for the 'parquet' format from the existing files in the input path <Root Folder> has failed

Cause

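One possible cause for this issue is having multiple file types in the child directories. In the example below, the root folder contains Folder A, which in turn contains nested directories Folder B and Folder C, each holding a different file format. When Auto Loader is configured for Parquet and scans the root folder, it also picks up the Avro files, and schema inference can fail.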

Root Folder -> Folder A -> Folder B -> Avro files (*.avro)
Root Folder -> Folder A -> Folder C -> Parquet files (*.parquet)

Solution

To selectively read a specific type of file using Auto Loader from a directory with diverse file formats, use the pathGlobFilter option.

For example, you can use .option("pathGlobFilter", "*.parquet") to set a suffix pattern for Parquet files, ensuring that only Parquet files are processed.
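
The following is a minimal sketch of how the option can fit into an Auto Loader read, assuming a Databricks notebook where a SparkSession named spark already exists. The helper name build_parquet_stream and the example paths are hypothetical and only used for illustration; adjust them to your environment.

```python
def build_parquet_stream(spark, root_path, schema_location):
    """Build an Auto Loader stream that only picks up *.parquet files.

    spark           -- an existing SparkSession (assumed to be provided by Databricks)
    root_path       -- hypothetical input path, for example the Root Folder above
    schema_location -- hypothetical path where Auto Loader stores the inferred schema
    """
    return (
        spark.readStream.format("cloudFiles")
        # Tell Auto Loader which file format to parse.
        .option("cloudFiles.format", "parquet")
        # Location used to track the inferred schema.
        .option("cloudFiles.schemaLocation", schema_location)
        # Only files whose names match this glob are processed,
        # so the Avro files in Folder B are skipped.
        .option("pathGlobFilter", "*.parquet")
        .load(root_path)
    )
```

In a notebook, this could be called as build_parquet_stream(spark, "<Root Folder>", "<schema location>"), replacing both placeholders with your own paths.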

For more information, review the Filtering directories or files using glob patterns (AWS | Azure | GCP) documentation.