How to handle corrupted Parquet files with different schemas

Learn how to read Parquet files with a specific schema using Databricks.

Written by Adam Pavlacka

Last published at: May 31st, 2022

Problem

Let’s say you have a large list of essentially independent Parquet files with a variety of schemas. You want to read only the files that match a specific schema and skip the files that don’t.

One approach is to read each file in sequence, identify its schema, and union the matching DataFrames together, as sketched below. However, this is impractical when there are hundreds of thousands of files.
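
For illustration, a minimal sketch of that sequential approach might look like the following. It assumes a Databricks notebook, where spark is predefined; the file paths and target schema are hypothetical placeholders.

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Hypothetical target schema -- replace with the schema you actually want.
target_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Hypothetical file paths; in practice this list could come from dbutils.fs.ls().
paths = ["/mnt/data/part-0001.parquet", "/mnt/data/part-0002.parquet"]

# Read each file, keep it only if its schema matches, and union the survivors.
# Note: StructType equality also compares field nullability.
matched = None
for path in paths:
    df = spark.read.parquet(path)
    if df.schema == target_schema:
        matched = df if matched is None else matched.unionByName(df)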

Solution

Set the Apache Spark property spark.sql.files.ignoreCorruptFiles to true and then read the files with the desired schema. Files that don’t match the specified schema are ignored. The resultant dataset contains only data from those files that match the specified schema.

Set the Spark property using spark.conf.set:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
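
Put together, a minimal end-to-end sketch might look like this, again assuming a Databricks notebook; the path and schema are placeholders.

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Skip files that cannot be read, including Parquet files whose schema
# does not match the one requested below.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Hypothetical schema -- replace with the schema you want to keep.
desired_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Files with an incompatible schema are silently skipped; the resulting
# DataFrame contains only data from files that match desired_schema.
df = spark.read.schema(desired_schema).parquet("/mnt/data/")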

Alternatively, you can set this property in your Spark config (AWS | Azure | GCP).
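
When set at the cluster level, the Spark config entry is a space-separated key-value pair:

spark.sql.files.ignoreCorruptFiles true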