Schema mismatch issue while reading Parquet files

Fix the file schema or read the files separately.

Written by lakshay.goel

Last published at: October 23rd, 2024

Problem

When you read data from a source directory that contains multiple Parquet files, the read fails with an error similar to the following.

s3://<file_path>/test_file.PARQUET. Schema conversion error: cannot convert Parquet type INT32 to Photon type string(0)

 

Cause

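There is a schema mismatch between two Parquet files in the same source directory. 

When Databricks reads the files and tries to unify their schemas, it finds the same column stored with different data types in different files. In the error above, the file stores the column as INT32 while the unified schema expects a string, so the read fails.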

 

Solution

Fix the file schema. Identify the columns with schema discrepancies and modify them to have a consistent data type across all files. 
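To find the columns to fix, it can help to compare the per-file schemas side by side. The following is a minimal sketch, not a Databricks utility: the function name find_type_conflicts and its input shape (a mapping of file path to column name and type name) are assumptions made for illustration. In a notebook, one way to build that mapping is to read each file on its own and take dict(df.dtypes).

```python
from collections import defaultdict


def find_type_conflicts(schemas):
    """Return the columns whose type differs between files.

    schemas: hypothetical mapping {file_path: {column_name: type_name}}.
    Result:  {column_name: {type_name: [file_path, ...]}} for conflicting columns only.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for path, columns in schemas.items():
        for name, type_name in columns.items():
            seen[name][type_name].append(path)
    # Keep only the columns that appear with more than one type.
    return {
        name: dict(by_type)
        for name, by_type in seen.items()
        if len(by_type) > 1
    }


# Possible usage in a notebook, assuming `spark` is the active SparkSession
# and `paths` is your own list of Parquet file paths:
#   schemas = {p: dict(spark.read.parquet(p).dtypes) for p in paths}
#   conflicts = find_type_conflicts(schemas)
```

Each entry in the result names a column and lists which files use which type, which shows where the files need to be rewritten.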

If modifying the files is not an option, read each file separately and then union the resulting DataFrames. Each file is then read with its own schema, so the mismatch no longer causes the read itself to fail. 
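The read-separately-then-union approach could look roughly like the sketch below. It is an illustration under stated assumptions rather than the definitive fix: the helper name read_and_union and the casts mapping are hypothetical, and the explicit cast to a common type (for example int to string) is an added step to make the DataFrames agree before the union. The function only uses the standard DataFrame calls spark.read.parquet, withColumn, Column.cast, and unionByName.

```python
from functools import reduce


def read_and_union(spark, paths, casts):
    """Read each Parquet file on its own, align column types, and union the results.

    spark: an active SparkSession (available as `spark` in Databricks notebooks).
    paths: list of Parquet file paths, read one at a time.
    casts: hypothetical mapping of column name -> target type, e.g. {"id": "string"}.
    """
    if not paths:
        raise ValueError("paths must contain at least one file")
    frames = []
    for path in paths:
        df = spark.read.parquet(path)  # each file is read with its own schema
        for name, target in casts.items():
            if name in df.columns:
                # Cast the conflicting column so every DataFrame agrees on its type.
                df = df.withColumn(name, df[name].cast(target))
        frames.append(df)
    # unionByName matches columns by name rather than by position.
    return reduce(lambda left, right: left.unionByName(right), frames)


# Possible usage, with placeholder paths and an assumed column name `id`:
#   combined = read_and_union(
#       spark,
#       ["s3://<file_path>/file_a.parquet", "s3://<file_path>/file_b.parquet"],
#       {"id": "string"},
#   )
```

Casting to string is chosen here because the error above involves an INT32 column that is expected to be a string; pick the target type that fits your data.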

 

Note

Reading the files separately and then unioning them does not work for data type differences such as timestamp and int. In that case, correct the files or load the data into two separate tables.