Problem
You have a table with a set number of columns, each with a defined data type, and you write an additional Parquet file to it. When you run a job to insert that Parquet file, the job fails with the following error.
[FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH] Error while reading file dbfs:/<path-to-parquet-file>.snappy.parquet. Data type mismatches when reading Parquet column . Expected Spark type <data-type>, actual Parquet type <data-type>. SQLSTATE: KD001
Cause
The Parquet file's schema contains data types (for example, string) that are different from the expected Apache Spark data types (for example, integer) in the original table.
The table schema is validated when files are read, not when they are written, so a data type mismatch only surfaces as an error at read time.
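As a rough illustration of why the write succeeds but the read fails, the following sketch (using a hypothetical path `/tmp/example_table` and an `id` column, not taken from this article) appends a file that stores `id` as a string to a directory whose existing files store it as `bigint`. The append completes without error; the mismatch is only detected when the data is read back with the expected schema.

```python
%python
# Sketch only: hypothetical path and column.
spark.range(5).write.mode("overwrite").parquet("/tmp/example_table")   # id stored as bigint

# Appending a file that stores id as a string succeeds -- nothing is validated on write.
spark.range(5).selectExpr("CAST(id AS STRING) AS id") \
    .write.mode("append").parquet("/tmp/example_table")

# Reading back with the expected Spark type (bigint) fails with a Parquet data type mismatch.
spark.read.schema("id BIGINT").parquet("/tmp/example_table").show()
```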
Solution
1. Examine the error message and identify the Parquet file that is causing the issue.
2. Examine the schema of the Parquet file using Spark's built-in Parquet reader and compare it with the original table's schema (a short schema-inspection sketch follows the code below).
3. Fix the Parquet file’s schema by rewriting the data to a separate DataFrame with the correct schema. Use Spark's built-in functions such as `withColumn` or `cast` to convert the data types to the expected Spark data types.
4. Insert the records from the fixed DataFrame back into the original table. Use the following code.
```python
%python
from pyspark.sql.functions import col

# Read the problematic Parquet file, then cast the mismatched column to the expected Spark type.
change_schema = spark.read.parquet("<your-parquet-file>")
change_schema = change_schema.withColumn("id", col("id").cast("bigint"))

# Append the corrected records back to the Parquet table.
change_schema.write.mode("append").format("parquet").save("<path-of-parquet-table>")
```
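For step 2, a minimal way to compare the two schemas, assuming the same placeholder paths as above, is to print each one and look for columns whose types differ.

```python
%python
# Print both schemas and compare the data type of each column.
spark.read.parquet("<your-parquet-file>").printSchema()
spark.read.parquet("<path-of-parquet-table>").printSchema()
```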
Preventative measures
1. Validate the schema of the data before ingesting it into Databricks.
2. Use data type conversions to ensure that the data types in the Parquet files match the expected Spark data types.
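One way to apply both measures, assuming you know the schema the target table expects, is to cast the incoming data to that schema before appending it. The `expected_schema` below (with `id` and `name` columns) is hypothetical; substitute the columns and types of your own table.

```python
%python
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Hypothetical expected schema for the target table.
expected_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

incoming = spark.read.parquet("<your-parquet-file>")

# Cast every column to the expected type so appended files always match
# the Spark data types the table was created with.
for field in expected_schema.fields:
    incoming = incoming.withColumn(field.name, incoming[field.name].cast(field.dataType))

incoming.write.mode("append").format("parquet").save("<path-of-parquet-table>")
```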