Incompatible schema in some files

Learn how to resolve incompatible schema in Parquet files with Databricks.

Written by Adam Pavlacka

Last published at: May 31st, 2022

Problem

The Spark job fails with an exception like the following while reading Parquet files:

Error in SQL statement: SparkException: Job aborted due to stage failure:
Task 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0
(TID 868031, 10.111.245.219, executor 31):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
    at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)

Cause

The java.lang.UnsupportedOperationException in this case is caused by one or more Parquet files in the folder that were written with a schema that is incompatible with the rest of the dataset.
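
For context, a folder can end up in this state when separate writes append files that store different types for the same column; Parquet does not validate new files against the files already in the folder. Here is a minimal sketch of how that can happen, using a placeholder path:

%scala

// Illustration only: the path below is a placeholder.
import spark.implicits._

val path = "/tmp/mixed_schema_parquet"

// The first write stores column "x" as a double.
Seq((1, 1.0)).toDF("id", "x").write.mode("append").parquet(path)

// The second write appends files that store "x" as a long.
Seq((2, 2L)).toDF("id", "x").write.mode("append").parquet(path)

// A plain read picks one schema from the file footers; decoding the other files
// can then fail with a java.lang.UnsupportedOperationException like the one above.
spark.read.parquet(path).show()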

Solution

Find the Parquet files with the incompatible schema and rewrite them with the correct schema. To locate them, try to read the Parquet dataset with schema merging enabled:

%scala

spark.read.option("mergeSchema", "true").parquet(path)

or

%scala

spark.conf.set("spark.sql.parquet.mergeSchema", "true")
spark.read.parquet(path)

If any of the Parquet files have an incompatible schema, the snippets above fail with an error that names the file with the mismatched schema. You can then rewrite that file, as sketched below.
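
Once a file has been identified, rewrite it with the expected schema. The following is a minimal sketch of one way to do that; the file path, output path, and column name are placeholders, and it assumes a column b that should be a long:

%scala

// Hypothetical repair step: read only the offending file, cast the mismatched
// column to the expected type, and write the corrected data to a new location
// before replacing the bad file.
import org.apache.spark.sql.functions.col

val badFile = "/mnt/data/events/part-00020.parquet"   // file reported in the error

spark.read.parquet(badFile)
  .withColumn("b", col("b").cast("long"))             // cast to the type the rest of the dataset uses
  .write
  .mode("overwrite")
  .parquet("/mnt/data/events_fixed")                   // then swap this output in for the bad file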

You can also check whether two schemas are compatible by using the StructType merge method. For example, say you have these two schemas:

%scala

import org.apache.spark.sql.types._

val struct1 = (new StructType)
  .add("a", "int", true)
  .add("b", "long", false)

val struct2 = (new StructType)
  .add("a", "int", true)
  .add("b", "long", false)
  .add("c", "timestamp", true)

Then you can test if they are compatible:

%scala

struct1.merge(struct2).treeString

This will give you:

%scala

res0: String =
"root
|-- a: integer (nullable = true)
|-- b: long (nullable = false)
|-- c: timestamp (nullable = true)
"

However, if struct2 has the following incompatible schema:

%scala

val struct2 = (new StructType)
  .add("a", "int", true)
  .add("b", "string", false)

Then the test will give you the following SparkException:

org.apache.spark.SparkException: Failed to merge fields 'b' and 'b'. Failed to merge incompatible data types LongType and StringType
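
If you prefer a check that does not stop the notebook, the same merge call can be wrapped in scala.util.Try. This helper is not part of the original article; it simply turns the exception into a boolean:

%scala

import scala.util.Try
import org.apache.spark.sql.types.StructType

// Returns true when the two schemas merge cleanly, false when merge throws.
def schemasCompatible(s1: StructType, s2: StructType): Boolean =
  Try(s1.merge(s2)).isSuccess

schemasCompatible(struct1, struct2)  // false with the incompatible struct2 above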

