Problem
The Spark job fails with an exception like the following while reading Parquet files:
Error in SQL statement: SparkException: Job aborted due to stage failure: Task 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0 (TID 868031, 10.111.245.219, executor 31): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)
Cause
The java.lang.UnsupportedOperationException in this instance is thrown because one or more Parquet files in the folder were written with a schema that is incompatible with the rest of the dataset. In the stack trace above, a file stores double values (PlainDoubleDictionary), but the reader expects long values and fails when it calls decodeToLong.
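The mismatch can be reproduced with a minimal sketch. This is a hypothetical example, not the original job: the path /tmp/events and the column name value are assumptions, and it presumes the Databricks notebook environment where spark and its implicits are available.

```scala
import spark.implicits._

// Hypothetical folder; two writers use different types for the same column.
Seq(1L, 2L).toDF("value").write.parquet("/tmp/events")                   // value: long
Seq(1.5, 2.5).toDF("value").write.mode("append").parquet("/tmp/events")  // value: double

// Without schema merging, Spark infers the schema from one file's footer.
// If it picks the long schema, decoding the double file throws
// java.lang.UnsupportedOperationException at decodeToLong.
spark.read.parquet("/tmp/events").show()
```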
Solution
Find the Parquet files with the mismatched schema and rewrite them with the correct one. To locate them, try to read the Parquet dataset with schema merging enabled:
%scala
spark.read.option("mergeSchema", "true").parquet(path)
or
%scala
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
spark.read.parquet(path)
If you do have Parquet files with incompatible schemas, the snippets above fail with an error that names the file with the mismatched schema.
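You can also narrow down the offending file by reading each part file individually with the schema you expect. The following is a sketch, assuming path and expectedSchema are already defined and that the files live on a Hadoop-compatible filesystem:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))
  .foreach { file =>
    try {
      // Force a full scan so every value is actually decoded.
      spark.read.schema(expectedSchema).parquet(file).foreach(_ => ())
    } catch {
      case e: Exception => println(s"Incompatible file: $file -> ${e.getMessage}")
    }
  }
```

Once identified, the bad files can be rewritten with the correct schema (for example, by reading them with their actual schema, casting the columns, and writing them back).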
You can also check if two schemas are compatible by using the merge method. For example, let’s say you have these two schemas:
%scala
import org.apache.spark.sql.types._

val struct1 = (new StructType)
  .add("a", "int", true)
  .add("b", "long", false)

val struct2 = (new StructType)
  .add("a", "int", true)
  .add("b", "long", false)
  .add("c", "timestamp", true)
Then you can test if they are compatible:
%scala
struct1.merge(struct2).treeString
This will give you:
res0: String =
"root
 |-- a: integer (nullable = true)
 |-- b: long (nullable = false)
 |-- c: timestamp (nullable = true)
"
However, if struct2 has the following incompatible schema:
%scala
val struct2 = (new StructType)
  .add("a", "int", true)
  .add("b", "string", false)
Then the test will give you the following SparkException:
org.apache.spark.SparkException: Failed to merge fields 'b' and 'b'. Failed to merge incompatible data types LongType and StringType
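Since merge throws on incompatible schemas, you can wrap it to turn the check into a pass/fail validation before writing to an existing dataset. A minimal sketch, assuming struct1 and struct2 are defined as above:

```scala
import scala.util.{Failure, Success, Try}

Try(struct1.merge(struct2)) match {
  case Success(merged) => println(s"Compatible:\n${merged.treeString}")
  case Failure(e)      => println(s"Incompatible schemas: ${e.getMessage}")
}
```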