Apache Spark read fails with Corrupted parquet page error

Problem

You are trying to read data in Parquet or Delta format and you get a Corrupted parquet page error.

java.lang.RuntimeException: Corrupted parquet page
    at com.databricks.sql.io.parquet.NativeColumnReader.readBatchNative(Native Method)
    at com.databricks.sql.io.parquet.NativeColumnReader.readBatch(NativeColumnReader.java:477)
    at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.nextBatch(DatabricksVectorizedParquetRecordReader.java:346)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:236)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:204)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
    at org.apache.spark.scheduler.Task.run(Task.scala:112)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Cause

This error can occur when the Parquet fast reader, native reader, and vectorized reader are enabled.
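
You can check the current values of these settings from a notebook. For example, in Python (the "unset" fallback is returned when a key is not explicitly configured):

# Print the current Parquet reader settings for this session.
for key in [
    "spark.databricks.io.parquet.fastreader.enabled",
    "spark.databricks.io.parquet.nativeReader.enabled",
    "spark.sql.parquet.enableVectorizedReader",
]:
    print(key, "=", spark.conf.get(key, "unset"))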

Solution

If you encounter the Corrupted parquet page error, disable the fast reader, native reader, and vectorized reader in your cluster or notebook configuration, and then retry the read operation.
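
For example, the following Python sketch disables all three readers for the current session and retries the read; the path /mnt/data/events is a placeholder for your own Parquet or Delta location.

# Disable the fast, native, and vectorized Parquet readers for this session.
spark.conf.set("spark.databricks.io.parquet.fastreader.enabled", "false")
spark.conf.set("spark.databricks.io.parquet.nativeReader.enabled", "false")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

# Retry the read that previously failed (placeholder path).
df = spark.read.format("delta").load("/mnt/data/events")
df.show()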

If the error persists even with these options disabled, open a case with Databricks support.

Note

If you apply these changes at the notebook level, they apply only to the current Spark session. If you apply them at the cluster level, they apply to all notebooks attached to the cluster.

Disable fast reader

Set spark.databricks.io.parquet.fastreader.enabled to false in the cluster’s Spark configuration to disable the fast Parquet reader at the cluster level.
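
In the cluster's Spark config field, settings are entered one per line as space-separated key-value pairs, for example:

spark.databricks.io.parquet.fastreader.enabled false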

You can also disable the fast reader at the notebook level by running:

spark.conf.set("spark.databricks.io.parquet.fastreader.enabled","false")

Disable native reader

Set spark.databricks.io.parquet.nativeReader.enabled to false in the cluster’s Spark configuration to disable the native Parquet reader at the cluster level.
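
The corresponding cluster Spark config entry looks like:

spark.databricks.io.parquet.nativeReader.enabled false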

You can also disable the native reader at the notebook level by running:

spark.conf.set("spark.databricks.io.parquet.nativeReader.enabled","false")

Disable vectorized reader

Set spark.sql.parquet.enableVectorizedReader to false in the cluster’s Spark configuration to disable the vectorized Parquet reader at the cluster level.
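
The corresponding cluster Spark config entry looks like:

spark.sql.parquet.enableVectorizedReader false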

You can also disable the vectorized reader at the notebook level by running:

spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")

Note

The vectorized Parquet reader enables native record-level filtering using push-down filters, improving memory locality and cache utilization. If you disable the vectorized Parquet reader, there may be a minor performance impact.