Unknown Apache Spark internal error when running Delta table queries

Reorganize the folder structure for the specific partition causing the problem.

Written by Guilherme Leite

Last published at: November 4th, 2024

Problem

While querying a Delta table, you encounter the following error.

[INTERNAL_ERROR] The Spark SQL phase planning failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.

 

This error may appear consistently for a specific table, even when the SELECT statement is limited to a small number of rows.

Cause

This is a generic Apache Spark error message that does not, by itself, indicate a specific cause. To identify the cause, check the full stack trace in the driver's log4j logs. In this case, the stack trace contains the following assertion error.

(…)
Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
    Partition column name list #0: partition_date
    Partition column name list #1: partition_date, partition_date
For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:
(…)

 

The root cause in this case is a malformed folder hierarchy in one partition of the source table. Spark expects a specific folder structure for partitioned tables, and any deviation from this structure, even in a single partition, can lead to internal errors. When this is the cause, the error appears only when a query reads the problematic partition.
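The check that Spark reports can be approximated outside Spark. The following is a minimal sketch, not Spark's own implementation: it assumes you already have a list of directory paths (for example, copied from the stack trace), and the helper names `partition_columns` and `conflicting_directories` are hypothetical. It groups directories by the list of partition column names found in each path; more than one group indicates a conflict.

```python
from collections import defaultdict


def partition_columns(path: str, table_root: str) -> tuple[str, ...]:
    """Return the partition column names found in `path` below `table_root`."""
    # Assumption: `path` starts with `table_root`.
    relative = path[len(table_root):].strip("/")
    # Partition directories are named `column=value`.
    return tuple(
        segment.split("=", 1)[0]
        for segment in relative.split("/")
        if "=" in segment
    )


def conflicting_directories(paths, table_root):
    """Group directories by partition column list; return {} when all agree."""
    groups = defaultdict(list)
    for path in paths:
        groups[partition_columns(path, table_root)].append(path)
    return dict(groups) if len(groups) > 1 else {}


# Hypothetical listing that mirrors the paths used in this article.
listing = [
    "dbfs:/your-table/partition_date=2022-01-01",
    "dbfs:/your-table/partition_date=2022-01-02",
    "dbfs:/your-table/partition_date=2022-09-03/partition_date=2022-09-02",
]
conflicts = conflicting_directories(listing, "dbfs:/your-table")
for columns, dirs in conflicts.items():
    print(", ".join(columns), "->", dirs)
```

For this listing, the sketch produces two groups, one for `partition_date` and one for `partition_date, partition_date`, which corresponds to the two partition column name lists in the assertion message.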

 

Solution

Reorganize the folder structure for the specific partition causing the problem. In general: 

  • Ensure that data files, such as Parquet files, live only in leaf directories of the partitioned table, never in non-leaf directories.
  • Ensure that directories at the same level use the same partition column name, so that every partition of a given column sits at the same depth.
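The leaf-directory rule can be checked with a short script. This is a sketch under stated assumptions: the table folder is reachable as a local filesystem path (for example, through a local mount of the storage location), data files end in `.parquet`, and only folders named `column=value` count as partition directories. The function name `data_files_in_non_leaf_dirs` is hypothetical.

```python
import os


def data_files_in_non_leaf_dirs(table_root, extension=".parquet"):
    """Return data files located in a directory that also has partition subdirectories."""
    misplaced = []
    for current_dir, subdirs, files in os.walk(table_root):
        # Only `column=value` folders count as partitions here, so
        # metadata folders without "=" in their name are ignored.
        partition_subdirs = [d for d in subdirs if "=" in d]
        if not partition_subdirs:
            continue  # Leaf directory: data files are expected here.
        for name in files:
            if name.endswith(extension):
                misplaced.append(os.path.join(current_dir, name))
    return sorted(misplaced)
```

An empty result means no data file was found next to a partition subdirectory. Any path returned points at a file that should be moved into the leaf directory of the partition it belongs to.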

 

Example 

The following excerpt from the stack trace lists directories that have unexpected files or inconsistent partition column names. 

dbfs:/your-table/partition_date=2022-01-01
dbfs:/your-table/partition_date=2022-01-02
(...)
dbfs:/your-table/partition_date=2022-09-03/partition_date=2022-09-02
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:576)
(...)

 

In this case, the last directory listed is the problem: one partition_date folder is nested inside another.

dbfs:/your-table/partition_date=2022-09-03/partition_date=2022-09-02

 

This folder should be split into two sibling folders at the same level:

dbfs:/your-table/partition_date=2022-09-02
dbfs:/your-table/partition_date=2022-09-03
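The target location for a nested folder can be derived from its path before anything is moved. The sketch below only computes a source and target pair and does not touch storage. It assumes that the innermost `column=value` segment names the partition the files belong to, as in the example above; confirm this for your data before moving any files. The function name `plan_flattening` is hypothetical.

```python
def plan_flattening(nested_dir: str, table_root: str) -> tuple[str, str]:
    """Return (source, target) for a partition directory nested inside another one.

    Assumption: the innermost `column=value` segment identifies the
    partition that the files in `nested_dir` belong to.
    """
    relative = nested_dir[len(table_root):].strip("/")
    segments = relative.split("/")
    if len(segments) < 2:
        raise ValueError(f"{nested_dir} is not nested below another partition")
    # Place the innermost partition folder directly under the table root.
    target = f"{table_root.rstrip('/')}/{segments[-1]}"
    return nested_dir, target


source, target = plan_flattening(
    "dbfs:/your-table/partition_date=2022-09-03/partition_date=2022-09-02",
    "dbfs:/your-table",
)
print(f"move {source} -> {target}")
```

Reviewing the planned move before running it with your storage tooling helps avoid placing files in the wrong partition.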

 

To prevent similar issues in the future, ensure that the folder structure for partitioned tables follows Spark's expectations.