Problem
When inserting data into a Delta table with a schema that contains a StructField of type NULL, you encounter an InvalidSchemaException.
Example Error Message
Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 22) (10.101.191.43 executor 0): org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: optional group <field-name> {}
Cause
Empty STRUCT fields are not permitted in Parquet format.
The issue arises when a StructField is defined with an empty StructType. In the following example, the col3 field is defined as a STRUCT with no fields.
from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([
    StructField("col1", FloatType(), nullable=True),
    StructField("col2", FloatType(), nullable=True),
    StructField("col3", StructType([]), nullable=True)
])
Solution
Define at least one nested field, with its type, for any StructType used within a StructField.
Example
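from pyspark.sql.types import StructType, StructField, FloatType, StringType

schema = StructType([
    StructField("col1", FloatType(), nullable=True),
    StructField("col2", FloatType(), nullable=True),
    # col3 now has one nested field, so the Parquet group is no longer empty
    StructField("col3", StructType([StructField("nested_col", StringType())]), nullable=True)
])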
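In larger schemas it may not be obvious which field is the empty STRUCT. The following is a minimal sketch, not part of the original article, that walks the JSON form of a schema (the dictionary returned by schema.jsonValue() in PySpark) and lists the path of every STRUCT with no fields. The helper name find_empty_structs and the sample dictionary are hypothetical. The dictionary is written by hand to mirror the failing schema from the Cause section, so the sketch runs without a Spark session.

```python
def find_empty_structs(dtype, path=""):
    """Return the dotted path of every struct type that has no fields.

    `dtype` is a type in Spark's JSON schema form: a plain string for
    primitive types (for example "float") or a dict for complex types.
    """
    if not isinstance(dtype, dict):
        return []  # primitive type, nothing to inspect

    kind = dtype.get("type")
    if kind == "struct":
        fields = dtype.get("fields", [])
        if not fields:
            return [path or "<root>"]
        found = []
        for field in fields:
            child = f"{path}.{field['name']}" if path else field["name"]
            found.extend(find_empty_structs(field["type"], child))
        return found
    if kind == "array":
        return find_empty_structs(dtype["elementType"], f"{path}.element")
    if kind == "map":
        return (find_empty_structs(dtype["keyType"], f"{path}.key")
                + find_empty_structs(dtype["valueType"], f"{path}.value"))
    return []


# Hand-written JSON form of the failing schema (hypothetical sample data).
# With PySpark available, you would pass schema.jsonValue() instead.
failing_schema = {
    "type": "struct",
    "fields": [
        {"name": "col1", "type": "float", "nullable": True, "metadata": {}},
        {"name": "col2", "type": "float", "nullable": True, "metadata": {}},
        {"name": "col3", "type": {"type": "struct", "fields": []},
         "nullable": True, "metadata": {}},
    ],
}

empty_paths = find_empty_structs(failing_schema)
print(empty_paths)
```

Each path returned names a field that needs at least one nested field defined before the write. An empty list means the check found no empty STRUCT in the schema.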
For more information, refer to the What is a view? (AWS | Azure | GCP) documentation.