Allow spaces and special characters in nested column names with Delta tables

Upgrade to Databricks Runtime 10.2 or later and use column mapping mode to allow spaces and special characters in column names.

Written by shanmugavel.chandrakasu

Last published at: October 26th, 2022

Problem

It is common for JSON files to contain nested struct columns, and the names of those nested columns can contain spaces.

When you use Apache Spark to read a JSON file with spaces in the nested column names and then write the data to a Delta table, you get an AnalysisException error message.

For example, if you read a JSON file, evaluate the DataFrame, and then write it out to a Delta table on Databricks Runtime 10.2 or below, it returns an error.

%scala

// Read the JSON file, then attempt to write it out as a Delta table.
val df = spark.read.json("<path-to-JSON-file>")
df.write.format("delta").mode("overwrite").save("<path-to-delta-table>")

The expected error message is visible in the stack trace.

AnalysisException: Attribute name "stage_info.Accumulables.Count Failed Values" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.

Cause

One of the nested column names in the DataFrame contains spaces, which is preventing you from writing the output to the Delta table.

Solution

If your source files have a flat schema, you can use withColumnRenamed to rename columns and remove the spaces, as sketched below. However, withColumnRenamed only renames top-level columns, so this approach quickly gets complicated with a nested schema.
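This is a minimal sketch of the rename approach; the column names are hypothetical stand-ins for your own.

%scala

// A sketch only: "Count Failed Values" and "Task ID" are hypothetical
// column names. withColumnRenamed operates on top-level columns and is a
// no-op for names that do not exist in the DataFrame.
val renamed = df
    .withColumnRenamed("Count Failed Values", "count_failed_values")
    .withColumnRenamed("Task ID", "task_id")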

withColumn can be used to flatten nested columns, renaming the existing column (with spaces) to a new column name (without spaces), as in the sketch that follows. With a large schema, flattening all of the nested columns in the DataFrame can be a tedious task.
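A minimal sketch of flattening one nested field, assuming a hypothetical struct column parent with a nested field named child name; backticks quote field names that contain spaces.

%scala

import org.apache.spark.sql.functions.col

// Pull the nested field up to the top level under a space-free name, then
// drop the original struct. `parent` and `child name` are hypothetical.
val flattened = df
    .withColumn("child_name", col("parent.`child name`"))
    .drop("parent")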

If your clusters are using Databricks Runtime 10.2 or above, you can avoid the issue entirely by enabling column mapping mode. Column mapping mode allows the use of spaces, as well as the , ; { } ( ) \n \t = characters, in table column names.

Set the Delta table property delta.columnMapping.mode to name to enable column mapping mode.
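If the Delta table already exists, you can set the same property with an ALTER TABLE statement. This is a sketch assuming a table registered in the metastore as my_table; enabling column mapping also requires upgrading the table protocol to reader version 2 and writer version 5, as shown.

%scala

// Assumes an existing table named `my_table`. Enabling column mapping
// requires raising the table protocol to reader version 2 and writer
// version 5.
spark.sql("""
  ALTER TABLE my_table SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )
""")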

This sample code sets up a Delta table that supports nested column names with spaces. It requires a cluster running Databricks Runtime 10.2 or above.

%scala

import io.delta.tables.DeltaTable

// Read the JSON file to capture its schema.
val df = spark.read.json("<path-to-JSON-file>")

// Create the Delta table with column mapping enabled. Column mapping
// requires reader version 2 and writer version 5 of the table protocol.
DeltaTable.create()
    .addColumns(df.schema)
    .property("delta.minReaderVersion", "2")
    .property("delta.minWriterVersion", "5")
    .property("delta.columnMapping.mode", "name")
    .location("<path-to-delta-table>")
    .execute()

// Append the JSON data to the column-mapped Delta table.
df.write.format("delta").mode("append").save("<path-to-delta-table>")
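To confirm the write succeeded, you can read the table back and inspect the schema; the nested field names, spaces included, should appear unchanged.

%scala

// Read the Delta table back and print its schema.
spark.read.format("delta").load("<path-to-delta-table>").printSchema()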