Column name error when using Apache Spark MLlib feature transformers

When flattening the DataFrame, rename nested columns using an underscore instead of a dot.

Written by Shyamprasad Miryala

Last published at: November 15th, 2024

Problem

When trying to use the Apache Spark MLlib StringIndexer, or other feature transformers, with columns whose names contain dots (such as columnname_part1.columnname_part2), you receive an error that the column name cannot be resolved.

 

org.apache.spark.sql.AnalysisException: Cannot resolve column name "columnname_part1.columnname_part2" among (columnname_part1.columnname_part2, columnname_part1.columnname_part3); did you mean to quote the `columnname_part1.columnname_part2` column?

 

Alternatively, you may receive an error that the column does not exist.

 

org.apache.spark.SparkException: Input column `columnname_part1.columnname_part2` does not exist.

 

Cause

Spark uses dots to identify fields inside nested structures (like part1 and part2 in columnname_part1.columnname_part2). When a top-level column name itself contains a dot, Spark interprets the name as a reference to a nested field and fails to resolve the column.
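As a rough illustration (plain Python, not Spark's actual resolver), an unquoted dotted name is read as a path of struct fields, so a flat column whose name literally contains the dot never matches:

```python
# Simplified model of how a dotted column reference is interpreted.
# (Assumption: this mimics, but is not, Spark's real name resolution.)
name = "columnname_part1.columnname_part2"

# Spark splits the unquoted reference into a field path ...
path = name.split(".")
print(path)  # ['columnname_part1', 'columnname_part2']

# ... so it looks for a struct column "columnname_part1" containing a field
# "columnname_part2", not for a flat column whose name includes the dot.
```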

 

Solution

Rename the columns by replacing dots with underscores. Spark then no longer interprets the names as nested fields, removing the need to use backticks.
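A minimal sketch of the rename step, in Python for brevity (the helper name is hypothetical; in PySpark the result could be applied per column with `withColumnRenamed`):

```python
def sanitize_column_names(columns):
    """Replace dots with underscores so Spark stops treating the names as nested fields."""
    return [c.replace(".", "_") for c in columns]

cols = ["columnname_part1.columnname_part2", "columnname_part1.columnname_part3"]
print(sanitize_column_names(cols))
# ['columnname_part1_columnname_part2', 'columnname_part1_columnname_part3']

# In PySpark, the rename could then be applied per column, e.g.:
# for old, new in zip(df.columns, sanitize_column_names(df.columns)):
#     df = df.withColumnRenamed(old, new)
```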

 

columnname_part1_columnname_part2 

 

Example in context

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// flattenedDf is the DataFrame whose dotted column names have been replaced with underscores
val si = new StringIndexer().setInputCol("columnname_part1_columnname_part2").setOutputCol("columnname_part1_indexed")
val pipeline = new Pipeline().setStages(Array(si))
pipeline.fit(flattenedDf).transform(flattenedDf).show()

 

To avoid this issue in the future, standardize on underscores in place of dots in column names across your data.

 

Important

Approaches like wrapping column names in backticks or flattening nested data may help, but have disadvantages. 

Wrapping column names in backticks makes the code harder to read and manage.

Flattening the nested data by splitting it into separate columns with unique names creates a lot of new columns, making the data harder to work with and slowing down processing.