Problem
When you use the Apache Spark MLlib StringIndexer, or another feature transformer, with columns whose names contain dots (for example, columnname_part1.columnname_part2), you receive an error that the column name cannot be resolved.
org.apache.spark.sql.AnalysisException: Cannot resolve column name "columnname_part1.columnname_part2" among (columnname_part1.columnname_part2, columnname_part1.columnname_part3); did you mean to quote the `columnname_part1.columnname_part2` column?
Alternatively, you may receive an error that the column does not exist.
org.apache.spark.SparkException: Input column `columnname_part1.columnname_part2` does not exist.
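For illustration, a minimal sketch that reproduces the error; the DataFrame contents and output column name here are hypothetical:

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// A flat column whose name happens to contain a dot.
val df = Seq("a", "b", "a").toDF("columnname_part1.columnname_part2")

// Throws: the transformer cannot resolve the dotted name.
new StringIndexer()
  .setInputCol("columnname_part1.columnname_part2")
  .setOutputCol("indexed")
  .fit(df)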
Cause
Spark uses dots in column references to identify nested structures in data. When a column name contains a dot, Spark tries to interpret the name as nested data: columnname_part1.columnname_part2 is read as a field named columnname_part2 inside a struct column named columnname_part1, not as a single flat column.
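You can see this with the df sketched above; a plain reference fails, while a backtick-quoted one resolves the flat column:

// The dot is parsed as struct-field access, so this fails with
// an AnalysisException even though the column exists.
df.select("columnname_part1.columnname_part2")

// Backticks tell Spark to treat the whole string as one flat name.
df.select("`columnname_part1.columnname_part2`")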
Solution
Rename the affected columns, replacing dots with underscores. Spark then no longer sees the names as nested references, removing the need for backticks.
columnname_part1_columnname_part2
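One way to do the renaming, assuming the df from the sketch above; the result is the flattenedDf used in the example that follows:

// withColumnRenamed matches the exact top-level name, so no
// backticks are needed here.
val flattenedDf = df.withColumnRenamed(
  "columnname_part1.columnname_part2",
  "columnname_part1_columnname_part2")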
Example in context
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val si = new StringIndexer().setInputCol("columnname_part1_columnname_part2").setOutputCol("columnname_part1_indexed")
val pipeline = new Pipeline().setStages(Array(si))
pipeline.fit(flattenedDf).transform(flattenedDf).show()
To avoid this issue in the future, use underscores as a standard replacement for dots in column names across your data.
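For example, a minimal sketch that replaces dots in all column names in one pass:

// Rebuild the DataFrame with every dot replaced by an underscore.
val sanitizedDf = df.toDF(df.columns.map(_.replace(".", "_")): _*)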
Important
Approaches like wrapping column names in backticks or flattening nested data may help, but have disadvantages.
Wrapping column names in backticks makes the code harder to read and manage.
Flattening the nested data by splitting it into separate columns with unique names creates a lot of new columns, making the data harder to work with and slowing down processing.
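For comparison, a minimal sketch of the flattening approach, assuming a hypothetical nestedDf in which columnname_part1 is a genuine struct column:

// Each field of the struct becomes its own top-level column; wide
// structs multiply the column count quickly.
val exploded = nestedDf.select("columnname_part1.*")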