Problem
When using the Apache Spark withColumn operation multiple times (such as in a loop), you notice slow performance or a StackOverflowException.
Cause
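Each withColumn call introduces a new projection internally. Calling it many times, for example in a loop that adds or transforms one column per iteration, generates a large execution plan, which slows down planning and can overflow the stack.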
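For contrast, the looping pattern described above might look like the following minimal sketch. The object name, helper name, column names, and sample data are hypothetical and only serve the illustration. It assumes a local SparkSession is available. Calling explain(true) prints the plans, where the analyzed logical plan would typically show one Project node per withColumn call.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.IntegerType

// Hypothetical illustration of the pattern to avoid.
object WithColumnLoopSketch {
  // One withColumn call per column, so each iteration adds another projection.
  def castWithLoop(df: DataFrame): DataFrame =
    df.columns.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, acc(name).cast(IntegerType))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("withColumn-loop-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: three string columns holding numeric text.
    val df1 = Seq(("1", "2", "3")).toDF("a", "b", "c")

    // Print the parsed, analyzed, optimized, and physical plans.
    castWithLoop(df1).explain(true)

    spark.stop()
  }
}
```

With only three columns the plan stays small. The problem tends to appear when the loop runs over hundreds or thousands of columns.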
Solution
Use a single select operation on multiple columns at once. For example, the following select casts all columns to IntegerType, performing all transformations in a single operation instead of adding one projection per column.
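import org.apache.spark.sql.types.IntegerType
// Cast every column of df1 in one select, which produces a single projection.
val df2 = df1.select(df1.columns.map { col =>
  df1(col).cast(IntegerType)
}: _*)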
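The same approach can also be applied when only some columns need to change. The following is a minimal sketch under stated assumptions, not a definitive implementation: the object name, the castSelected helper, and the sample columns are hypothetical. It builds the full column list first and passes it to one select, so untouched columns are carried through unchanged.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.IntegerType

// Hypothetical helper: cast a chosen subset of columns in a single select.
object SelectCastSketch {
  def castSelected(df: DataFrame, toCast: Set[String]): DataFrame =
    df.select(df.columns.map { name =>
      // Cast the listed columns and keep their names; pass the rest through.
      if (toCast.contains(name)) df(name).cast(IntegerType).as(name)
      else df(name)
    }: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-cast-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: "a" and "b" hold numeric text, "label" stays a string.
    val df1 = Seq(("1", "2", "x")).toDF("a", "b", "label")

    val df2 = castSelected(df1, Set("a", "b"))
    df2.printSchema()

    spark.stop()
  }
}
```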
For more information, see the withColumn section of the Spark Dataset documentation.