Problem
You are using DBConnect (AWS | Azure | GCP) to run a PySpark transformation on a DataFrame with more than 100 columns when you get a stack overflow error.
py4j.protocol.Py4JJavaError: An error occurred while calling o945.count.
: java.lang.StackOverflowError
    at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
    at java.lang.Class.getEnclosingClass(Class.java:1272)
    at java.lang.Class.getSimpleBinaryName(Class.java:1443)
    at java.lang.Class.getSimpleName(Class.java:1309)
    at org.apache.spark.sql.types.DataType.typeName(DataType.scala:67)
    at org.apache.spark.sql.types.DataType.simpleString(DataType.scala:82)
    at org.apache.spark.sql.types.DataType.sql(DataType.scala:90)
    at org.apache.spark.sql.util.ProtoSerializer.serializeDataType(ProtoSerializer.scala:3207)
    at org.apache.spark.sql.util.ProtoSerializer.serializeAttrRef(ProtoSerializer.scala:3610)
    at org.apache.spark.sql.util.ProtoSerializer.serializeAttr(ProtoSerializer.scala:3600)
    at org.apache.spark.sql.util.ProtoSerializer.serializeNamedExpr(ProtoSerializer.scala:3537)
    at org.apache.spark.sql.util.ProtoSerializer.serializeExpr(ProtoSerializer.scala:2323)
    at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:3001)
    at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:2998)
Performing the same operation in a notebook works correctly and does not produce an error.
Example code
You can reproduce the error with this sample code.
It creates a DataFrame with 200 columns and renames them all.
This sample code runs correctly in a notebook, but results in an error when run in DBConnect.
%python

df = spark.createDataFrame([{str(i): i for i in range(200)}])

for col in df.columns:
    df = df.withColumnRenamed(col, col + "_a")

df.collect()
Cause
When you run code in DBConnect, some functions are handled on the remote cluster driver, while others are handled locally, on the client PC.
Each withColumnRenamed call wraps the existing logical plan in another projection, and DBConnect serializes the resulting plan on the client before sending it to the cluster. If the local Spark driver does not have enough memory, and in particular enough thread stack space for this recursive serialization, you get a StackOverflowError.
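It is the depth of the plan, not the width of the DataFrame, that drives the recursion. As an illustration (a sketch, not the documented fix), the renames in the example above can be expressed as a single projection with toDF, which keeps the plan flat:

%python

# Illustrative sketch: rename every column in one projection instead of
# one nested projection per column.
df = spark.createDataFrame([{str(i): i for i in range(200)}])

# toDF(*names) expresses all 200 renames as a single SELECT over the
# source plan, so the plan depth stays constant.
df = df.toDF(*[col + "_a" for col in df.columns])

df.collect()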
Solution
You should increase the memory and thread stack size allocated to the Apache Spark driver on the local PC.
- Run databricks-connect get-spark-home on your local PC to get the ${spark_home} value.
- Navigate to the ${spark_home}/conf/ folder.
- Open the spark-defaults.conf file.
- Add the following settings to the spark-defaults.conf file:
spark.driver.memory 4g
spark.driver.extraJavaOptions -Xss32M
- Save the changes.
- Restart DBConnect.
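After restarting, you can confirm that the client picked up the new settings by reading them back from the local Spark configuration. A minimal check, assuming an active DBConnect session named spark:

%python

# Read the driver settings back from the local Spark configuration.
# The values should match what you added to spark-defaults.conf.
conf = spark.sparkContext.getConf()
print(conf.get("spark.driver.memory"))            # expected: 4g
print(conf.get("spark.driver.extraJavaOptions"))  # expected: -Xss32M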