ProtoSerializer stack overflow error in DBConnect

A stack overflow error in DBConnect indicates that you need to allocate more memory to the Apache Spark driver on the local PC.

Written by ashritha.laxminarayana

Last published at: May 9th, 2022

Problem

You are using DBConnect (AWS | Azure | GCP) to run a PySpark transformation on a DataFrame with more than 100 columns when you get a stack overflow error.

py4j.protocol.Py4JJavaError: An error occurred while calling o945.count.
: java.lang.StackOverflowError
    at java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
    at java.lang.Class.getEnclosingClass(Class.java:1272)
    at java.lang.Class.getSimpleBinaryName(Class.java:1443)
    at java.lang.Class.getSimpleName(Class.java:1309)
    at org.apache.spark.sql.types.DataType.typeName(DataType.scala:67)
    at org.apache.spark.sql.types.DataType.simpleString(DataType.scala:82)
    at org.apache.spark.sql.types.DataType.sql(DataType.scala:90)
    at org.apache.spark.sql.util.ProtoSerializer.serializeDataType(ProtoSerializer.scala:3207)
    at org.apache.spark.sql.util.ProtoSerializer.serializeAttrRef(ProtoSerializer.scala:3610)
    at org.apache.spark.sql.util.ProtoSerializer.serializeAttr(ProtoSerializer.scala:3600)
    at org.apache.spark.sql.util.ProtoSerializer.serializeNamedExpr(ProtoSerializer.scala:3537)
    at org.apache.spark.sql.util.ProtoSerializer.serializeExpr(ProtoSerializer.scala:2323)
    at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:3001)
    at org.apache.spark.sql.util.ProtoSerializer$$anonfun$$nestedInanonfun$serializeCanonicalizable$1$1.applyOrElse(ProtoSerializer.scala:2998)

Performing the same operation in a notebook works correctly and does not produce an error.

Example code

You can reproduce the error with this sample code.

It creates a DataFrame with 200 columns and renames them all.

This sample code runs correctly in a notebook, but results in an error when run in DBConnect.

%python

# Build a single-row DataFrame with 200 columns, then rename each
# column one at a time. Each rename adds a layer to the logical plan.
df = spark.createDataFrame([{str(i) : i for i in range(200)}])
for col in df.columns:
  df = df.withColumnRenamed(col, col + "_a")
df.collect()

Cause

When you run code in DBConnect, some functions are handled on the remote cluster driver, but others are handled locally, by the Apache Spark driver on the client PC.

If the local driver does not have enough memory and stack space allocated, serializing a large query plan, such as the one produced by the 200 chained renames above, fails with this error.

Solution

You should increase the memory allocated to the Apache Spark driver on the local PC.

  1. Run databricks-connect get-spark-home on your local PC to get the ${spark_home} value.
  2. Navigate to the ${spark_home}/conf/ folder.
  3. Open the spark-defaults.conf file.
  4. Add the following settings to the spark-defaults.conf file:
    spark.driver.memory 4g
    spark.driver.extraJavaOptions -Xss32M
  5. Save the changes.
  6. Restart DBConnect.
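The steps above can be scripted in one pass (a sketch; it assumes the databricks-connect CLI from step 1 is on your PATH, and it appends the settings rather than editing the file by hand):

```shell
# Step 1: find the Spark home that DBConnect uses locally
spark_home=$(databricks-connect get-spark-home)

# Steps 2-5: append the driver memory and stack-size settings
# to spark-defaults.conf in that Spark home's conf folder
cat >> "${spark_home}/conf/spark-defaults.conf" <<'EOF'
spark.driver.memory 4g
spark.driver.extraJavaOptions -Xss32M
EOF
```

After the file is saved, restart DBConnect (step 6) so the new driver settings take effect.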

Warning

DBConnect only works with supported Databricks Runtime versions. Ensure that you are using a supported runtime on your cluster before using DBConnect.