Serialized task is too large

Learn what to do when a serialized task is too large in Databricks.

Written by Adam Pavlacka

Last published at: May 11th, 2022

If you see the following error message, you may be able to fix it by changing the Spark config (AWS | Azure) when you start the cluster.

Serialized task XXX:XXX was XXX bytes, which exceeds max allowed: spark.rpc.message.maxSize (XXX bytes).
Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.

To change the Spark config, set the following property. The value is in MiB and defaults to 128:

spark.rpc.message.maxSize

Tuning the configuration is one option, but this error message usually means that you are sending large objects from the driver to the executors, for example by calling parallelize with a large list or by converting a large R DataFrame to a Spark DataFrame.
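Before changing any settings, it can help to check how large a driver-side object actually is. The following is a minimal sketch, not part of the original article: it uses Python's pickle module as a rough stand-in for Spark's task serialization, so the real task size will differ. The helper names pickled_size_mib and looks_too_large are hypothetical, and the 128 MiB limit assumes the default value of spark.rpc.message.maxSize.

```python
import pickle

# Assumption: the cluster uses the default spark.rpc.message.maxSize of 128 (MiB).
DEFAULT_MAX_MIB = 128


def pickled_size_mib(obj):
    """Return the size of obj in MiB when serialized with pickle (rough estimate only)."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)) / (1024 * 1024)


def looks_too_large(obj, limit_mib=DEFAULT_MAX_MIB):
    """Return True if the pickled object alone would exceed the given limit."""
    return pickled_size_mib(obj) > limit_mib


# Stand-in for a large list that would otherwise be passed to parallelize.
sample = list(range(1_000_000))
print(f"{pickled_size_mib(sample):.2f} MiB, too large: {looks_too_large(sample)}")
```

If the estimate is anywhere near the limit, the object is a likely cause of the error and a candidate for removal, broadcasting, or splitting across more partitions.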

If so, we recommend first auditing your code to remove the large objects, or using broadcast variables instead. If that does not resolve the error, you can increase the number of partitions so that the large list is split into many smaller ones, which reduces the Spark RPC message size.

Here are examples for Python and Scala:

Python

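largeList = [...]  # This is a large list
partitionNum = 100  # Increase this number if necessary
rdd = sc.parallelize(largeList, partitionNum)
# PySpark has no Dataset API, so convert to a DataFrame with toDF() instead of toDS().
# toDF() expects each element to be a record such as a tuple or Row.
df = rdd.toDF()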

Scala

import spark.implicits._ // Needed for toDS(); harmless if the notebook already imports it

val largeList = Seq(...) // This is a large list
val partitionNum = 100 // Increase this number if necessary
val rdd = sc.parallelize(largeList, partitionNum)
val ds = rdd.toDS()
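The examples above start from a partitionNum of 100 and leave it to you to increase it. One way to pick a starting value is to estimate it from the size of the list. The sketch below is an assumption-laden heuristic, not a Databricks recommendation: it measures the list with pickle, assumes the elements are roughly equal in size, and targets half of the limit per partition. The function name suggest_partition_num and the safety parameter are hypothetical.

```python
import math
import pickle


def suggest_partition_num(items, max_mib=128, safety=0.5):
    """Suggest a partition count so each slice stays under safety * max_mib.

    Assumptions: pickle size approximates the serialized size, elements are
    roughly equal in size, and max_mib matches spark.rpc.message.maxSize.
    """
    if not items:
        return 1
    total_bytes = len(pickle.dumps(items, protocol=pickle.HIGHEST_PROTOCOL))
    budget_bytes = max_mib * 1024 * 1024 * safety
    return max(1, math.ceil(total_bytes / budget_bytes))


# Stand-in for the large list; pass the result as the second argument to parallelize.
large_list = list(range(1_000_000))
partition_num = suggest_partition_num(large_list)
print(partition_num)
```

Treat the result as a lower bound. If the error persists, keep increasing the number as the examples suggest.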

R users must increase the partition number by setting the Spark configuration spark.default.parallelism to a higher value when the cluster is created. You cannot change this configuration after cluster creation.
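The broadcast-variable alternative mentioned earlier can be sketched in Python as follows. This is an illustrative example under stated assumptions rather than code from the original article: the function count_matches and its parameters are hypothetical, sc is the SparkContext that a Databricks notebook provides, and the list of keys is assumed to be small enough to parallelize without triggering the error.

```python
def count_matches(sc, lookup, keys, num_partitions=100):
    """Count how many keys appear in lookup, sharing lookup through a broadcast variable."""
    # Broadcast the large object once instead of capturing it in every task closure.
    lookup_bc = sc.broadcast(lookup)
    return (
        sc.parallelize(keys, num_partitions)
        # Executors read the broadcast data through .value inside the closure.
        .filter(lambda k: k in lookup_bc.value)
        .count()
    )


# In a notebook where sc is predefined, a call might look like:
# matches = count_matches(sc, big_lookup_set, keys_to_check)
```

The point of the pattern is that the closure only carries a small handle to the broadcast variable, so the large object no longer counts toward the serialized task size.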