Cannot grow BufferHolder; exceeds size limitation

Problem

Your Apache Spark job fails with an IllegalArgumentException: Cannot grow BufferHolder error.

java.lang.IllegalArgumentException: Cannot grow BufferHolder by size XXXXXXXXX because the size after growing exceeds size limitation 2147483632

Cause

BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB).

If a column value exceeds this size, Spark throws the exception. This can happen with aggregate functions such as collect_list, which accumulate all matching values into a single column value.

This example code creates millions of rows containing the same long string value, then uses collect_list to gather them into a single array. The aggregated value exceeds the maximum size of BufferHolder, so it returns an IllegalArgumentException: Cannot grow BufferHolder error when run in a notebook.

import org.apache.spark.sql.functions._

// Create 10 million rows, all with the same long string value, then
// collect every row's string into one array. The single aggregated
// array exceeds the 2147483632-byte BufferHolder limit.
spark.range(10000000)
  .withColumn("id1", lit("jkdhdbjasdshdjkqgdkdkasldksashjckabacbaskcbakshckjasbc$%^^&&&&&*jxcfdkwbfkjwdqndlkjqslkndskbndkjqbdjkbqwjkdbxnsa xckqjwbdxsabvnxbaskxqbhwdhqjskdjxbqsjdhqkjsdbkqsjdhqkjsdbkqsjdkjqdhkjqsabcxns ckqjdkqsbcxnsab ckjqwbdjckqscx ns csjhdjkqsdhjkqshdjsdhqksjdhxqkjshjkshdjkqsdhkjqsdhjqskxb kqscbxkjqsc"))
  .groupBy("id1")
  .agg(collect_list("id1").alias("days"))
  .show()

Solution

You must ensure that no column value exceeds 2147483632 bytes. This may require changing how you process data in your notebook.

Looking at our example code, using collect_set instead of collect_list resolves the issue and allows the example to run to completion. This single change works because the example data set contains a large number of duplicate entries; collect_set stores each distinct value only once.

import org.apache.spark.sql.functions._

// Same data as before, but collect_set deduplicates the values, so
// the aggregated array stays far below the BufferHolder limit.
spark.range(10000000)
  .withColumn("id1", lit("jkdhdbjasdshdjkqgdkdkasldksashjckabacbaskcbakshckjasbc$%^^&&&&&*jxcfdkwbfkjwdqndlkjqslkndskbndkjqbdjkbqwjkdbxnsa xckqjwbdxsabvnxbaskxqbhwdhqjskdjxbqsjdhqkjsdbkqsjdhqkjsdbkqsjdkjqdhkjqsabcxns ckqjdkqsbcxnsab ckjqwbdjckqscx ns csjhdjkqsdhjkqshdjsdhqksjdhxqkjshjkshdjkqsdhkjqsdhjqskxb kqscbxkjqsc"))
  .groupBy("id1")
  .agg(collect_set("id1").alias("days"))
  .show()

If using collect_set does not keep the column value below the BufferHolder limit of 2147483632 bytes, the IllegalArgumentException: Cannot grow BufferHolder error still occurs. In that case, you must split the data into multiple DataFrames and write each one out as a separate file.
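As a minimal sketch of that approach, you can assign each row to a bucket and aggregate each bucket separately, so no single collected value has to hold the entire data set. The bucket count, column names, and output path below are hypothetical; choose numBuckets so that each bucket's aggregated value stays well under the limit.

```scala
import org.apache.spark.sql.functions._

// Hypothetical bucket count; size it so each bucket's collected
// list stays well below the 2147483632-byte BufferHolder limit.
val numBuckets = 8

// Assign each row to a bucket instead of aggregating everything
// into one value.
val df = spark.range(10000000)
  .withColumn("bucket", col("id") % numBuckets)

// Aggregate and write each bucket out as a separate file.
(0 until numBuckets).foreach { b =>
  df.filter(col("bucket") === b)
    .agg(collect_list("id").alias("ids"))
    .write
    .mode("overwrite")
    .parquet(s"/tmp/split_output/bucket_$b")  // hypothetical path
}
```

This keeps each write small and independent; you can tune numBuckets up if any single bucket is still too large.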