Problem
When you try to save your data, your Apache Spark job fails with the following error:
Caused by: java.io.IOException: Compressed buffer size exceeds 2147483647. The size of individual input values might be too large. Lower page/block row size checks to write data more often
at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:83)
at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
Cause
You have individual records that exceed the 2 GB (2147483647 bytes) compressed buffer size limit. The Parquet writer groups records together and checks the accumulated block size to determine when to close the row group. Because this size check only runs after a minimum number of records have been buffered, a single oversized record can push the compressed buffer past the limit before the next check runs.
Solution
- Navigate to your cluster.
- Click Advanced options.
- In the Spark config box under the Spark tab, add the following configuration settings to adjust how often the Parquet writer checks the page and block sizes.
spark.hadoop.parquet.page.size.row.check.max 1
spark.hadoop.parquet.page.size.row.check.min 1
spark.hadoop.parquet.block.size.row.check.max 1
spark.hadoop.parquet.block.size.row.check.min 1
These configurations control how many records the Parquet writer buffers between page and row group size checks. Setting them to 1 makes the writer check the sizes after every record, so data is written out before the compressed buffer can exceed the 2 GB limit. The default value for these configurations is 10.
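If you manage the SparkSession yourself (for example, in a standalone PySpark job rather than through the cluster UI), the same settings can be applied programmatically. The following is a minimal sketch; the application name and the input and output paths are placeholders, not values from this article.

from pyspark.sql import SparkSession

# Apply the same Parquet row-count-check settings at session creation time.
# The spark.hadoop.* prefix copies each setting into the Hadoop Configuration
# used by the Parquet writer.
spark = (
    SparkSession.builder
    .appName("parquet-large-record-write")  # placeholder application name
    .config("spark.hadoop.parquet.page.size.row.check.max", "1")
    .config("spark.hadoop.parquet.page.size.row.check.min", "1")
    .config("spark.hadoop.parquet.block.size.row.check.max", "1")
    .config("spark.hadoop.parquet.block.size.row.check.min", "1")
    .getOrCreate()
)

# With the checks running after every record, pages and row groups are
# flushed before a single large record can overflow the compressed buffer.
df = spark.read.json("/path/to/source")  # placeholder input path
df.write.mode("overwrite").parquet("/path/to/output")  # placeholder output path

On Databricks, setting these values in the cluster Spark config, as described in the steps above, is the reliable approach, because the SparkSession already exists by the time your notebook code runs.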