Apache Spark job failing with GC overhead limit exceeded error

Analyze JOIN columns and deduplicate JOIN keys.

Written by nelavelli.durganagajahnavi

Last published at: April 30th, 2025

Problem

When you try to execute an Apache Spark job, it fails with a GC overhead limit exceeded error.  

java.lang.OutOfMemoryError: GC overhead limit exceeded

The excessive GC triggered by the join conditions results in the job being aborted with an error similar to the following.

Exception: Job aborted due to stage failure: Task in stage failed X times, most recent failure: Lost task in stage (TID) (executor): java.lang.OutOfMemoryError: GC overhead limit exceeded


Cause

NULL values in join keys can result in a high number of unmatched rows, which in turn increases memory usage. Duplicate records in join keys can increase the amount of data that needs to be shuffled and sorted during a sort-merge join.


These factors can lead to excessive memory usage and extended garbage collection (GC) activity, which ultimately triggers the java.lang.OutOfMemoryError: GC overhead limit exceeded error.


Solution

First, analyze join columns. Check for NULL values and duplicates in your join keys to identify potential sources of the error.

SELECT COUNT(*) FROM <your-dataset-1> WHERE <your-join-key-1> IS NULL OR <your-join-key-2> IS NULL;
SELECT COUNT(*) FROM <your-dataset-2> WHERE <your-join-key-1> IS NULL OR <your-join-key-2> IS NULL;
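If you prefer to run equivalent checks in PySpark, a minimal sketch follows. It assumes your data is already loaded as DataFrames df1 and df2 and uses the same placeholder column names; it also counts duplicated join-key combinations, which the SQL above does not cover.

from pyspark.sql import functions as F

# Count rows in df1 where either join key is NULL.
df1.filter(F.col("<your-join-key-1>").isNull() | F.col("<your-join-key-2>").isNull()).count()

# Count join-key combinations in df1 that appear more than once. Repeat both checks for df2.
df1.groupBy("<your-join-key-1>", "<your-join-key-2>").count().filter(F.col("count") > 1).count()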


Then, deduplicate join keys. Minimize data size during joins by removing duplicate rows based on the join keys.

df1 = df1.dropDuplicates(["<your-join-key-1>", "<your-join-key-2>"])
df2 = df2.dropDuplicates(["<your-join-key-1>", "<your-join-key-2>"])
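If the analysis also shows many NULL join keys and your use case allows discarding those rows, one common mitigation is to filter them out before joining, because NULL keys never match and only add to the data that is shuffled. The following is a sketch under that assumption; adjust the join type and column names to your workload.

from pyspark.sql import functions as F

# Keep only rows whose join keys are populated (only if dropping NULL-keyed rows is acceptable).
df1 = df1.filter(F.col("<your-join-key-1>").isNotNull() & F.col("<your-join-key-2>").isNotNull())
df2 = df2.filter(F.col("<your-join-key-1>").isNotNull() & F.col("<your-join-key-2>").isNotNull())

# Join on the deduplicated, NULL-free keys.
result = df1.join(df2, on=["<your-join-key-1>", "<your-join-key-2>"], how="inner")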