Problem
When you try to execute an Apache Spark job, it fails with a java.lang.OutOfMemoryError: GC overhead limit exceeded error. The excessive garbage collection (GC) triggered by join conditions causes the job to be aborted with an error such as:
Exception: Job aborted due to stage failure: Task in stage failed X times, most recent failure: Lost task in stage (TID) (executor): java.lang.OutOfMemoryError: GC overhead limit exceeded
Cause
NULL values in join keys can produce a large number of unmatched rows, which in turn increases memory usage. Duplicate records in join keys can increase the amount of data that must be shuffled and sorted during a sort-merge join.
These factors can lead to excessive memory usage and extended GC activity, which ultimately triggers the java.lang.OutOfMemoryError: GC overhead limit exceeded error.
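To see why duplicate keys inflate a join, consider a small illustration (hypothetical data, not from your workload): with n matching rows per key on each side, an inner join produces n * n output rows for that key.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: both sides have three rows sharing the same key value.
left = spark.createDataFrame([(1, "a"), (1, "b"), (1, "c")], ["k", "l_val"])
right = spark.createDataFrame([(1, "x"), (1, "y"), (1, "z")], ["k", "r_val"])

# The inner join produces 3 * 3 = 9 rows from only 3 rows per side,
# multiplying the data shuffled and sorted during a sort-merge join.
print(left.join(right, "k").count())  # 9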
Solution
First, analyze join columns. Check for NULL values and duplicates in your join keys to identify potential sources of the error.
SELECT COUNT(*) FROM <your-dataset-1> WHERE <your-join-key-1> IS NULL OR <your-join-key-2> IS NULL;
SELECT COUNT(*) FROM <your-dataset-2> WHERE <your-join-key-1> IS NULL OR <your-join-key-2> IS NULL;
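The queries above only surface NULL values. To check for duplicates, you can count join-key combinations that appear more than once. A minimal sketch using the DataFrame API, assuming df1 and df2 are the DataFrames being joined and the placeholder column names match your join keys:
from pyspark.sql import functions as F

# Show join-key combinations that occur more than once in each dataset.
df1.groupBy("<your-join-key-1>", "<your-join-key-2>").count().filter(F.col("count") > 1).show()
df2.groupBy("<your-join-key-1>", "<your-join-key-2>").count().filter(F.col("count") > 1).show()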
Then, deduplicate join keys. Minimize data size during joins by removing duplicate rows based on the join keys.
df1 = df1.dropDuplicates(["<your-join-key-1>", "<your-join-key-2>"])
df2 = df2.dropDuplicates(["<your-join-key-1>", "<your-join-key-2>"])
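If the NULL checks show many rows with NULL join keys and those rows are not needed in the join result, you can also filter them out before the join. A minimal sketch, assuming df1 and df2 are the DataFrames being joined and the placeholder column names match your join keys:
# Drop rows whose join keys are NULL so they do not add unmatched rows to the join.
df1 = df1.dropna(subset=["<your-join-key-1>", "<your-join-key-2>"])
df2 = df2.dropna(subset=["<your-join-key-1>", "<your-join-key-2>"])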