Problem
Typically, to accommodate a memory-intensive workload and avoid out-of-memory (OOM) errors, you scale up the cluster node's memory.
Note
If you’re looking for more information on scaling up, review the knowledge base article, Spark job fails with Driver is temporarily unavailable.
After scaling up, you notice the driver still fails with an unexpected stop and restart message.
The spark driver has stopped unexpectedly and is restarting.
While investigating, you notice a high frequency of garbage collection (GC) events, which you can verify in the driver's log4j logs.
24/11/07 00:32:45 WARN DBRDebuggerEventReporter: Driver/10.XX.XX.XX paused the JVM process 81 seconds during the past 120 seconds (67.76%) because of GC. We observed 3 such issue(s) since 2024-11-07T00:26:26.301Z.
Additionally, the driver's stdout logs may show full GC messages.
[Full GC (Metadata GC Threshold) [PSYoungGen: 38397K->0K(439808K)] [ParOldGen: 351239K->108115K(1019904K)] 389636K->108115K(1459712K), [Metaspace: 252946K->252852K(1290240K)], 46.2830875 secs] [Times: user=0.74 sys=0.76, real=46.28 secs]
Cause
When scaling up a cluster's memory doesn't solve the problem, the cause isn't a lack of available memory. Instead, it's a GC issue, which is often silent.
GC pauses the driver's Java virtual machine (JVM) applications. The more memory that is available, the longer GC takes to scan all objects and free it. These long pauses can lead to a forced restart of the machine.
Solution
Refactor your code to use less memory at once. Use Apache Spark's parallel processing, which distributes the load across the cluster's nodes and helps avoid memory issues on the driver.
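For example, the following minimal sketch keeps the heavy transformation on the executors instead of collecting data to the driver. It assumes a Databricks notebook where the spark session is predefined; the paths and column names are hypothetical placeholders.

# Minimal sketch: keep heavy work on the executors instead of the driver.
# The input/output paths and column names are placeholders for illustration.
from pyspark.sql import functions as F

df = spark.read.parquet("/path/to/large_dataset")  # hypothetical input path

# Transform and aggregate with Spark API calls, which run distributed
# across the cluster's nodes instead of materializing data on the driver.
result = (
    df.withColumn("value_squared", F.col("value") * F.col("value"))
      .groupBy("category")
      .agg(F.sum("value_squared").alias("total"))
)

# Avoid collect() on large outputs; write the result out instead.
result.write.mode("overwrite").parquet("/path/to/output")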
If your workload doesn't use the Spark API, Databricks recommends partitioning that non-Spark work as well. Iterate over objects and reuse references instead of instantiating many heavy objects at the same time during processing, as in the sketch below.
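The following plain Python sketch shows one way to bound how much memory is used at once by processing items in small batches and reusing a single buffer; load_item and process_batch are hypothetical helpers used only for illustration.

# Minimal sketch (plain Python, no Spark API): process items in small batches
# and reuse one buffer instead of building every heavy object up front.
# load_item and process_batch are hypothetical helpers.

def process_all(item_ids, batch_size=100):
    buffer = []                      # single reusable list reference
    for item_id in item_ids:
        buffer.append(load_item(item_id))
        if len(buffer) >= batch_size:
            process_batch(buffer)    # handle a bounded amount of memory at once
            buffer.clear()           # release references so GC stays cheap
    if buffer:
        process_batch(buffer)        # flush the final partial batch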