Problem
Typically, to accommodate a memory-intensive workload and avoid out-of-memory (OOM) errors, you scale up the cluster node's memory.
Note
If you’re looking for more information on scaling up, review the knowledge base article, Spark job fails with Driver is temporarily unavailable.
After scaling up, you notice the driver still fails with an unexpected stop and restart message.
The spark driver has stopped unexpectedly and is restarting.
While investigating, you notice a high frequency of garbage collection (GC) events, which you can verify in the driver's log4j logs.
24/11/07 00:32:45 WARN DBRDebuggerEventReporter: Driver/10.XX.XX.XX paused the JVM process 81 seconds during the past 120 seconds (67.76%) because of GC. We observed 3 such issue(s) since 2024-11-07T00:26:26.301Z.
Additionally, the driver's stdout logs may show full GC messages.
[Full GC (Metadata GC Threshold) [PSYoungGen: 38397K->0K(439808K)] [ParOldGen: 351239K->108115K(1019904K)] 389636K->108115K(1459712K), [Metaspace: 252946K->252852K(1290240K)], 46.2830875 secs] [Times: user=0.74 sys=0.76, real=46.28 secs]
Cause
When scaling up a cluster's memory doesn't solve the problem, the cause isn't a lack of available memory. Instead, it's a GC issue, which is often silent.
GC pauses the driver's Java virtual machine (JVM) applications. The more memory that is available, the longer GC takes to scan all objects and free it. These long pauses can lead to a forced restart of the machine.
Solution
Refactor your code to use less memory at once. Use Apache Spark's parallel processing, which distributes the load across the cluster's nodes and helps avoid memory issues on the driver.
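For example, the following minimal sketch keeps the heavy transformation on the executors instead of collecting data to the driver. It assumes a Databricks notebook where the spark session is predefined; the paths and column names are hypothetical placeholders.

# Minimal sketch: keep heavy work on the executors instead of the driver.
# The input/output paths and column names are placeholders for illustration.
from pyspark.sql import functions as F

df = spark.read.parquet("/path/to/large_dataset")  # hypothetical input path

# Transform and aggregate with Spark API calls, which run distributed
# across the cluster's nodes instead of materializing data on the driver.
result = (
    df.withColumn("value_squared", F.col("value") * F.col("value"))
      .groupBy("category")
      .agg(F.sum("value_squared").alias("total"))
)

# Avoid collect() on large outputs; write the result out instead.
result.write.mode("overwrite").parquet("/path/to/output")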
If your workload doesn't use the Spark API, Databricks recommends partitioning that non-Spark work as well. Iterate over objects and reuse references instead of instantiating many heavy objects at the same time during processing, as in the sketch below.
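The following plain Python sketch shows one way to bound how much memory is used at once by processing items in small batches and reusing a single buffer; load_item and process_batch are hypothetical helpers used only for illustration.

# Minimal sketch (plain Python, no Spark API): process items in small batches
# and reuse one buffer instead of building every heavy object up front.
# load_item and process_batch are hypothetical helpers.

def process_all(item_ids, batch_size=100):
    buffer = []                      # single reusable list reference
    for item_id in item_ids:
        buffer.append(load_item(item_id))
        if len(buffer) >= batch_size:
            process_batch(buffer)    # handle a bounded amount of memory at once
            buffer.clear()           # release references so GC stays cheap
    if buffer:
        process_batch(buffer)        # flush the final partial batch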