Databricks spark-submit jobs appear to “hang” and clusters do not auto-terminate

Embed a System.exit call in your application to shut down the Java Virtual Machine with exit code 0.

Written by shubham.chhabra

Last published at: September 12th, 2024

Problem 

Databricks spark-submit jobs appear to “hang,” either after the user class’s main method completes (for Java and Scala jobs, regardless of success or failure) or after the Python script exits (for Python jobs). Clusters do not auto-terminate in this case.

Cause 

For Java, Scala, and Python programs, Spark does not automatically call System.exit when the user application finishes.

For context, the Java Virtual Machine initiates its shutdown sequence in response to one of three events (see the sketch after this list).

  1. When the number of live non-daemon threads drops to zero for the first time. 
  2. When the Runtime.exit or System.exit method is called for the first time.
  3. When an external event occurs, such as an interrupt or a signal received from the operating system.
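
The following Scala sketch (illustrative only; the NonDaemonExample object and its worker thread are not part of any real job) shows the first of these events in action: a single live non-daemon thread keeps the JVM running even after main returns, which is why a spark-submit job can appear to hang until System.exit is called.

object NonDaemonExample {
  def main(args: Array[String]): Unit = {
    // A live non-daemon thread is enough to keep the JVM alive after main() returns
    val worker = new Thread(() => while (true) Thread.sleep(1000))
    worker.setDaemon(false) // non-daemon, so the JVM will not begin its shutdown sequence while it runs
    worker.start()
    println("main() is about to return, but the JVM keeps running")
    // Calling System.exit(0) here would start the shutdown sequence immediately
  }
}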

Solution

Embed a System.exit call in your application to shut down the Java Virtual Machine: exit with code 0 on success, and with a non-zero code on failure.

Examples

Python

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext  # Or otherwise obtain a handle to the SparkContext
runTheRestOfTheUserCode()
# Fall through to exit with code 0 on success. A failure raises an uncaught exception,
# never reaches the exit(0) below, and therefore produces a non-zero exit code that is
# handled by PythonRunner.
sc._gateway.jvm.System.exit(0)
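
Note that the exit call is made through the driver JVM (sc._gateway.jvm.System.exit(0)) rather than Python’s sys.exit, because it is the JVM, not the Python process, that must begin its shutdown sequence before the cluster can terminate.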

Scala

def main(args: Array[String]): Unit = {
  try {
    runTheRestOfTheUserCode()
  } catch {
    case t: Throwable =>
      try {
        // Log the throwable or error here
      } finally {
        System.exit(1)  // Non-zero exit code signals failure
      }
  }
  System.exit(0)  // Success: shut down the JVM so the cluster can auto-terminate
}
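
The same pattern applies to Java jobs: call System.exit(0) as the last statement of main, and System.exit(1) (or another non-zero code) from the failure path.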

Long-term fix

You can also track Spark’s long-term fix at SPARK-48547.