Problem
Databricks spark-submit jobs appear to “hang”, either after the user class’s main method completes (regardless of success or failure, for Java/Scala jobs) or after the Python script exits (for Python jobs). Clusters do not auto-terminate in this case.
Cause
For Java, Scala, and Python programs, Spark does not automatically call System.exit.
For context, the Java Virtual Machine initiates its shutdown sequence in response to one of three events:
- The number of live non-daemon threads drops to zero for the first time.
- The Runtime.exit or System.exit method is called for the first time.
- An external event occurs, such as an interrupt or a signal received from the operating system.
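The first of these events is the one at play here: Spark never calls System.exit, and the driver JVM typically still has live non-daemon threads when the user code returns, so the shutdown sequence never begins and the job appears to hang. The following standalone Scala sketch (not Spark-specific; the object and thread names are purely illustrative) shows the same mechanism in isolation: a single non-daemon thread keeps the JVM alive after main finishes.

object NonDaemonThreadExample {
  def main(args: Array[String]): Unit = {
    // A non-daemon thread that never finishes its work.
    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          Thread.sleep(1000)
        }
      }
    })
    worker.setDaemon(false) // non-daemon is also the default for threads started from main
    worker.start()

    println("main is about to return, but the JVM will keep running")
    // Uncommenting the next line would force the shutdown sequence to start anyway:
    // System.exit(0)
  }
}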
Solution
Embed a System.exit call in your application to shut down the Java Virtual Machine with exit code 0 once your code completes successfully.
Examples
Python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext  # Or otherwise obtain a handle to the SparkContext

runTheRestOfTheUserCode()  # Placeholder for your application logic

# Fall through to exit with code 0 on success. On failure, the uncaught exception never reaches
# this line, so PythonRunner reports a non-zero exit code instead.
sc._gateway.jvm.System.exit(0)
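Note that the exit call goes through the SparkContext’s Py4J gateway rather than Python’s own sys.exit. A plain sys.exit(0) only ends the Python process; the hang described above comes from the driver JVM staying alive, so the exit has to be issued inside the JVM via sc._gateway.jvm.System.exit(0).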
Scala
def main(args: Array[String]): Unit = {
  try {
    runTheRestOfTheUserCode()
  } catch {
    case t: Throwable =>
      try {
        // Log the throwable or error here
      } finally {
        System.exit(1)
      }
  }
  System.exit(0)
}
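In this pattern, System.exit(0) after the try/catch shuts the driver JVM down on success, while the inner try/finally guarantees that System.exit(1) is reached even if the error logging itself throws, so failures still surface as a non-zero exit code.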
Long-term fix
You can also track Spark’s long-term fix at SPARK-48547.