Apache Spark UI is not in sync with job

Status of Spark jobs gets out of sync with the Spark UI when events drop from the event queue before being processed.

Written by chetan.kardekar

Last published at: February 27th, 2023

Problem

The status of your Spark jobs is not correctly shown in the Spark UI (AWS | Azure | GCP). Some of the jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank.

When you review the driver logs, you see an AsyncEventQueue warning.

Logs
=====
20/12/23 21:20:26 WARN AsyncEventQueue: Dropped 93909 events from shared since Wed Dec 23 21:19:26 UTC 2020. 
20/12/23 21:21:26 WARN AsyncEventQueue: Dropped 52354 events from shared since Wed Dec 23 21:20:26 UTC 2020. 
20/12/23 21:22:26 WARN AsyncEventQueue: Dropped 94137 events from shared since Wed Dec 23 21:21:26 UTC 2020. 
20/12/23 21:23:26 WARN AsyncEventQueue: Dropped 44245 events from shared since Wed Dec 23 21:22:26 UTC 2020. 
20/12/23 21:24:26 WARN AsyncEventQueue: Dropped 126763 events from shared since Wed Dec 23 21:23:26 UTC 2020.
20/12/23 21:25:26 WARN AsyncEventQueue: Dropped 94156 events from shared since Wed Dec 23 21:24:26 UTC 2020. 
Delete

Info

This is related to the Apache Spark UI shows wrong number of jobs KB article.

Cause

  • All Spark jobs, stages, and tasks are pushed to the event queue.
  • The backend listener reads the Spark UI events from this queue and renders the Spark UI.
  • The default capacity of the event queue (spark.scheduler.listenerbus.eventqueue.capacity) is 20000.

If more events are pushed to the event queue than the backend listener can consume, the oldest events get dropped from the queue and the listener never consumes them.

These events are lost and do not get rendered in the Spark UI.

Solution

Set the value of spark.scheduler.listenerbus.eventqueue.capacity in your cluster’s Spark config (AWS | Azure | GCP) at cluster level to a value greater than 20000.

This value sets the capacity for the app status event queue, which holds events for internal application status listeners. Increasing this value allows the event queue to hold a larger number of events, but may result in the driver using more memory.

Was this article helpful?