Problem
Regularly recurring Apache Spark jobs vary in duration despite using the same cluster configuration and approximately the same data set size. You notice that run-to-run duration can differ by a factor of two or more, yet the execution plan is identical for every run and the cluster metrics don't indicate any performance issue.
Cause
Your chosen cluster type initially provides a local solid state drive (SSD). However, when the disk expands, the cluster sometimes uses a hard disk drive (HDD) instead of an SSD. Because an HDD is slower than an SSD, local disk data processing is slower, which increases job duration. The cluster metrics don't report local disk throughput, so the slowdown is not visible there.
Solution
For Spark jobs with unexpectedly long durations, navigate to the job's cluster details page and review the event log for disk expansion events.
Compare the event log to the logs of runs with normal durations. If disk expansion appears only in the longer runs, the expansion disks are HDDs.
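The event-log check can be scripted against the event list returned by the Databricks Clusters API. The sketch below is an assumption-laden illustration: the event-type string `EXPANDED_DISK` and the sample event shape are placeholders to verify against the actual events in your workspace.

```python
# Hypothetical sketch: scan cluster events (shaped like responses from the
# Databricks Clusters API events endpoint) for disk-expansion events.
# The "EXPANDED_DISK" event-type string is an assumption; confirm the exact
# wording in your own cluster's event log.
def find_disk_expansions(events):
    """Return the timestamps of any disk-expansion events."""
    return [
        e["timestamp"]
        for e in events
        if e.get("type") == "EXPANDED_DISK"
    ]

# Example events (placeholder data, millisecond epoch timestamps).
sample_events = [
    {"type": "RUNNING", "timestamp": 1700000000000},
    {"type": "EXPANDED_DISK", "timestamp": 1700000360000},
]
print(find_disk_expansions(sample_events))  # → [1700000360000]
```

If this returns timestamps only for the slow runs, that supports the HDD-expansion diagnosis described above.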
After confirming disk expansion only occurs in the jobs with longer durations:
- Ensure that the cluster type has sufficient local SSD storage to avoid the need for disk expansion. Select a cluster configuration that provides ample SSD storage.
- Regularly monitor your clusters’ disk usage to ensure that they are not expanding to HDDs. You can use the Databricks UI or use your cloud provider’s tools for tracking disk types and usage.
Where possible, optimize your data storage to reduce the need for disk expansion. This can involve compressing data, partitioning data more effectively, running OPTIMIZE regularly on Delta tables, or using a more compact compression codec.
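The Delta table maintenance mentioned above can be scheduled as a small notebook job. The helper below only builds the OPTIMIZE statement; the table and column names are placeholders, and running the statement assumes a SparkSession with Delta Lake configured (as in a Databricks notebook).

```python
# Hypothetical sketch: build an OPTIMIZE statement for a Delta table.
# Table and column names below are placeholders for illustration.
def build_maintenance_sql(table, zorder_cols=None):
    """Build an OPTIMIZE statement, optionally with ZORDER BY columns."""
    sql = f"OPTIMIZE {table}"
    if zorder_cols:
        sql += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return sql

# In a Databricks notebook you would run, for example:
#   spark.sql(build_maintenance_sql("sales.events", ["event_date"]))
print(build_maintenance_sql("sales.events", ["event_date"]))
```

Compacting small files this way keeps local disk usage lower during shuffles and reads, which reduces the chance that the cluster needs to expand onto an HDD.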