Problem
Regularly recurring Apache Spark jobs vary in duration despite using the same cluster configuration and approximately the same data set size. You notice that run-to-run duration can differ by a factor of two or more, yet the execution plan is identical for every run and the cluster metrics don't indicate any performance issue.
Cause
Your chosen cluster type initially provides a local solid state drive (SSD). However, when the disk expands, the cluster sometimes uses a hard disk drive (HDD) instead of an SSD. Because an HDD is slower than an SSD, local disk data processing is slower, which increases job duration. The cluster metrics don't report local disk throughput, so the slowdown is not visible there.
Solution
For Spark jobs with unexpectedly long durations, navigate to the job's cluster details page and review the event log for disk expansion events.
Compare the event log to the logs of runs with normal durations. If disk expansion appears only in the longer runs, the expansion disks are HDDs.
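The event-log check can be scripted against the event list returned by the Databricks Clusters API. The sketch below is an assumption-laden illustration: the event-type string `EXPANDED_DISK` and the sample event shape are placeholders to verify against the actual events in your workspace.

```python
# Hypothetical sketch: scan cluster events (shaped like responses from the
# Databricks Clusters API events endpoint) for disk-expansion events.
# The "EXPANDED_DISK" event-type string is an assumption; confirm the exact
# wording in your own cluster's event log.
def find_disk_expansions(events):
    """Return the timestamps of any disk-expansion events."""
    return [
        e["timestamp"]
        for e in events
        if e.get("type") == "EXPANDED_DISK"
    ]

# Example events (placeholder data, millisecond epoch timestamps).
sample_events = [
    {"type": "RUNNING", "timestamp": 1700000000000},
    {"type": "EXPANDED_DISK", "timestamp": 1700000360000},
]
print(find_disk_expansions(sample_events))  # → [1700000360000]
```

If this returns timestamps only for the slow runs, that supports the HDD-expansion diagnosis described above.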
After confirming disk expansion only occurs in the jobs with longer durations:
- Ensure that the cluster type has sufficient local SSD storage to avoid the need for disk expansion. Select a cluster configuration that provides ample SSD storage.
- Regularly monitor your clusters’ disk usage to ensure that they are not expanding to HDDs. You can use the Databricks UI or use your cloud provider’s tools for tracking disk types and usage.
Where possible, optimize your data storage to reduce the need for disk expansion. This can involve compressing data, partitioning data more effectively, running OPTIMIZE regularly on Delta tables, or using a more compact compression codec.
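The Delta table maintenance mentioned above can be scheduled as a small notebook job. The helper below only builds the OPTIMIZE statement; the table and column names are placeholders, and running the statement assumes a SparkSession with Delta Lake configured (as in a Databricks notebook).

```python
# Hypothetical sketch: build an OPTIMIZE statement for a Delta table.
# Table and column names below are placeholders for illustration.
def build_maintenance_sql(table, zorder_cols=None):
    """Build an OPTIMIZE statement, optionally with ZORDER BY columns."""
    sql = f"OPTIMIZE {table}"
    if zorder_cols:
        sql += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return sql

# In a Databricks notebook you would run, for example:
#   spark.sql(build_maintenance_sql("sales.events", ["event_date"]))
print(build_maintenance_sql("sales.events", ["event_date"]))
```

Compacting small files this way keeps local disk usage lower during shuffles and reads, which reduces the chance that the cluster needs to expand onto an HDD.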