Inconsistent job performance with shuffle-intensive workloads

Use an instance series with locally-attached NVMe SSD.

Last published at: June 4th, 2025

Problem

You are running a job on a general-purpose, compute-optimized, or memory-optimized instance series such as M5, M6i, C5, C6i, R5, R6i.

The job involves heavy shuffle read/write operations due to transformations such as groupByKey, join, distinct, repartition, reduceByKey, and sortBy. The job consequently exhibits inconsistent runtimes even with a consistent input data size.

Cause

The previously-listed instance series do not have locally-attached SSDs, instead relying on network-attached EBS volumes for shuffle storage.

The reliance on network I/O introduces additional latency and contention, especially during intensive shuffle operations, leading to non-deterministic execution times influenced by network bandwidth availability.

Solution

Use an instance series such as:

M5d
M6id
C5d
C6id
R5d
R6id

Any of these are equipped with locally-attached NVMe (Non-Volatile Memory Express) SSDs.

These instances series provide high-throughput, low-latency local storage, significantly improving shuffle performance and ensuring more stable and predictable job runtimes for shuffle-intensive workloads.

Databricks Help Center

Problem

Cause

Solution

Contact Us