Problem
You are running a job on a general-purpose, compute-optimized, or memory-optimized instance series such as M5, M6i, C5, C6i, R5, or R6i.
The job involves heavy shuffle read/write operations due to transformations such as groupByKey, join, distinct, repartition, reduceByKey, and sortBy. The job consequently exhibits inconsistent runtimes even with a consistent input data size.
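To confirm that runtimes really are inconsistent, it can help to put a number on the spread across repeated runs. The following is a minimal sketch, not part of any Spark or cloud API: the function name runtime_variability and the sample durations are hypothetical, and it assumes you have already collected the wall-clock duration of several runs over the same input size.

```python
from statistics import mean, stdev


def runtime_variability(runtimes_s):
    """Return the coefficient of variation (sample stdev / mean) of run durations."""
    if len(runtimes_s) < 2:
        raise ValueError("need at least two runs to measure variability")
    return stdev(runtimes_s) / mean(runtimes_s)


# Hypothetical durations, in seconds, for repeated runs on the same input size.
example_runs = [610.0, 905.0, 640.0, 1120.0, 655.0]
print(f"coefficient of variation: {runtime_variability(example_runs):.2f}")
```

A value close to zero suggests stable runtimes. A noticeably larger value on unchanged input is consistent with the symptom described here, although other causes of variance should be ruled out as well.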
Cause
The instance series listed above do not have locally attached SSDs and instead rely on network-attached EBS volumes for shuffle storage.
The reliance on network I/O introduces additional latency and contention, especially during intensive shuffle operations, leading to non-deterministic execution times influenced by network bandwidth availability.
Solution
Use an instance series such as:
- M5d
- M6id
- C5d
- C6id
- R5d
- R6id
Each of these series is equipped with locally attached NVMe (Non-Volatile Memory Express) SSDs.
These instance series provide high-throughput, low-latency local storage, which significantly improves shuffle performance and gives more stable, predictable runtimes for shuffle-intensive workloads.
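As an illustration of the substitution, the sketch below maps an instance type from one of the series named in the Problem section to its counterpart from the list above. The helper suggest_local_nvme_type and the LOCAL_NVME_EQUIVALENT table are hypothetical names introduced here, not part of any SDK. The sketch also assumes that the same size (for example, 2xlarge) is offered in the target series, which you should verify for your region before changing a cluster configuration.

```python
# Series pairs taken from this article: network-attached storage -> local NVMe.
LOCAL_NVME_EQUIVALENT = {
    "m5": "m5d",
    "m6i": "m6id",
    "c5": "c5d",
    "c6i": "c6id",
    "r5": "r5d",
    "r6i": "r6id",
}


def suggest_local_nvme_type(instance_type):
    """Return the local-NVMe counterpart of an instance type, or None if unknown.

    Assumes the usual "<series>.<size>" form, for example "m5.2xlarge".
    """
    series, sep, size = instance_type.strip().lower().partition(".")
    if not sep or not series or not size:
        raise ValueError(f"expected '<series>.<size>', got {instance_type!r}")
    if series in LOCAL_NVME_EQUIVALENT.values():
        # Already one of the local-NVMe series listed in this article.
        return f"{series}.{size}"
    target = LOCAL_NVME_EQUIVALENT.get(series)
    # Series not covered by this article: make no suggestion.
    return f"{target}.{size}" if target else None


print(suggest_local_nvme_type("m5.2xlarge"))
```

The lookup is deliberately limited to the series named in this article. Other series are returned as None instead of being guessed from the instance name.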