Problem
You are running Spark SQL or PySpark code that uses broadcast hints. The code takes longer to run than on previous Databricks Runtime versions, fails with an out of memory error, or both.
Example code:
df.join(broadcast(bigDf)).write.mode('overwrite').parquet("path")
Error message:
If you check the execution plan under the SQL tab in the Apache Spark UI, it indicates the failure occurred during an executor broadcast join.
Cause
Executor-side broadcast join (EBJ) is an enhancement introduced in Databricks Runtime 11.3 LTS that changes how a broadcast join is carried out. Previously, broadcast joins relied on the Spark driver to broadcast one side of the join. While this approach worked, it presented the following challenges:
- Single point of failure: The driver, as the coordinator of all queries in a cluster, faced an increased risk of failure due to JVM out of memory errors when used for broadcasting across all concurrent queries. In such a case, all queries running simultaneously on the cluster could be affected and potentially fail.
- Limited flexibility in increasing default broadcast join thresholds: The risk of out of memory errors on the driver (due to concurrent broadcasts) made it difficult to increase the default broadcast join thresholds (currently 10MB static and 30MB dynamic), as it would add more memory pressure on the driver and elevate the risk of failure.
The EBJ enhancement addresses these challenges by shifting the memory pressure from the driver to the executors. This not only eliminates a single point of failure but also allows for increasing the default broadcast join thresholds across the board, thanks to the reduced risk of total cluster failure.
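The thresholds mentioned above can be inspected from a notebook. The following is a minimal sketch, not a definitive reference: it assumes the static threshold corresponds to spark.sql.autoBroadcastJoinThreshold and the dynamic threshold to spark.databricks.adaptive.autoBroadcastJoinThreshold, and it assumes a SparkSession named spark is already available. The helper name broadcast_thresholds is hypothetical.

```python
# Sketch only: the mapping of "static" and "dynamic" thresholds to these
# two setting names is an assumption, not something stated in this article.
THRESHOLD_KEYS = (
    "spark.sql.autoBroadcastJoinThreshold",                  # assumed: static threshold
    "spark.databricks.adaptive.autoBroadcastJoinThreshold",  # assumed: dynamic threshold
)


def broadcast_thresholds(spark):
    """Return the current value of each threshold setting.

    `spark` is expected to be a SparkSession. Settings that are not
    defined on the cluster are reported as the string "unset".
    """
    return {key: spark.conf.get(key, "unset") for key in THRESHOLD_KEYS}
```

In a notebook, print(broadcast_thresholds(spark)) shows the values in effect for the current session.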
Unfortunately, with EBJ enabled, queries that use an explicit broadcast hint may now fail. This is more likely with a larger data set.
The failure stems from how memory consumption was managed in the driver-side broadcast. The memory used by a driver-side broadcast is controlled by spark.driver.maxResultSize. In earlier versions, the memory used for the broadcast was not explicitly tracked or accounted for in the task memory manager, so a task could underestimate how much memory was in use. As a result, some large jobs failed with seemingly random out of memory errors, while others succeeded by chance.
With EBJ enabled, the memory allocated for the broadcast hash map is accurately tracked and deducted from the memory available to the task memory manager. This improvement reduces the risk of memory leaks and JVM out of memory errors. As a consequence, a large hinted broadcast that previously appeared to succeed may now fail, but overall reliability has improved.
Solution
Tracking memory usage in EBJ is essential for preventing JVM out of memory errors, which can lead to the failure of all concurrent queries on a node.
Instead of using broadcast hints, let adaptive query execution (AWS | Azure | GCP) select the join strategy that suits the workload, based on the size of the data being processed.
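As an illustration of this recommendation, the example from the Problem section can be rewritten without the broadcast() hint. This is a minimal sketch under stated assumptions: spark, df, and big_df are an existing SparkSession and DataFrames, the join column name "id" is hypothetical, and the helper names enable_aqe and join_without_hint are invented for this example. Adaptive query execution is normally enabled by default on recent runtimes, so the explicit setting is only a safeguard.

```python
# Sketch only: helper names and the "id" join column are hypothetical.

def enable_aqe(spark):
    """Make sure adaptive query execution is on for this session."""
    # spark.sql.adaptive.enabled is the standard Spark SQL switch for AQE.
    spark.conf.set("spark.sql.adaptive.enabled", "true")


def join_without_hint(df, big_df, key):
    """Join two DataFrames without wrapping either side in broadcast().

    With no hint, the optimizer chooses the join strategy based on the
    size of the data rather than being forced to broadcast big_df.
    """
    return df.join(big_df, on=key, how="inner")


# Example usage in a notebook (assumes spark, df, and big_df exist):
# enable_aqe(spark)
# joined = join_without_hint(df, big_df, "id")
# joined.write.mode("overwrite").parquet("path")
```

Calling joined.explain() before writing shows the physical plan, which makes it possible to confirm which join strategy was selected.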