Broadcast join exceeds threshold, returns out of memory error

Problem

You are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table.

Despite the total size exceeding the limit set by spark.sql.autoBroadcastJoinThreshold, BroadcastHashJoin is used and Apache Spark returns an OutOfMemorySparkException error.

org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=1073741824. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1

Cause

This is due to a limitation with Spark’s size estimator.

If the estimated size of one of the DataFrames is less than the autoBroadcastJoinThreshold, Spark may use BroadcastHashJoin to perform the join. If the available nodes do not have enough resources to accommodate the broadcast DataFrame, your job fails due to an out of memory error.

Solution

There are three different ways to mitigate this issue.

  • Use ANALYZE TABLE to collect details and compute statistics about the DataFrames before attempting a join.

  • Cache the table you are broadcasting.

    1. Run explain on your join command to return the physical plan.

      explain(<join command>)
      
    2. Review the physical plan. If the broadcast join returns BuildLeft, cache the left side table. If the broadcast join returns BuildRight, cache the right side table.

  • In Databricks Runtime 7.0 and above, set the join type to SortMergeJoin with join hints enabled.