You are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table.
Despite the total size exceeding the limit set by
BroadcastHashJoin is used and Apache Spark returns an
org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver.maxResultSize=1073741824. You can disable broadcasts for this query using set spark.sql.autoBroadcastJoinThreshold=-1
This is due to a limitation with Spark’s size estimator.
If the estimated size of one of the DataFrames is less than the
autoBroadcastJoinThreshold, Spark may use
BroadcastHashJoin to perform the join. If the available nodes do not have enough resources to accommodate the broadcast DataFrame, your job fails due to an out of memory error.
There are three different ways to mitigate this issue.
Use ANALYZE TABLE to collect details and compute statistics about the DataFrames before attempting a join.
Cache the table you are broadcasting.
explainon your join command to return the physical plan.
Review the physical plan. If the broadcast join returns BuildLeft, cache the left side table. If the broadcast join returns BuildRight, cache the right side table.
In Databricks Runtime 7.0 and above, set the join type to
SortMergeJoinwith join hints enabled.