Problem
You notice a query takes a long time to execute on all-purpose or jobs compute, so you add a broadcast hint. You set autoBroadcastJoinThreshold
to the value of the smaller table’s data size, but the query does not switch to a broadcast join.
Cause
By default, Databricks does not automatically broadcast a table if its size is larger than the autoBroadcastJoinThreshold
configuration.
If you have filtered the table and reduced the size, a table smaller than the autoBroadcastJoinThreshold
configuration may still not be broadcasted due to empty partitions.
Solution
To ensure Databricks switches to the broadcast join even with empty partitions, set the nonEmptyPartitionRatioForBroadcastJoin
configuration to 0.0
.
Then set the autoBroadcastJoinThreshold
configuration to a value larger than the filtered size of the smaller table. For example, if the filtered size is 300 MB
, set the configuration to a value greater than 300 MB.
Run the following code to set these configurations using a notebook. The value 309715200
is an example value over 300 MB. Replace this value with a value greater than the filtered size of your own smaller table.
```python
spark.conf.set("spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin", "0.0")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "309715200")
spark.conf.set("spark.databricks.adaptive.autoBroadcastJoinThreshold", "309715200")
```
Additional measures
You can optionally use a Photon cluster for faster scanning of the larger table.
To enable or disable Photon on all-purpose and jobs compute, select the Use Photon Acceleration checkbox in the compute UI. Photon is not enabled by default on computes created with the clusters API or jobs API. To enable Photon, you must set the runtime_engine
attribute to PHOTON
.
For more information, review the “Configure Photon enablement” section of the What is Photon? (AWS | Azure | GCP) documentation.