Problem
When using Databricks with Apache Sedona for spatial processing, you notice performing a left anti-join on spatial data often results in Apache Spark enforcing a BroadcastNestedLoopJoin. You notice significant query performance degradation, especially on large datasets.
Cause
Spark falls back to BroadcastNestedLoopJoin for left_anti joins when the join condition is non-equivalent (for example, spatial predicates like ST_Intersects
or ST_Contains
). Anti-joins are harder to optimize with spatial indexes, and Spark's optimizer may not efficiently push down spatial predicates for this join type.
Solution
Use an alternative pattern that combines a left outer join with a filter instead.
- Perform a left outer join between your two DataFrames or tables on the relevant keys.
- Filter the resulting rows to retain only those where the right-side (joined) key is
NULL
.
This is logically equivalent to a left anti-join.
Example
from pyspark.sql.functions import col
df1_alias = df1.alias("left")
df2_alias = df2.alias("right")
left_outer = df1_alias.join(df2_alias, on=["id"], how="left_outer")
replicated_anti = left_outer.filter(col("right.id").isNull()).select("left.id", "left.value")
replicated_anti.show()
Important
Verify you do not have NULL
values before implementing the solution. If the table used as a joining key has NULL
values, it can lead to a cross join.