Performing a left anti-join on spatial data leads to a performance bottleneck

Instead, perform a left outer join and then filter on the join key.

Written by shubham.bhusate

Last published at: August 8th, 2025

Problem

When using Databricks with Apache Sedona for spatial processing, you notice that performing a left anti-join on spatial data often results in Apache Spark enforcing a BroadcastNestedLoopJoin, causing significant query performance degradation, especially on large datasets.
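For example, writing the anti-join directly against a spatial predicate typically produces this plan. The snippet below is a minimal sketch rather than a definitive implementation: parcels and flood_zones are hypothetical DataFrames with a geometry column, and it assumes Sedona's SQL functions are already registered in the session.

from pyspark.sql.functions import expr

# Hypothetical inputs: parcels and flood_zones each carry a `geometry` column.
# The spatial predicate is not an equality condition, so Spark typically plans
# this anti-join as a BroadcastNestedLoopJoin.
non_flooded = parcels.alias("p").join(
    flood_zones.alias("f"),
    expr("ST_Intersects(p.geometry, f.geometry)"),
    "left_anti",
)

# Inspect the physical plan to confirm the join strategy.
non_flooded.explain()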


Cause

Spark falls back to BroadcastNestedLoopJoin for left_anti joins when the join condition is a non-equi condition (for example, spatial predicates such as ST_Intersects or ST_Contains). Anti-joins are harder to optimize with spatial indexes, and Spark's optimizer may not efficiently push down spatial predicates for this join type.
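By contrast, the same anti-join expressed on an equality key can be planned with a hash-based strategy. The comparison below is a rough sketch using hypothetical df1 and df2 DataFrames joined on an id column, alongside the hypothetical spatial DataFrames from the snippet above.

from pyspark.sql.functions import expr

# Equi anti-join: Spark can choose a hash-based strategy such as BroadcastHashJoin.
df1.join(df2, df1["id"] == df2["id"], "left_anti").explain()

# Non-equi (spatial) anti-join: the predicate is not an equality, so Spark
# typically falls back to BroadcastNestedLoopJoin.
parcels.alias("p").join(
    flood_zones.alias("f"),
    expr("ST_Intersects(p.geometry, f.geometry)"),
    "left_anti",
).explain()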


Solution

Use an alternative pattern that combines a left outer join with a filter instead.

  1. Perform a left outer join between your two DataFrames or tables on the relevant keys.
  2. Filter the resulting rows to retain only those where the right-side (joined) key is NULL.
    This is logically equivalent to a left anti-join.


Example 

from pyspark.sql.functions import col

df1_alias = df1.alias("left")
df2_alias = df2.alias("right")

# Join on an explicit condition so both sides keep their aliases and the
# right-side key column stays available for the NULL filter.
left_outer = df1_alias.join(df2_alias, col("left.id") == col("right.id"), "left_outer")

# Rows with no match on the right side have a NULL right-side key; keeping
# only those rows replicates a left anti-join.
replicated_anti = left_outer.filter(col("right.id").isNull()).select("left.id", "left.value")
replicated_anti.show()
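The same pattern applies when the join condition is a spatial predicate. The snippet below is a rough sketch under stated assumptions: parcels and flood_zones are hypothetical DataFrames with geometry columns, zone_id is a column on flood_zones that is never NULL (such as its primary key), and Sedona's SQL functions are registered in the session.

from pyspark.sql.functions import col, expr

# Left outer spatial join keeps every parcel; unmatched parcels get NULLs
# for all flood_zones columns.
joined = parcels.alias("p").join(
    flood_zones.alias("f"),
    expr("ST_Intersects(p.geometry, f.geometry)"),
    "left_outer",
)

# Keep only parcels that intersect no flood zone (right-side key is NULL).
parcels_outside = joined.filter(col("f.zone_id").isNull()).select("p.*")
parcels_outside.show()

Depending on your Sedona version, the left outer spatial join can be planned with Sedona's optimized spatial join strategy rather than the BroadcastNestedLoopJoin fallback described above, which is where the performance improvement comes from.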


Important

Verify your join key columns do not contain NULL values before implementing this solution. If the column used as the join key has NULL values, it can lead to a cross join.
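As a quick sanity check, you can count NULL join keys on both sides before relying on the NULL-filter pattern. A minimal sketch, reusing the hypothetical df1 and df2 from the example:

from pyspark.sql.functions import col

# Count NULL join keys on each side.
print(df1.filter(col("id").isNull()).count())
print(df2.filter(col("id").isNull()).count())

# If either count is non-zero, exclude or repair those rows before joining.
df1_clean = df1.filter(col("id").isNotNull())
df2_clean = df2.filter(col("id").isNotNull())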