Problem
When working with Apache Spark SQL or Databricks SQL, you want to use WHERE key NOT IN (SELECT key FROM other) to filter rows from one dataset that do not appear in another dataset. The query returns zero rows which is not correct.
However, when you use the following query, it returns the expected rows.
LEFT JOIN other ON a.key = other.key WHERE other.key IS NULL
(or) WHERE NOT EXISTS (SELECT 1 FROM other WHERE other.key = a.key)
Cause
The Spark SQL and Databricks SQL engines implement NOT IN as a null-aware anti join (NAAJ). In Spark’s optimized plan, you may see a predicate such as the following.
LeftAnti, (a.key = b.key) OR isnull(a.key = b.key)
This null-aware clause ensure every left row is considered matched if the right side has a NULL. As a result the anti-join eliminates the left row, matching SQL semantics for NOT IN.
Additional context
In general, column X NOT IN (subquery) is logically x != y1 AND x != y2 AND …. If any value in the subquery is NULL, then x != NULL is UNKNOWN. The entire AND chain becomes UNKNOWN, and the row is filtered out.
Solution
Use NULL-safe anti-join patterns or explicitly exclude NULLs when using NOT IN. You can adapt and use the following example code.
SELECT ...
FROM a
WHERE a.key IS NOT NULL
AND a.key NOT IN (SELECT key FROM b WHERE key IS NOT NULL);