SQL filters using NOT IN returning zero rows instead of expected number of rows

Use NULL-safe anti-join patterns or explicitly exclude NULLs when using NOT IN.

Written by anudeep.konaboina

Last published at: October 3rd, 2025

Problem

When working with Apache Spark SQL or Databricks SQL, you want to use WHERE key NOT IN (SELECT key FROM other) to filter rows from one dataset that do not appear in another dataset. The query returns zero rows which is not correct.

 

However, when you use the following query, it returns the expected rows.

LEFT JOIN other ON a.key = other.key WHERE other.key IS NULL
 (or) WHERE NOT EXISTS (SELECT 1 FROM other WHERE other.key = a.key)

 

Cause

The Spark SQL and Databricks SQL engines implement NOT IN as a null-aware anti join (NAAJ). In Spark’s optimized plan, you may see a predicate such as the following.

LeftAnti, (a.key = b.key) OR isnull(a.key = b.key)

 

This null-aware clause ensure every left row is considered matched if the right side has a NULL. As a result the anti-join eliminates the left row, matching SQL semantics for NOT IN.

 

Additional context

In general, column X NOT IN (subquery) is logically x != y1 AND x != y2 AND …. If any value in the subquery is NULL, then x != NULL is UNKNOWN. The entire AND chain becomes UNKNOWN, and the row is filtered out.

 

Solution

Use NULL-safe anti-join patterns or explicitly exclude NULLs when using NOT IN. You can adapt and use the following example code.

SELECT  ...
FROM    a
WHERE   a.key IS NOT NULL
  AND   a.key NOT IN (SELECT key FROM b WHERE key IS NOT NULL);