Inner join drops records in result

Avoid dropped records when performing an inner join.

Last published at: May 23rd, 2022

Problem

You perform an inner join, but the resulting joined table is missing data.

For example, assume you have two tables, orders and models.

%python

df_orders = spark.createDataFrame([('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima',''), ('Nissan','Altima', None)], ["Company", "Model", "Info"])

Sample orders table.

%python

df_models = spark.createDataFrame([('Nissan','Altima',''), ('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','2-door 3.5 SE Coupe'), ('Nissan','Altima','4-door 2.5 S Sedan'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima','4-door 3.5 SL Sedan'), ('Nissan','Altima','4-door HYBRID Sedan'), ('Nissan','Altima',None)], ["Company", "Model", "Info"])

Sample models table.

You attempt a straight join of the two tables.

%python
df_orders.createOrReplaceTempView("Orders")
df_models.createOrReplaceTempView("Models")
SQL
Copy to clipboardCopy
SELECT *
MAGIC FROM Orders a
MAGIC INNER JOIN Models b
MAGIC ON a.Company = b.Company
MAGIC AND a.Model = b.Model
MAGIC AND a.Info = b.Info

The resulting joined table only includes three of the four records from the orders table. The record with a null value in a column does not appear in the results.

Results of a default inner join. Null values are dropped.

Cause

Apache Spark does not consider null values when performing a join operation.

If you attempt to join tables, and some of the columns contain null values, the null records will not be included in the resulting joined table.

Solution

If your source tables contain null values, you should use the Spark null safe operator (<=>).

When you use <=> Spark processes null values (instead of dropping them) when performing a join.

For example, if we modify the sample code with <=>, the resulting table does not drop the null values.

%sql

SELECT *
MAGIC FROM Orders a
MAGIC INNER JOIN Models b
MAGIC ON a.Company = b.Company
MAGIC AND a.Model = b.Model
MAGIC AND a.Info <=> b.Info

Results of a null safe inner join. Null values are visible.

Example notebook

Review the Inner join drops null values example notebook.

Databricks Help Center

Problem

Cause

Solution

Example notebook

Contact Us