Handling Data Duplication Issues with Databricks Autoloader and Delta Lake using replaceWhere

Ensure that the replaceWhere option is applied during the write operation

Written by Ravivarma S

Last published at: September 12th, 2024

Problem 

When using the replaceWhere option during data ingestion to overwrite specific partitions of a Delta table, you notice that new data is appended to, rather than replacing, the old data, resulting in duplicates. 

Cause

The replaceWhere option is meant to be applied during the write operation, not the read operation. When it is set on the read instead, it has no effect, so the subsequent write appends the new data alongside the old and produces the duplicates described above. 

Solution

Modify your Delta Lake pipeline to include the replaceWhere option during the write operation.

Example 

In this example, the replaceWhere option is used to atomically replace all records in the month of January 2017 in the target table with the data in replace_data.

# Atomically replace all records for January 2017 in the target Delta table
replace_data.write \
  .format("delta") \
  .mode("overwrite") \
  .option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'") \
  .save("/tmp/delta/events")
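Conceptually, replaceWhere performs a "delete the rows matching the predicate, then insert the new rows" operation in a single atomic commit. The following plain-Python sketch models that semantics to show why no duplicates remain; the function and sample rows are illustrative only, not the Delta Lake API.

```python
# Conceptual model of replaceWhere semantics (illustrative, not the Delta API).
def replace_where(table, new_rows, predicate):
    """Drop rows matching the predicate, then append the replacement rows.

    Rows outside the predicate's range are left untouched, so the matched
    range is replaced rather than duplicated.
    """
    kept = [row for row in table if not predicate(row)]
    return kept + new_rows

# Existing table: two January 2017 events and one February event.
events = [
    {"id": 1, "start_date": "2017-01-05"},
    {"id": 2, "start_date": "2017-01-20"},
    {"id": 3, "start_date": "2017-02-02"},
]

# Replacement data for January 2017.
replace_data = [{"id": 4, "start_date": "2017-01-10"}]

in_january = lambda row: "2017-01-01" <= row["start_date"] <= "2017-01-31"
result = replace_where(events, replace_data, in_january)
# The January rows (ids 1 and 2) are replaced; the February row is kept.
```

Had the predicate been applied on the read side instead, both January rows and the replacement row would coexist in the table, which is exactly the duplication described in the Problem section.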

If you are using Delta Live Tables (DLT), use Data Manipulation Language (DML) to remove duplicates. You can use an INSERT INTO ... REPLACE WHERE statement on your target streaming table to eliminate duplicates.

INSERT INTO target_table
REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31'
SELECT * FROM source_table  -- source_table is a placeholder for your incoming data