Problem
When using the replaceWhere option during data ingestion to overwrite specific partitions of a Delta table, you notice that new data is appended to, rather than replacing, the old data, resulting in duplicate records.
Cause
The replaceWhere option is intended to be used during the write operation, not the read operation. When specified during the read operation it has no effect, so the subsequent write appends the new data and produces the duplicates described above.
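The effect can be sketched in plain Python (a minimal simulation only; a real Delta table is not a Python list, and the row layout here is illustrative): an option supplied on the read side does not change how the write behaves, so the new rows are simply appended, whereas replaceWhere applied on the write first removes the rows matching the predicate and then inserts the replacements.

```python
# Illustrative simulation of the two write behaviors; not Delta Lake code.

def append_write(table, new_rows):
    # What effectively happens when replaceWhere is given on the read:
    # the option is ignored, so the new rows are appended.
    return table + new_rows

def replace_where_write(table, new_rows, predicate):
    # replaceWhere on the write: drop the rows matching the predicate,
    # then insert the replacement rows, as one logical operation.
    return [row for row in table if not predicate(row)] + new_rows

table = [
    {"date": "2017-01-15", "event": "a"},  # stale January row
    {"date": "2017-02-03", "event": "b"},  # row outside the predicate
]
replacement = [{"date": "2017-01-20", "event": "a2"}]
in_january = lambda row: "2017-01-01" <= row["date"] <= "2017-01-31"

duplicated = append_write(table, replacement)                    # January present twice
replaced = replace_where_write(table, replacement, in_january)   # January replaced once
```

Here the append path leaves both the stale and the new January rows in the table, while the replace-where path keeps only the replacement rows for January and leaves other months untouched.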
Solution
Modify your Delta Lake pipeline to include the replaceWhere option during the write operation.
Example
In this example, the replaceWhere option atomically replaces all records for January 2017 in the target table with the data in replace_data.
replace_data.write \
  .format("delta") \
  .mode("overwrite") \
  .option("replaceWhere", "start_date >= '2017-01-01' AND end_date <= '2017-01-31'") \
  .save("/tmp/delta/events")
If you are using Delta Live Tables (DLT), use Data Manipulation Language (DML) to remove duplicates. You can use the INSERT INTO ... REPLACE WHERE statement on your target streaming table to eliminate duplicates.
INSERT INTO target_table
REPLACE WHERE start_date >= '2017-01-01' AND end_date <= '2017-01-31'
SELECT * FROM replace_data

Here replace_data must be a table or view containing the replacement rows.
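The statement's behavior can be sketched in plain Python (a simulation only; the list-of-dicts table model and the names are illustrative): REPLACE WHERE deletes the target rows matching the predicate and inserts the source rows as one operation, so re-running the statement with the same source does not create duplicates.

```python
# Illustrative simulation of INSERT INTO ... REPLACE WHERE semantics;
# a real streaming table is not a Python list.

def insert_replace_where(target, source, predicate):
    # Delete rows in the target matching the predicate, then insert the
    # source rows, as a single logical operation.
    return [row for row in target if not predicate(row)] + list(source)

target = [
    {"start_date": "2017-01-10", "value": 1},  # stale January row
    {"start_date": "2017-03-05", "value": 9},  # row outside the predicate
]
source = [{"start_date": "2017-01-10", "value": 2}]
in_january = lambda row: "2017-01-01" <= row["start_date"] <= "2017-01-31"

once = insert_replace_where(target, source, in_january)
twice = insert_replace_where(once, source, in_january)  # re-run is idempotent
```

Because matching rows are deleted before the insert, repeating the statement replaces rather than accumulates January data, which is why it eliminates the duplicates that a plain append would create.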