A common issue when performing append operations on Delta tables is duplicate data.
For example, assume user 1 appends a batch of records to Delta table A while, at the same time, user 2 appends the same batch. Both writes succeed, and the table ends up with duplicate records.
In this article, we review basic troubleshooting steps that you can use to identify duplicate records, as well as the user name and the notebook or job that produced them.
select <column-name>, count(*) as count from <table-name> group by <column-name> having count(*) > 1 order by <column-name>
The output lists each value of <column-name> that appears more than once, along with its duplicate count.
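The GROUP BY / HAVING logic in the query above can be illustrated with a minimal sketch. This uses sqlite3 and a hypothetical events table with an id column (both names are assumptions for illustration); against your Delta table you would run the SQL query itself.

```python
import sqlite3

# Hypothetical "events" table in which the id value 2 was appended twice.
conn = sqlite3.connect(":memory:")
conn.execute("create table events (id integer, payload text)")
conn.executemany(
    "insert into events values (?, ?)",
    [(1, "a"), (2, "b"), (2, "b"), (3, "c")],  # id 2 is duplicated
)

# Same shape as the troubleshooting query: group on the column of
# interest and keep only values that occur more than once.
duplicates = conn.execute(
    """
    select id, count(*) as count
    from events
    group by id
    having count(*) > 1
    order by id
    """
).fetchall()

print(duplicates)  # [(2, 2)] -> the value 2 appears twice
```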
Select a data point from the previous query and use it to determine which files provided duplicate data.
select *, input_file_name() as path from <table-name> where <column-name>=<any-duplicated-value>
The output includes a column called path, which identifies the full path to each input file.
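The path values are full paths, but the grep step that follows only needs the part-file name itself. A minimal sketch of stripping the directory portion, using a hypothetical path value (your table path and part-file name will differ):

```python
import posixpath

# Hypothetical value from the path column; actual paths will differ.
path = "dbfs:/user/hive/warehouse/my_table/part-00000-1a2b3c4d.snappy.parquet"

# Drop the directory portion to get the part-file name to search for
# in the _delta_log directory.
filename = posixpath.basename(path)
print(filename)  # part-00000-1a2b3c4d.snappy.parquet
```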
grep -r 'part-<filename-01>.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log
grep -r 'part-<filename-02>.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log
Each match points to a JSON commit file in the _delta_log directory; the zero-padded numeric file name (for example, 00000000000000000001.json) is the Delta version that committed the duplicate file, so the results allow you to identify the impacted Delta versions.
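The mapping from grep hits to Delta versions can be sketched as follows. This builds a hypothetical _delta_log directory with two fabricated commit files (all names and contents are assumptions for illustration) and reads the version number out of each matching file name:

```python
import json
import os
import tempfile

# Hypothetical _delta_log directory: version 1 adds the duplicated
# part file, version 0 adds an unrelated one.
log_dir = tempfile.mkdtemp()
target = "part-00000-1a2b3c4d.snappy.parquet"
commits = {
    "00000000000000000000.json": {"add": {"path": "part-00000-aaaa.snappy.parquet"}},
    "00000000000000000001.json": {"add": {"path": target}},
}
for name, entry in commits.items():
    with open(os.path.join(log_dir, name), "w") as f:
        f.write(json.dumps(entry))

# Equivalent of the grep step: find every commit file that mentions the
# part file, then read the Delta version from the zero-padded file name.
impacted_versions = sorted(
    int(name.split(".")[0])
    for name in os.listdir(log_dir)
    if name.endswith(".json")
    and target in open(os.path.join(log_dir, name)).read()
)
print(impacted_versions)  # [1]
```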
select * from (describe history <table-name>) t where t.version in (0, 1)
The Delta history results provide the user name, as well as the notebook or job ID, for each write that introduced the duplicates into the Delta table.
Now that you have identified the source of the duplicate data, you can modify the notebook or job so it does not write duplicates again.