A common issue when performing append operations on Delta tables is duplicate data.
For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table.
In this article, we review basic troubleshooting steps that you can use to identify duplicate records, as well as the user name, and notebooks or jobs that resulted in the duplicate data.
Identify columns with duplicate records
%sql select count(*) as count, <column-name> from <table-name> group by <column-name> order by <column-name>
The output identifies all columns with duplicate data.
Identify input files with duplicate data
Select a data point from the previous query and use it to determine which files provided duplicate data.
%sql select *, input_file_name() as path from <table-name> where <column-name>=<any-duplicated-value>
The output includes a column called path, which identifies the full path to each input file.
Identify the location table
%sql describe table extended <table-name>
Use the location table results to search for parquet paths
%sh grep -r 'part-<filename-01>.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log
%sh grep -r 'part-<filename-02.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log
The results allow you to identify the impacted Delta versions.
Check the Delta history for the impacted versions
%sql select * from (describe history <table-name> ) t where t.version In(0,1)
The Delta history results provide the user name, as well as the notebook or job id that caused the duplicate to appear in the Delta table.
Now that you have identified the source of the duplicate data, you can modify the notebook or job to prevent it from happening.
Example notebook
Review the Identify duplicate data on append example notebook.