Identify duplicate data on append operations

A common issue when performing append operations on Delta tables is duplicate data.

For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table.

In this article, we review basic troubleshooting steps that you can use to identify duplicate records, as well as the user name, and notebooks or jobs that resulted in the duplicate data.

Identify columns with duplicate records

select count(*) as count, <column-name> from <table-name> group by <column-name> order by <column-name>

The output identifies all columns with duplicate data.

Identify input files with duplicate data

Select a data point from the previous query and use it to determine which files provided duplicate data.

select *, input_file_name() as path from <table-name> where <column-name>=<any-duplicated-value>

The output includes a column called path, which identifies the full path to each input file.

Identify the location table

describe table extended <table-name>

Use the location table results to search for parquet paths

grep -r 'part-<filename-01>.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log
grep -r 'part-<filename-02.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log

The results allow you to identify the impacted Delta versions.

Check the Delta history for the impacted versions

select * from (describe history <table-name> ) t where t.version In(0,1)

The Delta history results provide the user name, as well as the notebook or job id that caused the duplicate to appear in the Delta table.

Now that you have identified the source of the duplicate data, you can modify the notebook or job to prevent it from happening.

Example notebook

Identify duplicate data on append example

Open notebook in new tab