Identify duplicate data on append operations

Last published at: May 10th, 2022

A common issue when performing append operations on Delta tables is duplicate data.

For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table.

In this article, we review basic troubleshooting steps that you can use to identify duplicate records, as well as the user name, and notebooks or jobs that resulted in the duplicate data.

Identify columns with duplicate records

%sql

select count(*) as count, <column-name> from <table-name> group by <column-name> order by <column-name>

The output identifies all columns with duplicate data.

Identify input files with duplicate data

Select a data point from the previous query and use it to determine which files provided duplicate data.

%sql

select *, input_file_name() as path from <table-name> where <column-name>=<any-duplicated-value>

The output includes a column called path, which identifies the full path to each input file.

Identify the location table

%sql

describe table extended <table-name>

Use the location table results to search for parquet paths

%sh

grep -r 'part-<filename-01>.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log

%sh

grep -r 'part-<filename-02.snappy.parquet' /dbfs/user/hive/warehouse/<path-to-log>/_delta_log

The results allow you to identify the impacted Delta versions.

Check the Delta history for the impacted versions

%sql

select * from (describe history <table-name> ) t where t.version In(0,1)

The Delta history results provide the user name, as well as the notebook or job id that caused the duplicate to appear in the Delta table.

Now that you have identified the source of the duplicate data, you can modify the notebook or job to prevent it from happening.

Example notebook

Review the Identify duplicate data on append example notebook.

Databricks Help Center