Timestamp change to underlying Apache Parquet/change data files while using Change Data Capture (CDC)

For timestamp-based queries, ensure that the original file timestamps are preserved during the migration process.

Written by raphael.balogo

Last published at: September 12th, 2024

Problem

When using Change Data Capture (CDC) to consume incremental data from an external table, you see that the underlying Apache Parquet or change data files’ timestamp changes. 

This problem typically arises after moving the files to a different workspace and changing the underlying S3 bucket for the table. Despite copying all S3 files, including change data, to a new location and recreating the external table, CDC queries based on timestamp fail, while queries based on version number succeed.

Example of a failing query

select * from table_changes('db.schema.table','2024-05-30T18:09:55.000-04:00')

Example of a successful query

select * from table_changes('db.schema.table',4129)

Both queries are intended to retrieve the same change data.

Cause

Delta Lake uses the file modification time to determine the timestamp of a commit. 

When files are copied to a new S3 bucket, their timestamps change, and there is no option to preserve the original timestamps. As a result, CDC queries based on timestamp fail because they rely on the physical timestamp of the files, which no longer matches the original commit times. In contrast, version-based queries succeed because the delta_log versions remain consistent, regardless of the file timestamps.

Delta Lake's behavior is documented in a GitHub issue

Important

Cloning a Delta table creates a separate history, affecting time travel queries and change data feed. For more information, please refer to the Clone a table on Databricks (AWSAzureGCP) and Use Delta Lake change data feed on Databricks (AWSAzureGCP) documentation.

 

Solution

For timestamp-based queries, ensure that the original file timestamps are preserved during the migration process. If this is not possible, rely on version-based queries to retrieve change data.

Monitor the development of the in-commit timestamp feature, which is currently in preview. This feature aims to address the issue by using commit timestamps instead of file modification times. You can contact your account team to sign up for the Databricks private preview to access this feature earlier.

Review the Delta 4.0 roadmap and plan for its adoption once it becomes generally available in Databricks Runtime 16.x, as it includes enhancements to address this issue.

Was this article helpful?