Problem
You are streaming from Delta Live Tables when you notice that the Written metric in the Data Quality tab shows a different number of records than the actual number of rows in the target table. The Data Quality tab can be found in the settings of the DLT pipeline used for the run.
If you use SELECT count(*)
to query the number of rows in the target table, the result can be lower than what is displayed in the Written metric.
The issue can be observed in the Databricks workspace and the problem is not specific to any particular configuration or setting but may be more prevalent in streaming use cases.
Cause
The Written metric in the Data Quality tab represents the number of records that have been processed and attempted to be written to the target table. This includes all records that have passed through the pipeline, regardless of whether they were ultimately inserted into the table.
The discrepancy between the Written metric and the actual number of rows in the table can be attributed to various factors, such as:
- Duplicate records being processed and written, but later dropped or deleted due to constraints or data quality rules defined in the DLT pipeline.
- Records being ignored or not stored due to failing validation or other related reasons.
- The Written metric counting records that were written to the table but later rolled back due to transactional failures or other issues.
Solution
It is important to understand that the Written metric includes all records that have been processed, regardless of whether they were ultimately inserted into the table. It is not a count of the actual number of rows in the target table.
There are steps you can take to track the amount of data being written to a table.
- Utilize the DLT expectations feature to validate row counts across tables in the pipeline. This can help you get a better understanding of how much data is being written to the table at each stage in the DLT pipeline.
- Review the data quality rules and constraints defined in the DLT pipeline to ensure that they are not causing records to be dropped or ignored unnecessarily.
- Monitor transactions and rollbacks to ensure that records are not being written to the table, as they could have been rolled back due to transactional failures or other issues.
For more information on the expectations feature, review the What are Delta Live Tables expectations? (AWS | Azure | GCP) documentation.