Updated February 17th, 2023 by gopinath.chandrasekaran

Recover from a DELTA_LOG corruption error

Problem: You are attempting to query a Delta table when you get an IllegalStateException error saying that the metadata could not be recovered. Error in SQL statement: IllegalStateException: The metadata of your Delta table couldn't be recovered while Reconstructing version: 691193. Did you manually delete files in the _delta_log directory? Set spar...

2 min reading time
Updated September 12th, 2024 by gopinath.chandrasekaran

Handling WARN Message: 'Could not turn on CDF for table (table-name)' in Delta Live Tables Pipeline

Problem: While running a Databricks Delta Live Tables (DLT) pipeline, you encounter a WARN message in the DLT event logs: Could not turn on CDF for table <table-name>. The table contains reserved columns [_change_type, _commit_version, _commit_timestamp] that will be used internally as metadata for the table's Change Data Feed. Change Data Feed ...
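
The warning above names three reserved column names. As a minimal sketch, assuming you already have the table's column names as a plain Python list, a check like the following can flag a collision before the pipeline runs. The helper name `find_reserved_cdf_columns` and the sample schema are hypothetical, and the exact, case-sensitive match is an assumption of this sketch rather than documented Databricks behavior.

```python
# Column names the warning lists as reserved for Change Data Feed metadata.
RESERVED_CDF_COLUMNS = ("_change_type", "_commit_version", "_commit_timestamp")


def find_reserved_cdf_columns(column_names):
    """Return the reserved names present in column_names, in reserved-list order.

    Hypothetical helper for illustration; matching is exact and case-sensitive,
    which is an assumption of this sketch.
    """
    present = set(column_names)
    return [name for name in RESERVED_CDF_COLUMNS if name in present]


# Hypothetical schema, for illustration only.
columns = ["id", "amount", "_change_type", "updated_at"]
conflicts = find_reserved_cdf_columns(columns)
if conflicts:
    print("Reserved CDF column names found:", ", ".join(conflicts))
```

In a real pipeline the list would typically come from the table's schema (for example, a DataFrame's `columns` attribute) rather than a hard-coded list.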

0 min reading time
Updated August 29th, 2024 by gopinath.chandrasekaran

Streaming application missing data from a Delta table when writing to a given destination

Problem: When using a streaming application to stream data from a Delta table and write to a given destination, you notice data loss. Cause: When you use startingVersion=latest to work around a failed streaming job, the tradeoff is possible data loss. The restarted query will read only from the latest available Delta version of the so...

0 min reading time
Updated October 28th, 2022 by gopinath.chandrasekaran

Structured streaming jobs slow down on every 10th batch

Problem: You are running a series of structured streaming jobs and writing to a file sink. Every 10th run appears slower than the preceding runs. Cause: The file sink creates a _spark_metadata folder in the target path. This metadata folder stores information about each batch, including which files are part of the batch. This is required to prov...

1 min reading time
Updated November 8th, 2022 by gopinath.chandrasekaran

Uncommitted files causing data duplication

Problem: You had a network issue (or similar) while a write operation was in progress. You are rerunning the job, but uncommitted files left over from the failed run are causing unwanted data duplication. Cause: How the Databricks commit protocol works: The DBIO commit protocol (AWS | Azure | GCP) is transactional. Files are only committed after a trans...

1 min reading time