Updated November 8th, 2022 by gopinath.chandrasekaran

Uncommitted files causing data duplication

Problem: You had a network issue (or similar) while a write operation was in progress. You are rerunning the job, but files left partially uncommitted by the failed run are causing unwanted data duplication. Cause: How the Databricks commit protocol works: The DBIO commit protocol (AWS | Azure | GCP) is transactional. Files are only committed after a trans...

1 min reading time
Updated February 17th, 2023 by gopinath.chandrasekaran

Recover from a DELTA_LOG corruption error

Problem: You attempt to query a Delta table and get an IllegalStateException error saying that the metadata could not be recovered. Error in SQL statement: IllegalStateException:  The metadata of your Delta table couldn't be recovered while Reconstructing version: 691193. Did you manually delete files in the _delta_log directory? Set spar...

2 min reading time
Updated October 28th, 2022 by gopinath.chandrasekaran

Structured streaming jobs slow down on every 10th batch

Problem: You are running a series of structured streaming jobs and writing to a file sink. Every 10th run appears to be slower than the runs before it. Cause: The file sink creates a _spark_metadata folder in the target path. This metadata folder stores information about each batch, including which files are part of the batch. This is required to prov...

1 min reading time