Uncommitted files causing data duplication
Problem: You had a network issue (or similar) while a write operation was in progress. You are rerunning the job, but partially uncommitted files left over from the failed run are causing unwanted data duplication. Cause: How the Databricks commit protocol works. The DBIO commit protocol (AWS | Azure | GCP) is transactional. Files are only committed after a trans...
1 min reading time
Recover from a DELTA_LOG corruption error
Problem: You are attempting to query a Delta table when you get an IllegalStateException error saying that the metadata could not be recovered. Error in SQL statement: IllegalStateException: The metadata of your Delta table couldn't be recovered while Reconstructing version: 691193. Did you manually delete files in the _delta_log directory? Set spar...
2 min reading time
Structured streaming jobs slow down on every 10th batch
Problem: You are running a series of structured streaming jobs and writing to a file sink. Every 10th run is slower than the runs before it. Cause: The file sink creates a _spark_metadata folder in the target path. This metadata folder stores information about each batch, including which files are part of the batch. This is required to prov...
1 min reading time