Delta writing empty files when source is empty

Delta can write empty files under Databricks Runtime 7.3 LTS. You should upgrade to Databricks Runtime 9.1 LTS or above to resolve the issue.

Written by Rajeev kannan Thangaiah

Last published at: December 2nd, 2022

Problem

Delta writes can result in the creation of empty files if the source is empty. This can happen with a regular Delta write or a MERGE INTO (AWS | Azure | GCP) operation.

If your streaming application writes to a target Delta table and the source data is empty in certain micro-batches, those micro-batches can write empty files to the target Delta table.

Avoid writing empty files to a Delta table because they can cause performance issues, such as too many small files and multiple unnecessary commits. When commits happen at a high frequency (because of a very large inflow of high-frequency events, a short streaming trigger interval, or both), the target Delta table can accumulate a large number of small, empty files. These files increase the overall file listing time and can degrade subsequent read performance.
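To get a feel for how quickly these files can accumulate, the following is a rough back-of-the-envelope sketch. It assumes one commit per micro-batch and one empty file per empty commit. The function name, the 60-second trigger interval, and the 75% share of empty micro-batches are illustrative assumptions, not figures from a real workload.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def empty_files_per_day(trigger_interval_seconds, empty_batch_fraction):
    """Estimate empty files per day, assuming one empty file per empty micro-batch."""
    batches_per_day = SECONDS_PER_DAY // trigger_interval_seconds
    return round(batches_per_day * empty_batch_fraction)


# Hypothetical stream: one micro-batch per minute, 75% of them with no source data.
per_day = empty_files_per_day(60, 0.75)
print(per_day)       # 1080
print(per_day * 30)  # 32400 after 30 days
```

Under these assumptions, a table that receives almost no data still gains more than a thousand files per day.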

Cause

The writing of empty files is a known issue in Databricks Runtime 7.3 LTS. Empty writes create additional files as well as new versions in Delta.

If there are 1,000 empty writes in a day, 1,000 empty files are created, and they accumulate over time. Even a table with just three records can end up with several thousand empty files, depending on how frequently writes are performed.

For example, in the sample Delta commits below, numOutputRows is 0 in both cases, yet the write reports numFiles as 1 and the merge reports numTargetFilesAdded as 1. This means each commit added one file, even though there were no output rows.

Operation - Write
{"numFiles":"1","numOutputBytes":"2675","numOutputRows":"0"}
OperationParameters {"mode":"Append","partitionBy":"[]"}

Operation - Merge
{"numOutputRows":"0","numSourceRows":"0","numTargetFilesAdded":"1","numTargetFilesRemoved":"0","numTargetRowsCopied":"0","numTargetRowsDeleted":"0","numTargetRowsInserted":"0","numTargetRowsUpdated":"0"}
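
The check described above can be sketched as a small helper that flags a commit when it reports zero output rows but at least one added file. This is only an illustration: the function name is hypothetical, and it assumes the metrics arrive as a dictionary of strings, which is how the operationMetrics column of DESCRIBE HISTORY typically appears when collected in Python.

```python
def is_empty_commit(operation_metrics):
    """Return True if a commit added at least one file but wrote zero rows."""
    rows = int(operation_metrics.get("numOutputRows", 0))
    # Write commits report numFiles; merge commits report numTargetFilesAdded.
    files_added = int(
        operation_metrics.get(
            "numFiles", operation_metrics.get("numTargetFilesAdded", 0)
        )
    )
    return rows == 0 and files_added > 0


# Metrics copied from the two sample commits above.
write_metrics = {"numFiles": "1", "numOutputBytes": "2675", "numOutputRows": "0"}
merge_metrics = {
    "numOutputRows": "0",
    "numSourceRows": "0",
    "numTargetFilesAdded": "1",
    "numTargetFilesRemoved": "0",
    "numTargetRowsCopied": "0",
    "numTargetRowsDeleted": "0",
    "numTargetRowsInserted": "0",
    "numTargetRowsUpdated": "0",
}

print(is_empty_commit(write_metrics))  # True
print(is_empty_commit(merge_metrics))  # True
```

Applying a helper like this to each row of a table's history gives a rough count of how many commits wrote empty files.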

Solution

You should upgrade your clusters to Databricks Runtime 9.1 LTS or above.

Databricks Runtime 9.1 LTS and above contains a fix for the issue and no longer creates empty files for empty writes.

Info

If you cannot upgrade to Databricks Runtime 9.1 LTS or above, you should periodically run OPTIMIZE (AWS | Azure | GCP) on the affected table to clean up the empty files. This is not a permanent fix and should be considered a workaround until you can upgrade to a newer runtime.
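
As a minimal sketch of this workaround, the helper below issues an OPTIMIZE statement for each affected table. It assumes a `spark` session such as the one available in a Databricks notebook. The function name and the table names in the usage comment are hypothetical, and how often to run it (for example, from a scheduled job) depends on your write frequency.

```python
def optimize_tables(spark, table_names):
    """Run OPTIMIZE on each table and return the statements that were issued."""
    statements = []
    for name in table_names:
        statement = f"OPTIMIZE {name}"
        spark.sql(statement)  # compacts the table's small files
        statements.append(statement)
    return statements


# Example usage in a notebook (table names are placeholders):
# optimize_tables(spark, ["my_db.events_target", "my_db.orders_target"])
```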