Problem
Delta writes can create empty files when the source DataFrame is empty. This can happen with a regular Delta write or a MERGE INTO (AWS | Azure | GCP) operation.
If your streaming application writes to a target Delta table and the source data is empty for certain micro-batches, empty files can be written to the target Delta table.
Writing empty files to a Delta table should be avoided because they cause performance issues, such as too many small files and unnecessary commits. If commits happen at high frequency (due to a large inflow of high-frequency events, a short streaming trigger interval, or both), the target Delta table accumulates a large number of small, empty files. These files increase the overall file-listing time and can degrade subsequent read performance.
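Until you can upgrade, a common workaround (a sketch, not part of the original fix) is to skip the write when a micro-batch is empty. The guard itself is a trivial row-count check, factored into a plain function below; the commented `foreachBatch` wiring is a hypothetical illustration that assumes a Structured Streaming source and a Delta target path.

```python
def should_write(batch_row_count: int) -> bool:
    """Return True only when the micro-batch actually contains rows."""
    return batch_row_count > 0


# Hypothetical usage inside a Structured Streaming foreachBatch sink
# (assumes `source_stream` is a streaming DataFrame and "/delta/target"
# is your target table path -- both are illustrative names):
#
# def upsert_to_delta(batch_df, batch_id):
#     if should_write(batch_df.count()):  # skip empty micro-batches
#         batch_df.write.format("delta").mode("append").save("/delta/target")
#
# source_stream.writeStream.foreachBatch(upsert_to_delta).start()
```

Note that `batch_df.count()` forces an extra evaluation of the micro-batch, which is usually acceptable given that it prevents an otherwise wasted commit.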
Cause
The writing of empty files is a known issue in Databricks Runtime 7.3 LTS. Every empty write creates an additional file as well as a new version in the Delta transaction log. If there are 1,000 empty writes in a day, 1,000 empty files are created, and these accumulate over time. Even a table with just three records can end up with several thousand empty files, depending on how frequently writes are performed.
For example, in this sample Delta commit, numOutputRows is 0, but numTargetFilesAdded is 1. One file was added even though no rows were written.
Operation - Write
{"numFiles":"1","numOutputBytes":"2675","numOutputRows":"0"}

OperationParameters
{"mode":"Append","partitionBy":"[]"}

Operation - Merge
{"numOutputRows":"0","numSourceRows":"0","numTargetFilesAdded":"1","numTargetFilesRemoved":"0","numTargetRowsCopied":"0","numTargetRowsDeleted":"0","numTargetRowsInserted":"0","numTargetRowsUpdated":"0"}
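To find such empty commits in an existing table, you can scan its commit history (for example, the operationMetrics column returned by DESCRIBE HISTORY) for operations that added files without writing any rows. A minimal sketch operating on metrics maps shaped like the samples above, where values are strings:

```python
def is_empty_write(operation_metrics: dict) -> bool:
    """Flag a commit that added files even though it wrote zero rows.

    Accepts a Delta operationMetrics map (string values), covering both
    Write commits (numFiles) and Merge commits (numTargetFilesAdded).
    """
    rows = int(operation_metrics.get("numOutputRows", "0"))
    files = int(operation_metrics.get(
        "numFiles",
        operation_metrics.get("numTargetFilesAdded", "0"),
    ))
    return rows == 0 and files > 0


# The Write and Merge commits from the sample above:
write_metrics = {"numFiles": "1", "numOutputBytes": "2675",
                 "numOutputRows": "0"}
merge_metrics = {"numOutputRows": "0", "numSourceRows": "0",
                 "numTargetFilesAdded": "1", "numTargetFilesRemoved": "0"}

print(is_empty_write(write_metrics))  # True: 1 file added, 0 rows written
print(is_empty_write(merge_metrics))  # True: 1 target file added, 0 rows
```

This is diagnostic only; it identifies the wasted commits but does not remove the files. Compacting the table (for example with OPTIMIZE) is the usual way to clean up an accumulation of small files.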
Solution
Upgrade your clusters to Databricks Runtime 9.1 LTS or above.

Databricks Runtime 9.1 LTS and above contain a fix for this issue and no longer create empty files for empty writes.