Problem
A Delta Live Tables (DLT) pipeline using Auto Loader appears to hang or process data very slowly when handling large datasets with a Glob filter in the file path. The Structured Streaming Hub (SSH) does not show progress for new micro-batches.
Cause
This behavior arises because Auto Loader's directory listing mode first lists all objects in the specified path and only then applies the Glob filter. The entire directory tree, including all subfolders, must be scanned even when the Glob pattern excludes most of it.
Cloud providers' list APIs do not support Glob filtering during the listing phase; they filter only by key prefix. As a result, Auto Loader must first discover all files and then apply the Glob filter to include or exclude them.
When the directory contains millions of files, this listing can take a significant amount of time and is executed in a single thread by default. The delay prevents timely micro-batch updates in the SSH. For example, a source directory containing over 500 million files can incur excessive backfill times because the listing process cannot filter out unnecessary files early.
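For illustration, here is a minimal sketch of the problematic pattern, assuming a Databricks notebook where `spark` is predefined; the bucket and Glob pattern are hypothetical:

```python
# Hypothetical Auto Loader read with a Glob filter in the path. In directory
# listing mode, every object under s3://my-bucket/events/ is still listed
# before the */2025/*.json pattern excludes anything.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://my-bucket/events/*/2025/*.json")
)
```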
Solution
File notification mode (recommended)
Databricks recommends enabling file notification mode when using Auto Loader. Instead of scanning directories, this mode uses cloud-native event notifications to detect new files, which eliminates directory listing delays.
Review the What is Auto Loader file notification mode? (AWS | Azure | GCP) documentation to learn how to configure file notification mode.
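As a minimal sketch, file notification mode is enabled with the `cloudFiles.useNotifications` option; the path here is hypothetical, and the cloud permissions the notification services require are covered in the linked documentation:

```python
# Enable file notification mode so Auto Loader discovers new files through
# cloud event notifications instead of listing the source directory.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("s3://my-bucket/events/")
)
```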
Directory listing mode
If file notification mode is not feasible, consider the following mitigations to improve performance in directory listing mode:
- Disable asynchronous directory listing for backfilling. The Apache Spark configuration `spark.databricks.cloudFiles.asyncDirListing` determines how backfill operations handle directory listings in Databricks. Disabling it (`spark.databricks.cloudFiles.asyncDirListing = false`) can distribute the listing tasks across executors, potentially speeding up the listing process when backfills take too long (see the first sketch after this list). Measure the performance tradeoff with your specific workloads to determine whether the change is beneficial.
- Partition your source directories. Instead of applying a broad Glob filter over a large directory tree, divide the data into smaller, more manageable partitions (see the second sketch after this list). For example:
  - Create separate streams for each specific path or time-based partition.
  - Consolidate data from multiple streams after processing.
- Periodically archive processed files. Move them to another location to reduce the size of the source directory and improve listing times for future runs.
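The following is a minimal sketch of the asynchronous directory listing mitigation, assuming a Databricks notebook where `spark` is predefined; the source path and format are hypothetical:

```python
# Disable asynchronous directory listing so backfill listing work can be
# distributed across executors, per the mitigation described above.
spark.conf.set("spark.databricks.cloudFiles.asyncDirListing", "false")

# Hypothetical Auto Loader stream whose backfill this setting affects.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://my-bucket/events/")
)
```

Benchmark a representative backfill with the setting enabled and disabled before adopting it.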
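And a sketch of the source-directory partitioning pattern in a DLT pipeline; the table names, paths, and year-based split are hypothetical, and consolidation is shown as a simple union:

```python
import dlt

# Hypothetical: one narrower stream per time-based partition instead of a
# single broad Glob filter over the whole directory tree.
@dlt.table
def events_2024():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/events/2024/")
    )

@dlt.table
def events_2025():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/events/2025/")
    )

# Consolidate the per-partition streams after processing.
@dlt.table
def events_all():
    return dlt.read_stream("events_2024").unionByName(dlt.read_stream("events_2025"))
```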
These steps help ensure that micro-batches are processed efficiently, reducing delays and avoiding the appearance of a "stuck" pipeline in the SSH.