Problem
When running an Auto Loader job in directory listing mode, you may experience increased wait time between micro-batches.
Cause
When the input file path points to a nested directory structure, the job takes time to list all of the nested directories. The job must wait for worker threads to make progress on the listing before it can process the next batch, which increases the wait time between micro-batches.
Solution
Use file notification mode instead of directory listing mode. Set cloudFiles.useNotifications to true in the readStream options. This avoids the time spent listing directories, because the job processes the files that are available in the notification queue.
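The following is a minimal sketch of how the option might be set on an Auto Loader stream in Python. The helper names (notification_options, build_stream), the JSON file format, and the input and schema paths are illustrative assumptions, not values from your job. It also assumes a Databricks runtime where the cloudFiles source is available and where the cluster has the cloud permissions that file notification mode needs.

```python
# Sketch only: helper names, paths, and the file format are hypothetical.

def notification_options(file_format: str, schema_location: str) -> dict:
    """Build Auto Loader options with file notification mode enabled."""
    return {
        "cloudFiles.format": file_format,
        # Switch from directory listing mode to file notification mode.
        "cloudFiles.useNotifications": "true",
        # Location where Auto Loader tracks the inferred schema.
        "cloudFiles.schemaLocation": schema_location,
    }


def build_stream(spark, input_path: str, file_format: str, schema_location: str):
    """Return a streaming DataFrame that reads input_path with Auto Loader."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**notification_options(file_format, schema_location))
        .load(input_path)
    )
```

In a notebook, where spark is already defined, a call might look like build_stream(spark, "<input-path>", "json", "<schema-path>"), with both placeholders replaced by your own locations.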
For more information, please review the What is Auto Loader file notification mode? (AWS | Azure | GCP) documentation.
If you want to keep using directory listing mode, avoid using a nested directory path as the input. This reduces the time spent listing directories.