Increased wait times between micro-batches in Auto Loader

Use file notification mode instead of the directory listing method.

Written by lakshay.goel

Last published at: September 10th, 2024

Problem 

When running an Auto Loader job in directory listing mode, you may experience increased wait time between micro-batches. 

Cause

When the input file path is a nested directory path, the job takes time to list all of the nested directories. The job has to wait for the worker threads doing this listing to make progress before it can process the next batch, which increases the wait time between micro-batches.

Solution

Use file notification mode instead of the directory listing method. Set cloudFiles.useNotifications to true in the readStream options. In this mode, the job does not spend time listing directories and instead processes the files available in the notification queue.
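
As an illustration, the following is a minimal sketch of a readStream call with the option set. The function names, the input path, the file format, and the schema location are hypothetical placeholders and not part of the original article. The sketch assumes an existing SparkSession on Databricks, and that any cloud permissions or additional options required by file notification mode are already configured for your workspace.

```python
def autoloader_notification_options(file_format, schema_location):
    """Build Auto Loader options with file notification mode enabled.

    file_format and schema_location are placeholders supplied by the caller.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Switch from directory listing mode to file notification mode.
        "cloudFiles.useNotifications": "true",
    }


def read_with_notifications(spark, input_path, file_format, schema_location):
    """Return a streaming DataFrame that reads input_path with Auto Loader.

    spark is assumed to be an active SparkSession on a Databricks cluster.
    """
    options = autoloader_notification_options(file_format, schema_location)
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(input_path)
    )
```

On a cluster, you would call read_with_notifications(spark, "<input-path>", "json", "<schema-location>") with your own values and then start a write stream on the result as usual.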

For more information, review the What is Auto Loader file notification mode? (AWS | Azure | GCP) documentation.

If you want to keep using directory listing mode, avoid using a nested directory path as the input. This reduces the time spent listing all of the directories.