Auto Loader fails to pick up new files when using directory listing mode

Use file notification mode or disable incremental listing.

Written by brock.baurer

Last published at: September 12th, 2024

Problem

You may encounter an issue where Auto Loader does not pick up new files in directory listing mode (AWS | Azure | GCP) after the file naming convention in the cloudFiles source location has changed.

Cause

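This is related to the way lexical ordering works when using directory listing mode in Auto Loader. Incremental listing assumes that new files sort lexically after files that have already been processed, so a new file whose name sorts before existing files is not recognized as new.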

Example

This Python example demonstrates a scenario that could occur.

# List of sample filenames in the source cloudFiles location
filenames = [
    "MYAPP_1970-01-01.parquet",   # <-- older file
    "MYAPPX_1970-01-02.parquet",  # <-- newer file with a slightly modified naming convention (X added before the _)
]
# Sort the filenames lexicographically ("X" sorts before "_" in code point order)
for filename in sorted(filenames):
    print(filename)
# Output:
# MYAPPX_1970-01-02.parquet  # new file listed first (considered oldest)
# MYAPP_1970-01-01.parquet   # old file listed last (considered newest)

To better understand how lexical ordering works, please review the Lexical ordering of files (AWS | Azure | GCP) documentation.
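
The effect on file discovery can be sketched with a simplified model. This is not Auto Loader's actual implementation; it only assumes that an incremental listing returns files whose names sort after the last processed file. The function name and the third file name are hypothetical.

```python
def incremental_listing(all_files, last_processed):
    """Simplified model: return only files that sort after the last processed name."""
    return sorted(f for f in all_files if f > last_processed)


last_processed = "MYAPP_1970-01-01.parquet"
source_files = [
    "MYAPP_1970-01-01.parquet",   # already processed
    "MYAPPX_1970-01-02.parquet",  # new file, changed naming convention
    "MYAPP_1970-01-03.parquet",   # new file, original naming convention
]

# Incremental listing misses the renamed file because it sorts before the last processed file
discovered = incremental_listing(source_files, last_processed)
print(discovered)  # ['MYAPP_1970-01-03.parquet']

# A full listing compares against everything already processed and finds both new files
full_listing = sorted(set(source_files) - {last_processed})
print(full_listing)  # ['MYAPPX_1970-01-02.parquet', 'MYAPP_1970-01-03.parquet']
```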

Solution

Use file notification mode

Use file notification mode (AWS | Azure | GCP) instead of directory listing mode. File notification mode is lower-latency, can be more cost-effective, and helps avoid lexical ordering issues.
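
As a minimal sketch, file notification mode is enabled with the cloudFiles.useNotifications option. The function names, the paths passed in, and the parquet format are assumptions for illustration, and the cloud-specific permissions and setup that file notification mode requires are not shown.

```python
def notification_mode_options(schema_location):
    """Auto Loader options that switch from directory listing to file notification mode."""
    return {
        "cloudFiles.format": "parquet",                # assumed from the example file names
        "cloudFiles.schemaLocation": schema_location,  # hypothetical path supplied by the caller
        "cloudFiles.useNotifications": "true",         # use file notification mode
    }


def read_with_notifications(spark, source_path, schema_location):
    """Build a streaming DataFrame; `spark` is an existing SparkSession on Databricks."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**notification_mode_options(schema_location))
        .load(source_path)
    )
```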

Disable incremental listing

If you cannot use file notification mode, you should disable incremental listing by setting the Apache Spark option cloudFiles.useIncrementalListing to false. This allows new files to be picked up, although it may increase the time spent listing files.
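
A sketch of this setting on an Auto Loader stream follows. The constant name, function name, schema location, and parquet format are hypothetical; only the cloudFiles.useIncrementalListing option comes from the text above.

```python
# Options for directory listing mode with incremental listing turned off
FULL_LISTING_OPTIONS = {
    "cloudFiles.format": "parquet",                     # assumed from the example file names
    "cloudFiles.schemaLocation": "/tmp/example/schema",  # hypothetical location
    "cloudFiles.useIncrementalListing": "false",         # always perform a full directory listing
}


def read_with_full_listing(spark, source_path):
    """Build a streaming DataFrame; `spark` is an existing SparkSession on Databricks."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**FULL_LISTING_OPTIONS)
        .load(source_path)
    )
```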

Note

Incremental listing mode is deprecated and should not be used. For more information, review the Incremental Listing (deprecated) (AWS | Azure | GCP) documentation.

For more information and best practices on using Auto Loader, review the Configure Auto Loader for production workloads (AWS | Azure | GCP) documentation.