Problem
You may encounter an issue where Auto Loader does not pick up new files in directory listing mode (AWS | Azure | GCP) after the file naming convention in the source cloudFiles location has changed.
Cause
This is related to the way lexical ordering works when using directory listing mode in Auto Loader. When a file that follows a new naming convention sorts lexically before files that have already been processed, it is not recognized as new.
Example
This Python example demonstrates one possible scenario.
# List of sample filenames in the source cloudFiles location
filenames = [
    "MYAPP_1970-01-01.parquet",   # <-- older file
    "MYAPPX_1970-01-02.parquet",  # <-- newer file with a slightly modified naming convention (X added before the _)
]
# Sort the filenames lexicographically
for filename in sorted(filenames):
    print(filename)
# Output:
# MYAPPX_1970-01-02.parquet  <-- new file listed first (considered oldest)
# MYAPP_1970-01-01.parquet   <-- old file listed last (considered newest)
To better understand how lexical ordering works, please review the Lexical ordering of files (AWS | Azure | GCP) documentation.
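The following sketch shows why the two sample filenames sort in that order. It is a simplified model for illustration only, not Auto Loader's actual implementation; the variable names (old_file, new_file, last_seen, is_picked_up) are hypothetical.

```python
# Simplified illustration of a lexical comparison between two filenames.
# This is NOT Auto Loader's implementation; it only models the ordering.
old_file = "MYAPP_1970-01-01.parquet"
new_file = "MYAPPX_1970-01-02.parquet"

# Find the first position where the two names differ.
diff_index = next(
    i for i, (a, b) in enumerate(zip(old_file, new_file)) if a != b
)
print(diff_index, repr(old_file[diff_index]), repr(new_file[diff_index]))

# Python compares strings by code point: "X" is 88 and "_" is 95,
# so the new name sorts before the old one.
print(ord("X"), ord("_"))

# Assumption: a listing that only accepts names greater than the last
# name it has seen would skip the new file.
last_seen = old_file
is_picked_up = new_file > last_seen
print(is_picked_up)  # False
```

Because the comparison is decided at the first differing character, the later date in the new filename never comes into play.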
Solution
Use file notification mode
Use file notification mode (AWS | Azure | GCP) instead of directory listing mode. File notification mode is lower-latency, can be more cost-effective, and helps avoid lexical ordering issues.
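As a minimal sketch, file notification mode can be requested through the cloudFiles.useNotifications reader option. The dictionary name, the parquet format, and the paths in the commented usage are assumptions for illustration; the cloud permissions and resources that file notification mode requires are not shown.

```python
# Reader options for file notification mode (a sketch, not a complete setup).
notification_options = {
    # Assumed format, matching the .parquet files in the example above.
    "cloudFiles.format": "parquet",
    # Use file notification mode instead of directory listing mode.
    "cloudFiles.useNotifications": "true",
}

# Assumed usage in a Databricks notebook, where `spark` is predefined.
# Both paths are hypothetical placeholders.
# df = (
#     spark.readStream.format("cloudFiles")
#     .options(**notification_options)
#     .option("cloudFiles.schemaLocation", "/tmp/hypothetical/schema")
#     .load("/tmp/hypothetical/source")
# )
```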
Disable incremental listing
If you cannot use file notification mode, you should disable incremental listing by setting the Apache Spark option cloudFiles.useIncrementalListing to false.
This allows new files to be picked up, although it may increase the time spent listing files.
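A minimal sketch of this setting on a streaming read is shown here. The dictionary name, the parquet format, and the paths in the commented usage are assumptions for illustration.

```python
# Reader options that keep directory listing mode but turn off
# incremental listing (a sketch under the assumptions stated above).
full_listing_options = {
    # Assumed format, matching the .parquet files in the example above.
    "cloudFiles.format": "parquet",
    # Disable incremental listing so that files are not skipped
    # because of their lexical position.
    "cloudFiles.useIncrementalListing": "false",
}

# Assumed usage in a Databricks notebook, where `spark` is predefined.
# Both paths are hypothetical placeholders.
# df = (
#     spark.readStream.format("cloudFiles")
#     .options(**full_listing_options)
#     .option("cloudFiles.schemaLocation", "/tmp/hypothetical/schema")
#     .load("/tmp/hypothetical/source")
# )
```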
Note
Incremental listing mode is deprecated and should not be used. For more information, review the Incremental Listing (deprecated) (AWS | Azure | GCP) documentation.
For more information and best practices on using Auto Loader, review the Configure Auto Loader for production workloads (AWS | Azure | GCP) documentation.