Get the path of files consumed by Auto Loader

Get the path and filename of all files consumed by Auto Loader and write them out as a new column.

Written by Adam Pavlacka

Last published at: May 18th, 2022

When you process streaming files with Auto Loader (AWS | Azure | GCP), events are logged based on the files created in the underlying storage.

This article shows you how to add the full path and filename of each source file to a new column in the output DataFrame.

One use case for this is auditing. When files are ingested into a partitioned folder structure, the path often contains useful metadata, such as a timestamp, that can be extracted for auditing purposes.

For example, assume a file path and filename of 2020/2021-01-01/file1_T191634.csv.

From this path you can apply custom UDFs and use regular expressions to extract details like the date (2021-01-01) and the timestamp (T191634).
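For instance, here is a minimal sketch using Spark's built-in regexp_extract function rather than a custom UDF; the sample DataFrame, column names, and regex patterns are illustrative assumptions based on the example path above, not part of the original article:

%scala

import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

// Illustrative sketch: parse the date and timestamp out of a path string.
// The sample DataFrame and column names are assumptions for this example.
val pathsDf = Seq("2020/2021-01-01/file1_T191634.csv").toDF("filePath")

val parsed = pathsDf
  .withColumn("fileDate", regexp_extract($"filePath", """(\d{4}-\d{2}-\d{2})""", 1))
  .withColumn("fileTime", regexp_extract($"filePath", """_(T\d{6})""", 1))

// parsed.show() displays fileDate = 2021-01-01 and fileTime = T191634.

The same patterns work on the filePath column produced by the Auto Loader example that follows.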

The following example code uses input_file_name() to get the path and filename for every row and writes it to a new column named filePath.

%scala

import org.apache.spark.sql.functions.input_file_name

// schema is the predefined schema of the incoming CSV files and "path"
// is the input directory that Auto Loader monitors.
val df = spark.readStream.format("cloudFiles")
  .schema(schema)
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.region", "ap-south-1")
  .load("path")
  .withColumn("filePath", input_file_name())
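Because df is a streaming DataFrame, nothing is read until you start a query against it. As a minimal sketch of persisting the results, including the filePath column; the Delta format, checkpoint location, and output path below are placeholder assumptions, not part of the original example:

%scala

// Start the stream and write the rows, with their source file paths, to a sink.
// The checkpoint and output locations are placeholders.
df.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoint") // placeholder
  .start("/tmp/output")                            // placeholder

Each row written to the sink carries the full path of the file it was read from in the filePath column, which you can then parse for auditing as shown earlier.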