Get the path of files consumed by Auto Loader

When you process streaming files with Auto Loader, events are logged as files are created in the underlying storage.

This article shows you how to add the full path of each ingested file to a new column in the output DataFrame.

One use case for this is auditing. When files are ingested into a partitioned folder structure, the path often contains useful metadata, such as the timestamp, which can be extracted for auditing purposes.

For example, assume a file path and filename of 2020/2021-01-01/file1_T191634.csv.

From this path you can apply custom UDFs or use regular expressions to extract details like the date (2021-01-01) and the timestamp (T191634), as shown in the sketch after the example code below.

The following example code uses input_file_name() to get the path and filename for every row and writes it to a new column named filePath.

import org.apache.spark.sql.functions.input_file_name

// schema describes the incoming CSV files and is defined elsewhere.
val df = spark.readStream.format("cloudFiles")
  .schema(schema)
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.region", "ap-south-1")
  .load("path")
  .withColumn("filePath", input_file_name())