When you process streaming files with Auto Loader (AWS | Azure | GCP), events are logged based on the files created in the underlying storage.
This article shows you how to add the path and filename of each ingested file as a new column in the output DataFrame.
One use case for this is auditing. When files are ingested into a partitioned folder structure, there is often useful metadata, such as the timestamp, that can be extracted from the path for auditing purposes.
For example, assume a file path and filename of 2020/2021-01-01/file1_T191634.csv.
From this path you can apply custom UDFs and use regular expressions to extract details like the date (2021-01-01) and the timestamp (T191634).
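As an illustration, Spark's built-in regexp_extract function can pull both values out of such a path with regular expressions. The following is a minimal sketch rather than part of the original example: the sample DataFrame and the fileDate and fileTime column names are hypothetical, and the patterns should be adjusted to your own folder layout.

%scala

import org.apache.spark.sql.functions.{col, regexp_extract}
import spark.implicits._

// Illustrative only: parse the date and timestamp out of the example path.
val sample = Seq("2020/2021-01-01/file1_T191634.csv").toDF("filePath")

val parsed = sample
  .withColumn("fileDate", regexp_extract(col("filePath"), "(\\d{4}-\\d{2}-\\d{2})", 1))
  .withColumn("fileTime", regexp_extract(col("filePath"), "_(T\\d{6})", 1))

The same expressions can be applied to the filePath column created by the streaming example below.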
The following example code uses input_file_name() to get the path and filename for every row and writes it to a new column named filePath.
%scala

import org.apache.spark.sql.functions.input_file_name

// Stream files with Auto Loader and record each row's source path in filePath.
val df = spark.readStream.format("cloudFiles")
  .schema(schema)
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.region", "ap-south-1")
  .load("path")
  .withColumn("filePath", input_file_name())
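As written, the streaming DataFrame has been defined but the query has not started; it still needs a sink and a checkpoint location. A minimal sketch, assuming hypothetical Delta output and checkpoint paths:

%scala

// Minimal sketch: persist the rows, including the filePath audit column.
// The output and checkpoint locations below are hypothetical placeholders.
df.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/file-audit/_checkpoint")
  .start("/tmp/file-audit/output")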