How to Sort S3 files By Modification Time in Databricks Notebooks
Problem
When you use the dbutils utility to list the files in an S3 location, the files are returned in no particular order. dbutils does not provide a method to sort the files by modification time, nor does it expose a modification time for each file.
Solution
Use the Hadoop filesystem API to sort the S3 files, as shown here:
import org.apache.hadoop.fs._

// Path to the mounted S3 location
val path = new Path("/mnt/abc")
val fs = path.getFileSystem(spark.sessionState.newHadoopConf)

// List the files and sort them by modification time (ascending)
val inodes = fs.listStatus(path).sortBy(_.getModificationTime)

// Print the path, modification time, and size of each file
inodes
  .filter(_.getModificationTime > 0)
  .map(t => (t.getPath, t.getModificationTime, t.getLen))
  .foreach(println)
This code uses the Hadoop filesystem's listStatus method to retrieve a FileStatus entry for each S3 file, then sorts the entries by their modification time with sortBy.
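A common variation of this pattern is picking out only the most recently modified file, for example to process the latest upload. The sketch below assumes the same mounted path as above; the descending sort is achieved by negating the modification time:

import org.apache.hadoop.fs._

val path = new Path("/mnt/abc")
val fs = path.getFileSystem(spark.sessionState.newHadoopConf)

// Hypothetical example: find the newest file under the path.
// Sorting by the negated modification time puts the most recent file first;
// headOption returns None if the directory is empty.
val newest = fs.listStatus(path)
  .filter(_.isFile)
  .sortBy(-_.getModificationTime)
  .headOption

newest.foreach(s => println(s"${s.getPath} modified at ${s.getModificationTime}"))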