Problem
When you use the dbutils utility to list the files in an S3 location, the files are returned in no particular order. dbutils does not provide a method to sort the files by modification time, and it does not list a modification time either.
Solution
Use the Hadoop filesystem API to sort the S3 files, as shown here:
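%scala
import org.apache.hadoop.fs._

// Directory to list; replace with your own mount point.
val path = new Path("/mnt/abc")
val fs = path.getFileSystem(spark.sessionState.newHadoopConf)

// List the entries under the path and sort them by modification time, oldest first.
val inodes = fs.listStatus(path).sortBy(_.getModificationTime)

// Print (path, modification time, length) for entries with a non-zero modification time.
inodes.filter(_.getModificationTime > 0).map(t => (t.getPath, t.getModificationTime, t.getLen)).foreach(println)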
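This code calls the Hadoop filesystem’s listStatus method to retrieve the status of each entry under the path, then sorts the results by modification time, oldest first. It prints the path, modification time (in milliseconds since the Unix epoch), and length in bytes of each entry, skipping entries whose modification time is 0.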
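A common follow-up is to pick only the most recently modified files. The sketch below is one possible way to do that, not part of the original solution. It assumes a hypothetical FileEntry case class and a hypothetical newestFirst helper, and it uses made-up sample values so that it runs without an S3 mount. In a notebook, you could build the entries from fs.listStatus(path) using getPath.toString, getModificationTime, and getLen.

```scala
// Hypothetical holder for the three fields printed by the snippet above.
final case class FileEntry(path: String, modificationTime: Long, length: Long)

// Hypothetical helper: return the `n` most recently modified entries, newest first.
// Entries with a modification time of 0 are skipped, mirroring the filter above.
def newestFirst(entries: Seq[FileEntry], n: Int): Seq[FileEntry] =
  entries
    .filter(_.modificationTime > 0)
    .sortBy(_.modificationTime)(Ordering[Long].reverse)
    .take(n)

// Made-up sample data; times are milliseconds since the Unix epoch.
val sample = Seq(
  FileEntry("/mnt/abc/a.csv", 1700000000000L, 10L),
  FileEntry("/mnt/abc/b.csv", 1700000500000L, 20L),
  FileEntry("/mnt/abc/dir",   0L,             0L),
  FileEntry("/mnt/abc/c.csv", 1600000000000L, 30L)
)

// Print the two newest entries.
newestFirst(sample, 2).foreach(println)
```

Passing Ordering[Long].reverse to sortBy gives a descending sort without negating the timestamps, so the comparison stays correct for any Long value.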