How to Sort S3 Files by Modification Time in Databricks Notebooks

Written by Adam Pavlacka

Last published at: May 9th, 2022

Problem

When you use the dbutils utility to list the files in an S3 location, the files are returned in no particular order. dbutils doesn’t provide a method to sort the files by modification time, and it doesn’t list a modification time either.

Solution

Use the Hadoop filesystem API to sort the S3 files, as shown here:

%scala

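import org.apache.hadoop.fs._
// Mount point (or any other path) that resolves to the S3 location.
val path = new Path("/mnt/abc")
val fs = path.getFileSystem(spark.sessionState.newHadoopConf)
// List the entries directly under the path and sort them oldest first.
val inodes = fs.listStatus(path).sortBy(_.getModificationTime)
// Skip entries whose modification time is 0, then print (path, modification time in epoch milliseconds, size in bytes).
inodes.filter(_.getModificationTime > 0).map(t => (t.getPath, t.getModificationTime, t.getLen)).foreach(println)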

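This code uses the Hadoop filesystem’s listStatus method to list the files in the S3 location, then sorts the results by modification time in ascending order (oldest first). The modification time is reported in milliseconds since the Unix epoch.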
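
If you need the newest files first, or want readable timestamps instead of epoch milliseconds, the same approach can be adapted. The following is a minimal sketch to run in a %scala cell, not a definitive implementation. The helper names listNewestFirst and describe are hypothetical and are not part of dbutils or the Hadoop API. The sketch assumes you only want files, not directories.

```scala
import java.time.Instant
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Hypothetical helper: lists the files directly under `dir`, newest first.
// Directories are skipped.
def listNewestFirst(fs: FileSystem, dir: Path): Array[FileStatus] = {
  fs.listStatus(dir)
    .filter(_.isFile)
    .sortBy(_.getModificationTime)(Ordering[Long].reverse)
}

// Hypothetical helper: renders one entry as
// "path <tab> modification time (ISO-8601, UTC) <tab> size in bytes".
def describe(status: FileStatus): String = {
  val modified = Instant.ofEpochMilli(status.getModificationTime)
  s"${status.getPath}\t${modified}\t${status.getLen}"
}
```

With the path and fs values defined in the solution above, you could call listNewestFirst(fs, path).map(describe).foreach(println) to print the listing.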