Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table.
One common use case is to compare two versions of a Delta table in order to identify what changed.
For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP).
Identify all differences
You can use a SQL SELECT query to identify all differences between two versions of a Delta table.
You need to know the name of the table and the version numbers of the snapshots you want to compare.
%sql select * from <table-name>@v<version-number> except all select * from <table-name>@v<version-number>
For example, if you had a table named “schedule” and you wanted to compare version 2 with the original version, your query would look like this:
%sql select * from schedule@v2 except all select * from schedule@v0
Identify files added to a specific version
You can use a Scala query to retrieve a list of files that were added to a specific version of the Delta table.
%scala display(spark.read.json("dbfs:/<path-to-delta-table>/_delta_log/00000000000000000002.json").where("add is not null").select("add.path"))
In this example, we are getting a list of all files that were added to version 2 of the Delta table.
00000000000000000002.json contains the list of all files in version 2.
After reading in the full list, we are excluding files that already existed, so the displayed list only includes files added to version 2.