Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table.
One common use case is to compare two versions of a Delta table in order to identify what changed.
For more details on time travel, please review the Delta Lake time travel documentation.
You can use a SQL
SELECT query to identify all differences between two versions of a Delta table.
You need to know the name of the table and the version numbers of the snapshots you want to compare.
select * from <table-name>@v<version-number> except all select * from <table-name>@v<version-number>
For example, if you had a table named “schedule” and you wanted to compare version 2 with the original version, your query would look like this:
select * from schedule@v2 except all select * from schedule@v0
You can use a scala query to retrieve a list of files that were added to a specific version of the Delta table.
display(spark.read.json("dbfs:/<path-to-delta-table>/_delta_log/00000000000000000002.json").where("add is not null").select("add.path"))
In this example, we are getting a list of all files that were added to version 2 of the Delta table.
00000000000000000002.json contains the list of all files in version 2.
After reading in the full list, we are excluding files that already existed, so the displayed list only includes files added to version 2.