How to populate or update columns in an existing Delta table
Problem
You have an existing Delta table with a few empty columns. You need to populate or update those columns with data from a raw Parquet file.
Solution
In this example, customers is an existing Delta table. It has an address column with missing values. The updated data exists in Parquet format.
1. Create a DataFrame from the Parquet file using an Apache Spark API statement:

updatesDf = spark.read.parquet("/path/to/raw-file")
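Before continuing, you can confirm that the DataFrame contains the columns the merge in the final step expects (customerId and address in this example); a minimal check:

updatesDf.printSchema()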
2. View the contents of the updatesDf DataFrame:

display(updatesDf)
3. Create a temporary view from the updatesDf DataFrame. In this example, it is named updates.

updatesDf.createOrReplaceTempView("updates")
4. Check the contents of the updates view, and compare it to the contents of customers:

display(customers)
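To see the updates side of the comparison, you can query the temporary view as well; a minimal sketch:

display(spark.sql("SELECT * FROM updates"))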
5. Use the MERGE INTO statement to merge the data from the updates view into the original customers table.

MERGE INTO customers
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (customerId, address) VALUES (updates.customerId, updates.address)
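If you are working in Python rather than SQL, the Delta Lake Python API provides an equivalent merge. The following is a minimal sketch that assumes the customers table is registered in the metastore; adjust the lookup (for example, using DeltaTable.forPath instead) to match your environment:

from delta.tables import DeltaTable

# Load the existing Delta table by name (assumes a metastore-registered table).
customersTable = DeltaTable.forName(spark, "customers")

# Same merge as the SQL statement above: update matched rows, insert new ones.
(customersTable.alias("customers")
    .merge(updatesDf.alias("updates"),
           "customers.customerId = updates.customerId")
    .whenMatchedUpdate(set={"address": "updates.address"})
    .whenNotMatchedInsert(values={
        "customerId": "updates.customerId",
        "address": "updates.address"})
    .execute())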
Here, customers is the original Delta table that has an address column with missing values. updates is the view created from the DataFrame updatesDf, which is built by reading data from the raw file. The address column of the original Delta table is populated with the values from updates, overwriting any existing values in the address column.
If updates contains customers that are not already in the customers table, then the command adds these new customer records.
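After the merge, you can spot-check that the previously missing addresses are now populated. Assuming the missing values are stored as NULL, an illustrative query:

display(spark.sql("SELECT * FROM customers WHERE address IS NULL"))

An empty result indicates that every customer now has an address.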
For more examples of using MERGE INTO, see Merge Into (Delta Lake).