You have an existing Delta Lake table with a few empty columns, and you need to populate or update those columns with data from a raw Parquet file.
In this example, there is a customers table, which is an existing Delta Lake table. It has an address column with missing values. The updated data exists in Parquet format.

Create a DataFrame from the Parquet file using an Apache Spark API statement:

```python
updatesDf = spark.read.parquet("/path/to/raw-file")
```
View the contents of the updatesDf DataFrame.
Create a table from the updatesDf DataFrame. In this example, it is named updates.
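One common way to make the DataFrame available to SQL under the name updates is to register it as a temporary view. The helper below is a minimal sketch of that step (the function name is hypothetical; createOrReplaceTempView is the standard PySpark DataFrame method):

```python
def register_updates(updates_df):
    """Expose the DataFrame to SQL under the name "updates".

    Assumes `updates_df` is a pyspark.sql.DataFrame, e.g. the result of
    spark.read.parquet(...). A temporary view exists only for the current
    Spark session; use DataFrame.write.saveAsTable for a persistent table.
    """
    updates_df.createOrReplaceTempView("updates")
```

After this call, SQL statements in the same session can refer to the data as updates, which is the table name the MERGE INTO statement uses.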
Check the contents of the updates table, and compare it to the contents of the customers table.
Use a MERGE INTO statement to merge the data from the updates table into the original customers table:

```sql
MERGE INTO customers
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (customerId, address) VALUES (updates.customerId, updates.address)
```
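The same merge can also be written with the Delta Lake Python API instead of SQL. The sketch below assumes the delta-spark package is installed and that customers is a table in the metastore; the function name is hypothetical:

```python
def apply_address_updates(spark, updates_df):
    # Imported inside the function so this sketch only requires the
    # delta-spark package when it is actually called (an assumption of
    # this example, not a requirement of the SQL approach above).
    from delta.tables import DeltaTable

    customers = DeltaTable.forName(spark, "customers")
    (customers.alias("customers")
        .merge(updates_df.alias("updates"),
               "customers.customerId = updates.customerId")
        # WHEN MATCHED: overwrite the address from the source row
        .whenMatchedUpdate(set={"address": "updates.address"})
        # WHEN NOT MATCHED: insert the new customer record
        .whenNotMatchedInsert(values={
            "customerId": "updates.customerId",
            "address": "updates.address",
        })
        .execute())
```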
Here, customers is the original Delta Lake table that has an address column with missing values, and updates is the table created from updatesDf, which is created by reading data from the raw file. The address column of the original Delta Lake table is populated with the values from updates, overwriting any existing values in the address column. If updates contains customers that are not already in the customers table, the command inserts these new customer records.
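To make the matched/not-matched behavior concrete, here is a small pure-Python model of the merge semantics described above. The sample customer IDs and addresses are invented for illustration; this is not Spark code:

```python
def merge_into(customers, updates):
    """Mimic the MERGE INTO semantics on plain Python dicts.

    customers: dict mapping customerId -> row dict (the target table)
    updates:   list of row dicts (the source table)
    """
    for row in updates:
        key = row["customerId"]
        if key in customers:
            # WHEN MATCHED: overwrite the address, even if one exists
            customers[key]["address"] = row["address"]
        else:
            # WHEN NOT MATCHED: insert the new customer record
            customers[key] = {"customerId": key, "address": row["address"]}
    return customers

# Hypothetical sample data: two existing customers with missing addresses
customers = {
    "c1": {"customerId": "c1", "address": None},
    "c2": {"customerId": "c2", "address": None},
}
updates = [
    {"customerId": "c1", "address": "1 Main St"},  # matches c1
    {"customerId": "c3", "address": "9 Oak Ave"},  # new customer
]
result = merge_into(customers, updates)
# c1 gains an address, c2 is untouched, c3 is inserted
```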
For more examples of using MERGE INTO, see Merge Into (Delta Lake).