How to populate or update columns in an existing Delta table

Learn how to populate or update columns in an existing Delta table.

Written by Adam Pavlacka

Last published at: May 10th, 2022

Problem

You have an existing Delta table, with a few empty columns. You need to populate or update those columns with data from a raw Parquet file.

Solution

In this example, there is a customers table, which is an existing Delta table. It has an address column with missing values. The updated data exists in Parquet format.
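
If you want to reproduce the example end to end, the following is a minimal, hypothetical setup; the table name customers, the column names customerId and address, and the path /path/to/raw-file are assumptions made for illustration:

    %python
    
    # Hypothetical setup: a Delta table with some missing addresses,
    # and a raw Parquet file containing the updated records.
    spark.createDataFrame(
        [(1, None), (2, "12 First St")],
        "customerId INT, address STRING"
    ).write.format("delta").saveAsTable("customers")
    
    spark.createDataFrame(
        [(1, "34 Second Ave"), (3, "56 Third Blvd")],
        "customerId INT, address STRING"
    ).write.parquet("/path/to/raw-file")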

  1. Create a DataFrame from the Parquet file using an Apache Spark API statement:
    %python
    
    updatesDf = spark.read.parquet("/path/to/raw-file")
  2. View the contents of the updatesDf DataFrame:
    %python
    
    display(updatesDf)
  3. Create a temporary view from the updatesDf DataFrame. In this example, it is named updates.
    %python
    
    updatesDf.createOrReplaceTempView("updates")
  4. Check the contents of the updates view, and compare it to the contents of the customers table:
    %python
    
    display(spark.table("updates"))
    display(spark.table("customers"))
  5. Use the MERGE INTO statement to merge the data from the updates view into the original customers table (a quick check of the result follows these steps).
    %sql
    
    MERGE INTO customers
    USING updates
    ON customers.customerId = updates.customerId
    WHEN MATCHED THEN
      UPDATE SET address = updates.address
    WHEN NOT MATCHED THEN
      INSERT (customerId, address) VALUES (updates.customerId, updates.address)
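
After the merge completes, you can sanity-check the result. A minimal check, assuming the customers table from this example:

    %python
    
    display(spark.table("customers"))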

Here, customers is the original Delta table that has an address column with missing values. updates is the temporary view created from the DataFrame updatesDf, which in turn was created by reading the raw Parquet file. The address column of the original Delta table is populated with the values from updates, overwriting any existing values in the address column.

If updates contains customers that are not already in the customers table, then the command adds these new customer records.
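
If you prefer to stay in Python, the same merge can be written with the Delta Lake Python API. The following is a minimal sketch, not the exact code from this article; it assumes the delta-spark package is available and that updatesDf is the DataFrame created in step 1:

    %python
    
    from delta.tables import DeltaTable
    
    # Load the target Delta table by name and merge in the updates.
    customersTable = DeltaTable.forName(spark, "customers")
    
    (customersTable.alias("customers")
        .merge(
            updatesDf.alias("updates"),
            "customers.customerId = updates.customerId")
        .whenMatchedUpdate(set={"address": "updates.address"})
        .whenNotMatchedInsert(values={
            "customerId": "updates.customerId",
            "address": "updates.address"})
        .execute())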

For more examples of using MERGE INTO, see Merge Into (Delta Lake) (AWS | Azure | GCP).