Write a DataFrame with missing columns to a Redshift table

Problem

When writing to Redshift tables, if the target table has more columns than the source Apache Spark DataFrame you may get a copy error.

The COPY failed with error: [Amazon][Amazon Redshift] (1203) Error occurred while trying to execute a query: ERROR: Load into table table-name failed. Check the 'stl_load_errors' system table for details. “1203 - Input data had fewer columns than were defined in the DDL”

Cause

The source Spark DataFrame and the target Redshift table need to have the same number of columns.

Solution

Option 1: Update the notebook or job operation to add the missing columns in the spark DataFrame.

You can populate the new columns with null values if there is no data, or with actual values if there is new data that needs to be written to the target Redshift table.

This option requires manual intervention and can become time consuming if there are a large number of notebooks or jobs that need to be modified, or if new columns are added to the target on a regular basis.

Option 2: Use the AWS Redshift data conversion parameter FILLRECORD.

When FILLRECORD is used, it allows data files to be loaded when contiguous columns are missing at the end of some of the records. The missing columns are filled with either zero-length strings or null values, as appropriate for the data types of the columns in question.

FILLRECORD can be specified using extracopyoptions while performing the df.write operation.

df.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3n://path/for/temp/data") \
  .option("extracopyoptions", "FILLRECORD") \
  .save()