Job fails when using Spark-Avro to write decimal values to AWS Redshift

Learn how to resolve job failures when writing decimal values to AWS Redshift with Spark-Avro.

Written by Adam Pavlacka

Last published at: May 31st, 2022

Problem

In Databricks Runtime 5.x and above, when you write decimal values to Amazon Redshift with Spark-Avro as the temp file format (the default), the write operation either fails with the exception:

Error (code 1207) while loading data into Redshift: "Invalid digit, Value '"', Pos 0, Type: Decimal"

or writes nulls in place of the decimal values.

Cause

When Spark writes to Redshift, the data is first staged in a temporary folder in S3 and then loaded into Redshift from there. Spark-Avro, the default format for this temp data, stores decimals in a binary representation, which Redshift interprets as empty strings or nulls.
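The following minimal sketch illustrates the mismatch. It assumes a Databricks notebook (so spark.implicits._ is available); the column name, values, and output path are illustrative, not part of the original article.

%scala

import spark.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Build a small DataFrame with a decimal column.
val decimals = Seq("45.24", "100.00").toDF("value")
  .select(col("value").cast(DecimalType(10, 2)).as("value"))

// Writing with the Avro format stores each decimal in a binary
// representation (Avro's decimal logical type), not as readable text.
decimals.write.format("avro").save("/tmp/decimal-avro-sketch")

// Redshift cannot parse that binary representation as a DECIMAL column
// when loading the temp files, so the load fails or produces nulls.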

Solution

Change the temp file format to CSV by setting the tempformat option, as in the following sample Scala code:

%scala

import spark.implicits._

// Create a sample Dataset containing a single decimal value.
case class DecimalRow(value: BigDecimal)
val df = Seq(DecimalRow(45.24)).toDS()

// Write to Redshift, overriding the default Avro temp format with CSV.
(df.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("tempdir", tempDir)
  .option("dbtable", "testtable")
  .option("aws_iam_role", "your_aws_iam_role")
  .option("tempformat", "CSV")
  .mode("append")
  .save())
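With tempformat set to CSV, the decimal values are written to the S3 temp files as plain text, which Redshift parses correctly when loading them into a DECIMAL column. The connector also accepts CSV GZIP as a tempformat value if you want the temp files compressed.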