Job fails when using Spark-Avro to write decimal values to AWS Redshift
Problem
In Databricks Runtime 5.x and above, when writing decimal values to Amazon Redshift with Spark-Avro as the default temp file format, the write operation either fails with the following exception:
Error (code 1207) while loading data into Redshift: "Invalid digit, Value '"', Pos 0, Type: Decimal"
or succeeds but writes nulls in place of the decimal values.
Cause
When writing to Redshift, data is first staged in a temp folder in S3 before being loaded into Redshift. The default format for this temp data between Apache Spark and Redshift is Spark-Avro. However, Spark-Avro stores decimal values as binary, which Redshift interprets as empty strings or nulls.
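As an illustration of the encoding mismatch (not part of the fix), the Avro Java API shows that a decimal is only a logical type annotating the primitive bytes type, so the staged files hold binary payloads rather than the textual digits that the Redshift COPY command expects. The precision and scale below are arbitrary example values.

import org.apache.avro.{LogicalTypes, Schema}

// An Avro decimal(10,2) is the primitive "bytes" type annotated with a decimal logical type.
val decimalSchema: Schema =
  LogicalTypes.decimal(10, 2).addToSchema(Schema.create(Schema.Type.BYTES))

// Prints a schema similar to:
// {"type":"bytes","logicalType":"decimal","precision":10,"scale":2}
println(decimalSchema.toString(true))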
Solution
Change the temp file format to CSV using the tempformat option. You can use this sample Scala code:
// Create sample data. toDS requires the implicits of the active SparkSession.
import spark.implicits._

case class createDec(value: BigDecimal)
val df = Seq(createDec(BigDecimal("45.24"))).toDS

// Write to Redshift, staging the temp data in S3 as CSV instead of Avro.
(df.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("tempdir", tempDir)
  .option("dbtable", "testtable")
  .option("aws_iam_role", "your_aws_iam_role")
  .option("tempformat", "CSV")
  .mode("append")
  .save())
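CSV temp files store the decimal values as plain text, so Redshift parses them correctly; the trade-off is that the staged data in S3 is typically larger than with Avro. If that matters, the connector also supports a CSV GZIP value for tempformat, which compresses the temp files.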