Problem
When importing views from IBM Db2 using Apache Spark, you encounter the following error in the Spark driver logs or job failure details.
Caught java.io.CharConversionException ERRORCODE=-4220, SQLSTATE=null
Cause
The IBM Db2 JCC (JDBC) driver expects character column data to conform to the database's UTF-8 code page. If any column contains invalid or malformed UTF-8 byte sequences (for example, data with characters beyond the valid Unicode range, or bytes that were incorrectly encoded), the driver throws a SqlException wrapping a java.io.CharConversionException.
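The failure mode can be reproduced outside the driver. Strictly decoding a malformed UTF-8 byte sequence raises an error, while a lenient decoder substitutes the Unicode replacement character, which is analogous to the charsetDecoderEncoder=3 behavior described in the Solution below. A minimal Python sketch (the byte sequence is just an illustrative example of invalid UTF-8):

```python
# Sketch: strict vs. lenient UTF-8 decoding of malformed bytes.
# 0xED 0xA0 0x80 encodes a UTF-16 surrogate, which is invalid in strict UTF-8.
malformed = b"valid text \xed\xa0\x80 more text"

# Strict decoding fails, analogous to the JCC driver's default behavior:
try:
    malformed.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"strict decode failed: {exc.reason}")

# Lenient decoding replaces the invalid bytes with U+FFFD, analogous to
# running the driver with -Ddb2.jcc.charsetDecoderEncoder=3:
repaired = malformed.decode("utf-8", errors="replace")
print(repaired)
```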
Example stack trace
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: com.ibm.db2.jcc.am.SqlException: [jcc][t4][XX][XX][X.X.X] Caught java.io.CharConversionException. See attached Throwable for details. ERRORCODE=-4220, SQLSTATE=null
    at com.ibm.db2.jcc.am.fd.a(fd.java:731)
Solution
To handle extended or invalid characters more gracefully, configure your cluster with the following Spark configurations so they apply to all notebooks and jobs on that cluster. Setting db2.jcc.charsetDecoderEncoder=3 tells the JCC driver to substitute invalid byte sequences with the Unicode replacement character instead of throwing an exception, so a few malformed values no longer fail the entire query.
spark.driver.extraJavaOptions -Ddb2.jcc.charsetDecoderEncoder=3
spark.executor.extraJavaOptions -Ddb2.jcc.charsetDecoderEncoder=3
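Outside of a cluster-level configuration, the same properties can be passed per job. A sketch of a spark-submit invocation, assuming a hypothetical job script (the filename is a placeholder):

```shell
# Sketch: pass the JCC decoder option to both the driver and executor JVMs.
# my_db2_import_job.py is a placeholder for your actual Spark application.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3" \
  --conf "spark.executor.extraJavaOptions=-Ddb2.jcc.charsetDecoderEncoder=3" \
  my_db2_import_job.py
```

Both the driver and executor options are needed because JDBC reads are executed on the executors, while some metadata queries run on the driver.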
For details on how to apply Spark configs, refer to the “Spark configuration” section of the Compute configuration reference (AWS | Azure | GCP) documentation.
For additional reference on this Db2 driver property, see IBM's support article "JDBC throws java.io.CharConversionException".