Job executions failing on clusters using Docker Container Services with MalformedInputException error

Specify the correct character encoding when reading the file and set the LANG and LC_ALL environment variables.

Written by G Yashwanth Kiran

Last published at: January 28th, 2025

Problem

While running jobs on clusters using Docker Container Services, you get the following error message.

 

```
java.nio.charset.MalformedInputException: Input length = 1.
```

 

Cause

The java.nio.charset part of the error indicates a character encoding issue: the Java runtime cannot decode the input data with the configured character set, or the input itself is improperly encoded. For example, if the input data is UTF-8 encoded but the runtime environment expects a different encoding such as ASCII or Latin-1, the mismatch causes decoding to fail.
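
For illustration, here is a minimal sketch in plain Python (not Spark) of the same class of failure: bytes written as UTF-8 cannot be decoded with a stricter single-byte character set such as ASCII. The JVM raises MalformedInputException in the analogous situation.

```python
# Minimal illustration (plain Python, not Spark): decoding UTF-8 bytes with the
# wrong character set fails, analogous to java.nio.charset.MalformedInputException.
data = "café".encode("utf-8")    # multi-byte UTF-8 sequence for "é"

print(data.decode("utf-8"))      # works: b'caf\xc3\xa9' -> 'café'

try:
    data.decode("ascii")         # fails: 0xc3 is not a valid ASCII byte
except UnicodeDecodeError as err:
    print(f"Decoding failed: {err}")
```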

 

The decoding failure can also occur when the incoming data format changes in a way the existing environment configuration does not handle, or when the Java runtime’s LANG settings are reset by chauffeur restarts (which, in turn, restart the driver).

 

Solution

Specify the correct character encoding when reading the file. 

 

Example with PySpark

```python
df = spark.read.option("charset", "<your-char-encoding-for-your-file>").csv("path/to/your/file.csv")
```

 

Example with Scala

```scala
val df = spark.read.option("charset", "<your-char-encoding-for-your-file>").csv("path/to/your/file.csv")
```

 

Additionally, set the following environment variables in the Environment variables section of your cluster configuration page to ensure consistent character encoding.

 

```
LANG=C.UTF-8
LC_ALL=C.UTF-8
```

 

Setting LANG=C.UTF-8 and LC_ALL=C.UTF-8 configures the cluster's default locale to UTF-8 encoding, which helps resolve character encoding and malformed input issues in Java processes on clusters using Docker Container Services.
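
To confirm the variables took effect on the driver, a quick check from a notebook cell could look like the following sketch, which uses only the Python standard library; the expected values assume the settings above.

```python
import locale
import os

# Check the locale-related environment variables the driver process inherits.
print(os.environ.get("LANG"))        # expected: C.UTF-8
print(os.environ.get("LC_ALL"))      # expected: C.UTF-8

# Python's preferred encoding should also report UTF-8 once the locale is set.
print(locale.getpreferredencoding())
```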

 

Preventative measures

  • Always specify the character encoding when reading files, especially if you are unsure about the format of the incoming data (see the sketch after this list).
  • Regularly review and update your cluster configurations to ensure they are compatible with the data being processed.
  • Monitor your Databricks environment for any changes in the incoming data format and adjust your configurations accordingly.
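
For the first point, here is a minimal sketch of a convenience wrapper. Note that read_csv_with_encoding is a hypothetical helper, not a Spark or Databricks API, and it assumes an active SparkSession named spark.

```python
# Hypothetical helper (not a Spark or Databricks API): make callers state the
# file's encoding explicitly whenever they read CSV data.
def read_csv_with_encoding(spark, path, encoding, **options):
    return (
        spark.read
        .option("charset", encoding)   # same option used in the examples above
        .options(**options)
        .csv(path)
    )

# Example usage in a notebook, assuming `spark` is the active SparkSession:
# df = read_csv_with_encoding(spark, "path/to/your/file.csv", "UTF-8", header=True)
```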