Problem
Spark job fails with an exception containing the message:

Invalid UTF-32 character 0x1414141(above 10ffff) at char #1, byte #7)
at org.apache.spark.sql.catalyst.json.JacksonParser.parse
Cause
The JSON data source reader can automatically detect the encoding of input JSON files from a byte order mark (BOM) at the beginning of each file.
However, a BOM is not mandated by the Unicode standard, and RFC 7159 prohibits it: section 8.1 says, "Implementations MUST NOT add a byte order mark to the beginning of a JSON text."
As a consequence, Spark cannot always detect the charset correctly, and reading the JSON file fails.
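The gap is easy to see in a notebook. As a minimal sketch using only the Java standard charsets: the UTF-16 encoder emits a BOM, while UTF-16LE does not, so a UTF-16LE file gives the auto-detector nothing to key on.

%scala
import java.nio.charset.StandardCharsets

// Java's "UTF-16" charset prepends a BOM; "UTF-16LE" does not.
val withBom    = "{}".getBytes(StandardCharsets.UTF_16)
val withoutBom = "{}".getBytes(StandardCharsets.UTF_16LE)

println(withBom.map(b => f"$b%02X").mkString(" "))    // FE FF 00 7B 00 7D
println(withoutBom.map(b => f"$b%02X").mkString(" ")) // 7B 00 7D 00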
Solution
To resolve the issue, disable the charset auto-detection mechanism and set the charset explicitly using the encoding option:
%scala
.option("encoding", "UTF-16LE")
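In context, a complete read might look like the following sketch. The path and the UTF-16LE charset are placeholders; substitute the encoding your files are actually written in. Note that in per-line (non-multiLine) mode, Spark also expects an explicit lineSep for encodings other than UTF-8.

%scala
// Sketch only: the path and charset below are assumptions, not fixed values.
val df = spark.read
  // Name the charset up front so BOM-based auto-detection is skipped.
  .option("encoding", "UTF-16LE")
  // Per-line mode with a non-UTF-8 encoding needs an explicit line separator.
  .option("lineSep", "\n")
  .json("/path/to/file.json")

df.show()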