A Spark job fails with an exception that contains the following message:
Invalid UTF-32 character 0x1414141(above 10ffff) at char #1, byte #7)
at org.apache.spark.sql.catalyst.json.JacksonParser.parse
The JSON data source reader can automatically detect the encoding of input JSON files from a byte order mark (BOM) at the beginning of each file. However, a BOM is not mandatory under the Unicode standard and is prohibited by RFC 7159. Section 8.1, for example, states:
“…Implementations MUST NOT add a byte order mark to the beginning of a JSON text.”
As a consequence, in some cases Spark cannot detect the charset correctly and therefore fails to read the JSON file.
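To illustrate why BOM-based detection can come up empty, here is a minimal sketch in plain Python. It is not Spark's or Jackson's actual detection logic; the helper `detect_bom` and the sample JSON text are hypothetical names chosen for this illustration. It only shows that the same UTF-16LE content is identifiable when a BOM is present and ambiguous when it is absent.

```python
import codecs

# BOM signatures, with UTF-32 checked before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data: bytes):
    """Return the charset named by a leading BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


text = '{"id": 1}'  # hypothetical sample document

# With a BOM, the charset is unambiguous.
with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
# JSON written without a BOM (as RFC 7159 requires) gives nothing to detect.
without_bom = text.encode("utf-16-le")

print(detect_bom(with_bom))     # UTF-16LE
print(detect_bom(without_bom))  # None
```

When the result is `None`, a reader has to either guess the charset from the byte pattern or be told it explicitly, and a wrong guess can surface as a decoding error like the one above.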