Load special characters with Spark-XML

Problem

You have special characters in your source files and are using the open source Spark-XML library.

The special characters do not render correctly. For example, “CLU®” is rendered as “CLU�”.

Cause

Spark-XML decodes files as UTF-8 by default. If your XML files use a different character set, any non-ASCII characters are decoded incorrectly.
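You can reproduce the symptom in plain Python. The `®` character is the single byte `0xAE` in ISO-8859-1, which is not a valid byte sequence in UTF-8, so a UTF-8 decoder that substitutes unmappable bytes produces the replacement character `�` (this sketch uses `errors="replace"` to mimic the symptom; it is an illustration of the encoding mismatch, not Spark-XML's internal decoding code):

```python
# '®' is the single byte 0xAE in ISO-8859-1.
raw = "CLU®".encode("ISO-8859-1")  # b'CLU\xae'

# 0xAE is not valid UTF-8, so decoding as UTF-8 yields
# the replacement character U+FFFD ("CLU�").
garbled = raw.decode("utf-8", errors="replace")
print(garbled)  # CLU�

# Decoding with the correct character set recovers the text.
correct = raw.decode("ISO-8859-1")
print(correct)  # CLU®
```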

Solution

You must specify the character set of your XML files when reading the data. Use the charset option to do this when reading an XML file with Spark-XML.

For example, if your source file is using ISO-8859-1:

dfResult = (
    spark.read.format("xml")
    .schema(customSchema)
    .option("rowTag", "Entity")
    .option("charset", "ISO-8859-1")
    .load("/<path-to-xml>/<sample-file>.xml")
)
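If you are unsure which character set a file uses, a quick heuristic is to try decoding its raw bytes against a list of candidate encodings. The helper below is a hypothetical sketch, not part of Spark-XML; note that ISO-8859-1 can decode any byte sequence, so it should come last in the candidate list:

```python
def detect_candidate_encoding(data: bytes,
                              candidates=("utf-8", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the bytes cleanly.

    Hypothetical helper for illustration only. ISO-8859-1 maps every
    byte to a character, so it always succeeds and must be tried last.
    """
    for encoding in candidates:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# b'CLU\xc2\xae' is valid UTF-8; b'CLU\xae' is not.
print(detect_candidate_encoding(b"CLU\xc2\xae"))  # utf-8
print(detect_candidate_encoding(b"CLU\xae"))      # ISO-8859-1
```

Run this against the first few kilobytes of a sample file, then pass the detected encoding to the charset option.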

Review the Spark-XML README file for more information on supported options.