Load special characters with Spark-XML

Special characters are not rendering correctly. Use charset with Spark-XML.

Written by annapurna.hiriyur

Last published at: May 19th, 2022

Problem

You have special characters in your source files and are using the OSS library Spark-XML.

The special characters do not render correctly.

For example, “CLU®” is rendered as “CLU�”.

Cause

Spark-XML decodes XML files as UTF-8 by default. Your XML files use a different character set, so non-ASCII characters such as ® are decoded incorrectly.
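The effect can be reproduced outside Spark. The following is a minimal sketch in plain Python, assuming the source file was written as ISO-8859-1; the variable names are illustrative only and are not part of Spark-XML.

```python
# Illustrative only: shows why ISO-8859-1 bytes break when decoded as UTF-8.
text = "CLU\u00ae"                      # "CLU®"

# Bytes as they would sit in a file saved as ISO-8859-1 (assumed encoding).
raw = text.encode("iso-8859-1")         # ® becomes the single byte 0xAE

# Decoding with the wrong charset: 0xAE is not a valid UTF-8 start byte,
# so it is replaced with U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the file was actually written in.
right = raw.decode("iso-8859-1")

print(repr(wrong))
print(repr(right))
```

The byte for ® is valid in ISO-8859-1 but not on its own in UTF-8, which is why the reader substitutes the replacement character.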

Solution

Specify the character set of your XML files when you read the data.

Use the Spark-XML charset option to set it. The value must match the encoding the files were actually written in.

For example, if your source file is using ISO-8859-1:

%python

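# customSchema is assumed to be a StructType defined earlier in the notebook.
dfResult = (
    spark.read.format('xml')
    .schema(customSchema)
    .options(rowTag='Entity')
    .options(charset='ISO-8859-1')  # character set of the source file
    .load('/<path-to-xml>/<sample-file>.xml')
)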
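If you are not sure which character set a file uses, the XML declaration on its first line often states it. The following is a rough sketch of reading that declaration from the first bytes of a file. The helper name declared_xml_encoding is hypothetical and not part of Spark-XML. It assumes an ASCII-compatible encoding, and the declaration itself can be missing or wrong, so treat the result as a hint.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration, for example:
# <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._\-]+)["\']'
)


def declared_xml_encoding(head: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or `default`.

    `head` is the first few hundred bytes of the file. This is a
    hypothetical helper for illustration, not a Spark-XML function.
    """
    # Skip a UTF-8 byte order mark if one precedes the declaration.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    match = _XML_DECL.search(head)
    return match.group(1).decode("ascii") if match else default


sample = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<Entity>CLU\xae</Entity>'
print(declared_xml_encoding(sample))
```

The returned name can then be passed as the charset value when reading the file with Spark-XML.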

Review the Spark-XML README file for more information on supported options.