Error when trying to parse XML in a shared mode cluster using the from_xml() function

Define an XML schema in a Data Definition Language (DDL) string first.

Written by Raghavan Vaidhyaraman

Last published at: January 17th, 2025

Problem

When you try to parse XML by passing in a StructType using the from_xml() function in a shared mode cluster, you receive an error.

 

Error message example

[PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: XXXXX
File <command-906237603249083>, line 11
      1 bookschema = StructType([
      2     StructField("_id", StringType(), True),
      3     StructField("author", StringType(), True),
      4     StructField("title", StringType(), True),
      5 ])
      7 parsed_df = (
      8     raw_df
      9     .withColumn("parsedxml", from_xml(raw_df.bookxmlstr, bookschema))
     10 )
---> 11 parsed_df.display()

 

Cause

Passing the schema as a StructType to the from_xml() function is not supported in shared cluster mode. On shared clusters, from_xml() expects the schema as a DDL string.

 

Solution

Define your desired XML schema in a Data Definition Language (DDL) string instead of a StructType, and then pass it to the from_xml() function. You can use the following example. Make sure the field, tag, and attribute names in the DDL align with your XML file.
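As a sketch, the StructType from the error example in the Problem section can be written as an equivalent DDL string. This assumes the default attribute prefix `_`, which is why the `id` attribute appears as `_id`:

```python
# The book StructType from the Problem section, expressed as an equivalent DDL string.
# The leading underscore on "_id" marks an XML attribute (default attributePrefix "_").
book_ddl = "_id STRING, author STRING, title STRING"

# Pass the DDL string instead of a StructType (raw_df as in the Problem section):
# parsed_df = raw_df.withColumn("parsedxml", from_xml(raw_df.bookxmlstr, book_ddl))
```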

 

Example

The following code snippet defines two XML strings, then creates a list of tuples where each tuple holds a row ID and an XML string. It then creates a PySpark DataFrame from the list of tuples and defines a schema as a DDL string. Finally, it parses the XML column and displays the parsed DataFrame.

 

from pyspark.sql.functions import from_xml

# SAMPLE XML STRINGS (replace with your own XML content)

xml_data_1 = """
<row id="<some-unique-id>">
    <childTag>Some Value</childTag>
    <anotherTag>Another Value</anotherTag>
</row>
"""

xml_data_2 = """
<row id="<another-unique-id>">
    <childTag>Different Value</childTag>
    <anotherTag>More Data</anotherTag>
</row>
"""


# Each tuple is (<some-id-string>, <xml-string>)

data_list = [
    ("<unique-id-1>", xml_data_1),
    ("<unique-id-2>", xml_data_2),
]

# Create a PySpark DataFrame from the list of tuples

raw_df = spark.createDataFrame(data_list, ["<id-column-name>", "<xml-column-name>"])


# Define the schema as a DDL string
# This describes the expected fields in the XML and their types.
# For the sample XML above, this could be: "_id STRING, childTag STRING, anotherTag STRING"

ddl_schema_string = "<field1> <datatype1>, <field2> <datatype2>, <field3> <datatype3>"

# Parse the XML column using 'from_xml'
parsed_df = (
    raw_df
    .withColumn("parsedxml", from_xml(raw_df["<xml-column-name>"], ddl_schema_string))
)

# Display the parsed DataFrame
parsed_df.display()