Problem
When you try to parse XML by passing a StructType schema to the from_xml() function
in a shared mode cluster, you receive an error.
Error message example
[PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: XXXXX
File <command-906237603249083>, line 11
1 bookschema = StructType([
2 StructField("_id", StringType(), True),
3 StructField("author", StringType(), True),
4 StructField("title", StringType(), True),
5 ])
7 parsed_df = (
8 raw_df
9 .withColumn("parsedxml", from_xml(raw_df.bookxmlstr, bookschema))
10 )
---> 11 parsed_df.display()
Cause
Passing a StructType schema to from_xml() is not supported in shared cluster mode.
Solution
Define your desired XML schema as a Data Definition Language (DDL) string instead of a StructType, and then pass it to the from_xml()
function. You can use the following example. Make sure the field, tag, and attribute names in the DDL align with your XML file.
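As an illustration, the StructType from the error message above could be expressed as a DDL string along the following lines. This is a minimal sketch: build_ddl is a hypothetical helper introduced here for illustration, not part of PySpark, and it assumes simple field names that need no quoting. The from_xml() call is left as a comment because it needs a running cluster.

```python
# Hypothetical helper (not a PySpark API): joins (field name, Spark SQL type)
# pairs into a comma-separated DDL schema string.
def build_ddl(fields):
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)

# The same three fields as the bookschema StructType in the error message.
book_fields = [
    ("_id", "STRING"),
    ("author", "STRING"),
    ("title", "STRING"),
]

book_ddl = build_ddl(book_fields)
print(book_ddl)

# On a cluster, the DDL string would take the place of the StructType argument:
# parsed_df = raw_df.withColumn("parsedxml", from_xml(raw_df.bookxmlstr, book_ddl))
```

Writing the string literal by hand works just as well; the helper only keeps the field list readable when a schema has many fields.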
Example
The following code snippet defines two sample XML strings, then creates a list of tuples where each tuple holds a row ID and an XML string. It then creates a PySpark DataFrame from the list of tuples and defines a schema as a DDL string. Finally, it parses the XML column and displays the parsed DataFrame.
# StructType imports are not needed because the schema is passed as a DDL string
from pyspark.sql.functions import from_xml
# SAMPLE XML STRINGS (replace with your own XML content)
xml_data_1 = """
<row id="<some-unique-id>">
<childTag>Some Value</childTag>
<anotherTag>Another Value</anotherTag>
</row>
"""
xml_data_2 = """
<row id="<another-unique-id>">
<childTag>Different Value</childTag>
<anotherTag>More Data</anotherTag>
</row>
"""
# Each tuple is (<some-id-string>, <xml-string>)
data_list = [
("<unique-id-1>", xml_data_1),
("<unique-id-2>", xml_data_2),
]
# Create a PySpark DataFrame from the list of tuples
raw_df = spark.createDataFrame(data_list, ["<id-column-name>", "<xml-column-name>"])
# Define schema as a DDL string
# This describes the expected fields in the XML and their types
ddl_schema_string = "<field1> <datatype1>, <field2> <datatype2>, <field3> <datatype3>"
# Parse the XML column using 'from_xml'
parsed_df = (
raw_df
.withColumn("parsedxml", from_xml(raw_df["<xml-column-name>"], ddl_schema_string))
)
# Display the parsed DataFrame
parsed_df.display()
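One way to check that the DDL field names line up with the XML is to list the attribute and tag names of a sample string before writing the schema. The sketch below uses only the Python standard library, so it runs without a cluster. It rests on several assumptions: candidate_field_names is a hypothetical helper, the sample id value book-1 is made up, attributes are assumed to map to fields prefixed with an underscore (matching the _id field in the error example), and every field is typed as STRING as a starting point.

```python
import xml.etree.ElementTree as ET

# Sample XML in the same shape as the example above (the id value is made up).
sample_xml = """
<row id="book-1">
  <childTag>Some Value</childTag>
  <anotherTag>Another Value</anotherTag>
</row>
"""

def candidate_field_names(xml_string, attribute_prefix="_"):
    """List candidate DDL field names for the root element of one XML record.

    Attributes are assumed to be prefixed (here with "_"); child elements
    keep their tag names. Nested elements are not handled in this sketch.
    """
    root = ET.fromstring(xml_string.strip())
    names = [f"{attribute_prefix}{attr}" for attr in root.attrib]
    for child in root:
        if child.tag not in names:
            names.append(child.tag)
    return names

field_names = candidate_field_names(sample_xml)

# Start with STRING for every field; adjust the types by hand as needed.
ddl_schema_string = ", ".join(f"{name} STRING" for name in field_names)
print(ddl_schema_string)
```

The resulting string can then be reviewed and passed to from_xml() in place of the placeholder DDL string.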