How to handle blob data contained in an XML file

Learn how to handle blob data contained in an XML file.

Written by Adam Pavlacka

Last published at: March 4th, 2022

If you log events in XML format, then every XML event is recorded as a base64 string. In order to run analytics on this data using Apache Spark, you need to use the spark-xml library and the BASE64Decoder API to transform the data for analysis.
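Before working through the full Spark pipeline, it helps to see what decoding one of these base64 strings looks like on its own. The sketch below uses `java.util.Base64`, the supported replacement for the JDK-internal `sun.misc.BASE64Decoder` used later in this article; the encoded string here is a made-up example, not taken from a real log:

```scala
import java.util.Base64
import java.nio.charset.StandardCharsets

object Base64DecodeExample extends App {
  // Example base64 payload, similar in shape to a blob embedded in an XML log event
  val encoded = "aW5zdGFuY2VJZCxzdGFydFRpbWU="

  // Decode the base64 bytes and interpret them as UTF-8 text
  val decoded = new String(Base64.getDecoder.decode(encoded), StandardCharsets.UTF_8)

  println(decoded) // prints "instanceId,startTime"
}
```

`sun.misc.BASE64Decoder` still works on older JVMs, but it is an internal API; `java.util.Base64` has been the standard since Java 8.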


You need to analyze base64-encoded strings from an XML-formatted log file using Spark. For example, the following file input.xml shows this type of format:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log [<!ENTITY % log SYSTEM "instance">%log;]>
<log systemID="MF2018" timeZone="UTC" timeStamp="Mon Mar 25 16:00:01 2018">
  <message source="message.log" time="Mon Mar 25 16:00:01 2018" type="sysMSG"><text/>


To parse the XML file:

  1. Load the XML data.
  2. Use the spark-xml library to create a raw DataFrame.
  3. Apply a base64 decoder on the blob column using the BASE64Decoder API.
  4. Save the decoded data in a text file (optional).
  5. Load the text file using the Spark DataFrame and parse it.
  6. Create the DataFrame as a Spark SQL table.

The following Scala code processes the file:

val xmlfile = "/mnt/<path>/input.xml"
val readxml = spark.read.format("com.databricks.spark.xml").option("rowTag","message").load(xmlfile)

val decoded = readxml.selectExpr("_source as source","_time as time","_type as type","detail.blob") //Displays the raw blob data

//Apply base64 decoder on every piece of blob data as shown below
val decodethisxmlblob = decoded.rdd
    .map(str => str(3).toString)
    .map(str1 => new String(new sun.misc.BASE64Decoder().decodeBuffer(str1)))

//Store it in a text file temporarily
decodethisxmlblob.saveAsTextFile("/mnt/vgiri/ec2blobtotxt")
//Parse the text file as required using Spark DataFrame.

val readAsDF = spark.sparkContext.textFile("/mnt/vgiri/ec2blobtotxt")
val header = readAsDF.first()
val finalTextFile = readAsDF.filter(row => row != header)

val finalDF = finalTextFile.toDF().selectExpr(
    "split(value, ',')[0] as instanceId",
    "split(value, ',')[1] as startTime",
    "split(value, ',')[2] as deleteTime",
    "split(value, ',')[3] as hours")

finalDF.show()
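The `split(value, ',')[n]` expressions in the `selectExpr` above behave like Scala's own `String.split`: each decoded line is comma-separated, and each index selects one column. The sketch below illustrates this with one row taken from the output table further down:

```scala
object SplitExample extends App {
  // One decoded line in the same comma-separated shape as the blob output
  val line = "i-027fa7ccda210b4f4,2/17/17T20:21,2/17/17T21:11,5"

  // Split on commas, exactly as split(value, ',') does in Spark SQL
  val fields = line.split(",")

  println(fields(0)) // prints "i-027fa7ccda210b4f4" (instanceId)
  println(fields(3)) // prints "5" (hours)
}
```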

The Spark code generates the following output:

18/03/24 22:54:31 INFO DAGScheduler: ResultStage 4 (show at SparkXMLBlob.scala:42) finished in 0.016 s
18/03/24 22:54:31 INFO DAGScheduler: Job 4 finished: show at SparkXMLBlob.scala:42, took 0.019120 s
18/03/24 22:54:31 INFO SparkContext: Invoking stop() from shutdown hook
| instanceId        | startTime   | deleteTime  |hours|
|i-027fa7ccda210b4f4|2/17/17T20:21|2/17/17T21:11|    5|
|i-07cd7100e3f54bf6a|2/17/17T20:19|2/17/17T21:11|    4|
|i-0a2c4adbf0dc2551c|2/17/17T20:19|2/17/17T21:11|    2|
|i-0b40b16236388973f|2/17/17T20:18|2/17/17T21:11|    6|
|i-0cfd880722e15f19e|2/17/17T20:18|2/17/17T21:11|    2|
|i-0cf0c73efeea14f74|2/17/17T16:21|2/17/17T17:11|    1|
|i-0505e95bfebecd6e6|2/17/17T16:21|2/17/17T17:11|    8|
