from_json returns null in Apache Spark 3.0

Spark 3.0 and above cannot parse JSON arrays as structs; from_json returns null.

Written by shanmugavel.chandrakasu

Last published at: May 23rd, 2022

Problem

The from_json function is used to parse a JSON string and return a struct of values.

For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return.

%python

from pyspark.sql.functions import col, from_json
display(
  df.select(col('value'), from_json(col('value'), json_df_schema, {"mode" : "PERMISSIVE"}))
)

In this example, the dataframe contains a column “value”, with the contents [{“id”:”001”,”name”:”peter”}] and the schema is StructType(List(StructField(id,StringType,true),StructField(name,StringType,true))).

This works correctly on Spark 2.4 and below (Databricks Runtime 6.4 ES and below).

* id:
  "001"
* name:
  "peter"

This returns null values on Spark 3.0 and above (Databricks Runtime 7.3 LTS and above).

* id:
  null
* name:
  null

Cause

This occurs because Spark 3.0 and above cannot parse JSON arrays as structs.

You can confirm this by running from_json in FAILFAST mode.

%python

from pyspark.sql.functions import col, from_json
display(
  df.select(col('value'), from_json(col('value'), json_df_schema, {"mode" : "FAILFAST"}))
)

This returns an error message that defines the root cause.

Caused by: RuntimeException: Parsing JSON arrays as structs is forbidden

Solution

You must pass the schema as ArrayType instead of StructType in Databricks Runtime 7.3 LTS and above.

%python

from pyspark.sql.types import StringType, ArrayType, StructType, StructField
schema_spark_3 = ArrayType(StructType([StructField("id",StringType(),True),StructField("name",StringType(),True)]))


from pyspark.sql.functions import col, from_json
display(
  df.select(col('value'), from_json(col('value'), schema_spark_3, {"mode" : "PERMISSIVE"}))
)

In this example code, the previous StructType schema is enclosed in ArrayType and the new schema is used with from_json.

This parses the JSON string correctly and returns the expected values.