from_json returns null in Apache Spark 3.0

Problem

The from_json function is used to parse a JSON string and return a struct of values.

For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return.

from pyspark.sql.functions import col, from_json
display(
  df.select(col('value'), from_json(col('value'), json_df_schema, {"mode" : "PERMISSIVE"}))
)

In this example, the dataframe contains a column “value”, with the contents [{“id”:”001”,”name”:”peter”}] and the schema is StructType(List(StructField(id,StringType,true),StructField(name,StringType,true))).

This works correctly on Spark 2.4 and below (Databricks Runtime 6.4 ES and below).

* id:
  "001"
* name:
  "peter"

This returns null values on Spark 3.0 and above (Databricks Runtime 7.3 LTS and above).

* id:
  null
* name:
  null

Cause

This occurs because Spark 3.0 and above cannot parse JSON arrays as structs.

You can confirm this by running from_json in FAILFAST mode.

from pyspark.sql.functions import col, from_json
display(
  df.select(col('value'), from_json(col('value'), json_df_schema, {"mode" : "FAILFAST"}))
)

This returns an error message that defines the root cause.

Caused by: RuntimeException: Parsing JSON arrays as structs is forbidden

Solution

You must pass the schema as ArrayType instead of StructType in Databricks Runtime 7.3 LTS and above.

from pyspark.sql.types import StringType, ArrayType, StructType, StructField
schema_spark_3 = ArrayType(StructType([StructField("id",StringType(),True),StructField("name",StringType(),True)]))

from pyspark.sql.functions import col, from_json
display(
  df.select(col('value'), from_json(col('value'), schema_spark_3, {"mode" : "PERMISSIVE"}))
)

In this example code, the previous StructType schema is enclosed in ArrayType and the new schema is used with from_json.

This parses the JSON string correctly and returns the expected values.