Problem
The from_json function is used to parse a JSON string and return a struct of values.
For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return.
%python from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(col('value'), json_df_schema, {"mode" : "PERMISSIVE"})) )
In this example, the dataframe contains a column “value”, with the contents [{“id”:”001”,”name”:”peter”}] and the schema is StructType(List(StructField(id,StringType,true),StructField(name,StringType,true))).
This works correctly on Spark 2.4 and below (Databricks Runtime 6.4 ES and below).
* id: "001" * name: "peter"
This returns null values on Spark 3.0 and above (Databricks Runtime 7.3 LTS and above).
* id: null * name: null
Cause
This occurs because Spark 3.0 and above cannot parse JSON arrays as structs.
You can confirm this by running from_json in FAILFAST mode.
%python from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(col('value'), json_df_schema, {"mode" : "FAILFAST"})) )
This returns an error message that defines the root cause.
Caused by: RuntimeException: Parsing JSON arrays as structs is forbidden
Solution
You must pass the schema as ArrayType instead of StructType in Databricks Runtime 7.3 LTS and above.
%python from pyspark.sql.types import StringType, ArrayType, StructType, StructField schema_spark_3 = ArrayType(StructType([StructField("id",StringType(),True),StructField("name",StringType(),True)])) from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(col('value'), schema_spark_3, {"mode" : "PERMISSIVE"})) )
In this example code, the previous StructType schema is enclosed in ArrayType and the new schema is used with from_json.
This parses the JSON string correctly and returns the expected values.