JSON reader parses values as null

Problem

You are attempting to read a JSON file.

You know the file has data in it, but the Apache Spark JSON reader is returning a null value.

Example code

You can use this example code to reproduce the problem.

  1. Create a test JSON file in DBFS.

    # Remove any previous copy of the test file, then write a single-record JSON file.
    dbutils.fs.rm("dbfs:/tmp/json/parse_test.txt")
    dbutils.fs.put("dbfs:/tmp/json/parse_test.txt",
    """
    {"data_flow":{"upstream":[{"$":{"source":"input"},"cloud_type":""},{"$":{"source":"File"},"cloud_type":{"azure":"cloud platform","aws":"cloud service"}}]}}
    """)
    
  2. Read the JSON file.

    jsontest = spark.read.option("inferSchema","true").json("dbfs:/tmp/json/parse_test.txt")
    display(jsontest)
    
  3. The result is a null value.

    jsontest results showing null value.

Cause

  • In Spark 2.4 and below, the JSON parser allows empty strings. Only certain data types, such as IntegerType, are treated as null when empty.
  • In Spark 3.0 and above, the JSON parser does not allow empty strings. An exception is thrown for all data types except BinaryType and StringType, as the sketch after this list demonstrates.
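
For example, this minimal sketch (using a hypothetical one-field record, not the file from this article) shows the difference for an IntegerType field:

from pyspark.sql.types import StructType, StructField, IntegerType

# Hypothetical record: an empty string where an integer is expected.
schema = StructType([StructField("count", IntegerType())])
rows = spark.sparkContext.parallelize(['{"count": ""}'])

df = spark.read.schema(schema).json(rows)
df.show()
# Spark 2.4 and below: the empty string is read as a null integer.
# Spark 3.0 and above: the conversion fails; the default PERMISSIVE mode
# returns the record as null, and FAILFAST mode raises the exception.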

For more information, review the Spark SQL Migration Guide.

Example code

The example data triggers the error because the same field, cloud_type, appears with two conflicting values.

The first cloud_type entry is an empty string. The second cloud_type entry contains nested data.

"cloud_type":""
"cloud_type":{"azure":"cloud platform","aws":"cloud service"}

Schema inference resolves cloud_type to a struct, based on the second entry. Because the JSON parser in Spark 3.0 and above does not allow an empty string for a non-string type, the first entry cannot be converted, and the record is returned as a null value.
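
You can confirm this by printing the inferred schema of the earlier read; cloud_type should come back as a struct built from the second entry, since the empty string contributes no type information during inference:

jsontest.printSchema()
# cloud_type should appear as a struct (for example, struct<aws:string,azure:string>),
# so the first entry's "" has no valid conversion and the record parses to null.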

Solution

Set the Spark configuration value spark.sql.legacy.json.allowEmptyString.enabled to True. This configures the Spark 3.0 JSON parser to allow empty strings.

You can set this configuration at the cluster level or the notebook level.
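
At the cluster level, for example, you can add the property to the cluster's Spark config; the notebook-level version is shown in the example code below.

spark.sql.legacy.json.allowEmptyString.enabled true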

Example code

spark.conf.set("spark.sql.legacy.json.allowEmptyString.enabled", True)

jsontest1 = spark.read.option("inferSchema","true").json("dbfs:/tmp/json/parse_test.txt")
display(jsontest1)

jsontest1 results showing the actual values.
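
As an optional check, you can extract the cloud_type entries from the nested structure (field names are from the test file above); with the legacy flag enabled, the first entry should parse as null rather than nulling the entire record.

jsontest1.selectExpr("data_flow.upstream.cloud_type").show(truncate=False)
# The first array element should be null; the second should contain the azure and aws values.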