Convert nested JSON to a flattened DataFrame

How to convert a flattened DataFrame to nested JSON using a nested case class.

Last published at: May 20th, 2022

This article shows you how to flatten nested JSON, using only $"column.*" and explode methods.

Sample JSON file

Pass the sample JSON string to the reader.

%scala

val json ="""
{
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
                {
                        "batter":
                                [
                                        { "id": "1001", "type": "Regular" },
                                        { "id": "1002", "type": "Chocolate" },
                                        { "id": "1003", "type": "Blueberry" },
                                        { "id": "1004", "type": "Devil's Food" }
                                ]
                },
        "topping":
                [
                        { "id": "5001", "type": "None" },
                        { "id": "5002", "type": "Glazed" },
                        { "id": "5005", "type": "Sugar" },
                        { "id": "5007", "type": "Powdered Sugar" },
                        { "id": "5006", "type": "Chocolate with Sprinkles" },
                        { "id": "5003", "type": "Chocolate" },
                        { "id": "5004", "type": "Maple" }
                ]
}
"""

Convert to DataFrame

Add the JSON string as a collection type and pass it as an input to spark.createDataset. This converts it to a DataFrame. The JSON reader infers the schema automatically from the JSON string.

This sample code uses a list collection type, which is represented as json :: Nil. You can also use other Scala collection types, such as Seq (Scala Sequence).

%scala

import org.apache.spark.sql.functions._
import spark.implicits._
val DF= spark.read.json(spark.createDataset(json :: Nil))

Extract and flatten

Use $"column.*" and explode methods to flatten the struct and array types before displaying the flattened DataFrame.

%scala

display(DF.select($"id" as "main_id",$"name",$"batters",$"ppu",explode($"topping")) // Exploding the topping column using explode as it is an array type
        .withColumn("topping_id",$"col.id") // Extracting topping_id from col using DOT form
        .withColumn("topping_type",$"col.type") // Extracting topping_tytpe from col using DOT form
        .drop($"col")
        .select($"*",$"batters.*") // Flattened the struct type batters tto array type which is batter
        .drop($"batters")
        .select($"*",explode($"batter"))
        .drop($"batter")
        .withColumn("batter_id",$"col.id") // Extracting batter_id from col using DOT form
        .withColumn("battter_type",$"col.type") // Extracting battter_type from col using DOT form
        .drop($"col")
       )

Delete

Warning

Make sure to use $ for all column names, otherwise you may get an error message: overloaded method value select with alternatives.

Example notebook

Run the Nested JSON to DataFrame example notebook to view the sample code and results.

Databricks Help Center

Sample JSON file

Convert to DataFrame

Extract and flatten

Warning

Example notebook

Contact Us