Convert nested JSON to a flattened DataFrame

How to convert a flattened DataFrame to nested JSON using a nested case class.

Written by Adam Pavlacka

Last published at: May 20th, 2022

This article shows you how to flatten nested JSON, using only $"column.*" and explode methods.

Sample JSON file

Pass the sample JSON string to the reader.


val json ="""
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
                                        { "id": "1001", "type": "Regular" },
                                        { "id": "1002", "type": "Chocolate" },
                                        { "id": "1003", "type": "Blueberry" },
                                        { "id": "1004", "type": "Devil's Food" }
                        { "id": "5001", "type": "None" },
                        { "id": "5002", "type": "Glazed" },
                        { "id": "5005", "type": "Sugar" },
                        { "id": "5007", "type": "Powdered Sugar" },
                        { "id": "5006", "type": "Chocolate with Sprinkles" },
                        { "id": "5003", "type": "Chocolate" },
                        { "id": "5004", "type": "Maple" }

Convert to DataFrame

Add the JSON string as a collection type and pass it as an input to spark.createDataset. This converts it to a DataFrame. The JSON reader infers the schema automatically from the JSON string.

This sample code uses a list collection type, which is represented as json :: Nil. You can also use other Scala collection types, such as Seq (Scala Sequence).


import org.apache.spark.sql.functions._
import spark.implicits._
val DF= :: Nil))

Extract and flatten

Use $"column.*" and explode methods to flatten the struct and array types before displaying the flattened DataFrame.


display($"id" as "main_id",$"name",$"batters",$"ppu",explode($"topping")) // Exploding the topping column using explode as it is an array type
        .withColumn("topping_id",$"") // Extracting topping_id from col using DOT form
        .withColumn("topping_type",$"col.type") // Extracting topping_tytpe from col using DOT form
        .select($"*",$"batters.*") // Flattened the struct type batters tto array type which is batter
        .withColumn("batter_id",$"") // Extracting batter_id from col using DOT form
        .withColumn("battter_type",$"col.type") // Extracting battter_type from col using DOT form


Make sure to use $ for all column names, otherwise you may get an error message: overloaded method value select with alternatives.

Example notebook

Run the Nested JSON to DataFrame example notebook to view the sample code and results.

Was this article helpful?