Field name sorting changes in Apache Spark 3.x

Starting with Spark 3.0.0, rows created from named arguments do not have field names sorted alphabetically.

Written by sergios.lalas

Last published at: April 21st, 2023

Problem

When you apply a map transformation to an RDD on Databricks Runtime 9.1 LTS and above, the field order of the resulting schema differs from the order produced by the same map transformation on Databricks Runtime 7.3 LTS.

Cause

Databricks Runtime 9.1 LTS and above incorporate Apache Spark 3.x. Starting with Spark 3.0.0, rows created from named arguments do not have field names sorted alphabetically. Instead, the fields keep the order in which they were entered.
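
The difference between the two orderings can be sketched in plain Python. This is a minimal model for illustration only: field_names is a hypothetical helper, not part of PySpark, and it does not reproduce how Row is actually implemented. It assumes Python 3.6 or above, where keyword arguments keep the order in which they were passed.

```python
def field_names(sort_fields, **kwargs):
    """Return field names in the order a row built from named arguments would use.

    Hypothetical helper for illustration; not PySpark's implementation.
    """
    names = list(kwargs)  # keyword arguments keep call order on Python 3.6+
    return sorted(names) if sort_fields else names


# Spark 2.x style: field names sorted alphabetically
spark2_order = field_names(True, name="Alice", age=30, city="Athens")

# Spark 3.x style: field names kept as entered
spark3_order = field_names(False, name="Alice", age=30, city="Athens")

print(spark2_order)  # ['age', 'city', 'name']
print(spark3_order)  # ['name', 'age', 'city']
```

Code that reads row values by position, or that relies on the inferred schema having alphabetical columns, is the kind most likely to notice this change.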

Solution

To enable Spark 2.x style row sorting, set the environment variable PYSPARK_ROW_FIELD_SORTING_ENABLED to true in your cluster configuration (AWS | Azure | GCP).

PYSPARK_ROW_FIELD_SORTING_ENABLED=true

For Python versions earlier than 3.6, field names are always sorted alphabetically; no other ordering is available.
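
To confirm that the setting reached a Python process, you could inspect the environment from a notebook cell. The following is a rough sketch, not an official check: row_sorting_requested is a hypothetical helper, and treating the value as case-insensitive with surrounding whitespace ignored is an assumption about how the flag is parsed.

```python
import os

FLAG = "PYSPARK_ROW_FIELD_SORTING_ENABLED"


def row_sorting_requested(env=None):
    """Return True if the flag is set to 'true' in the given environment mapping.

    Hypothetical helper; the case-insensitive comparison is an assumption.
    """
    env = os.environ if env is None else env
    return env.get(FLAG, "false").strip().lower() == "true"


# Checks only the process in which this code runs (for example, the driver)
print(row_sorting_requested())
```

This reports the value seen by the current process only. The Spark migration guide notes that the variable must be consistent on the driver and all executors, so a positive result here does not by itself confirm the executors.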

Warning

This workaround is deprecated and will be removed in a future version of Spark.