Field name sorting changes in Apache Spark 3.x

Starting with Spark 3.0.0, rows created from named arguments do not have field names sorted alphabetically.

Last published at: April 21st, 2023

Problem

When you apply a map transformation to an RDD on Databricks Runtime 9.1 LTS and above, the field order of the resulting schema differs from the order produced by the same map transformation on Databricks Runtime 7.3 LTS.

Cause

Databricks Runtime 9.1 LTS and above include Apache Spark 3.x. Starting with Spark 3.0.0, rows created from named arguments do not have field names sorted alphabetically. Instead, the fields keep the order in which they were entered.

Solution

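To enable Spark 2.x style row sorting, set the PYSPARK_ROW_FIELD_SORTING_ENABLED environment variable to true in your cluster configuration (AWS | Azure | GCP). The value must be the same on the driver and all executors; a mismatch can cause failures or incorrect results.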

PYSPARK_ROW_FIELD_SORTING_ENABLED=true

For Python versions earlier than 3.6, field names are always sorted alphabetically. This is the only option, because those versions do not preserve the order of keyword arguments.

Warning

This workaround is deprecated and will be removed in a future version of Spark.
