Sort order lost after writing partitioned data to Parquet using PySpark on Databricks Runtime 13.3 LTS

Set an Apache Spark configuration to preserve the sort order when writing partitioned data to Parquet.

Written by mounika.tarigopula

Last published at: October 23rd, 2024

Problem 

In Databricks Runtime 13.3 LTS through 15.3, when you use sortWithinPartitions to order the rows within each partition by specific columns, the sorted DataFrame displays correctly, but after you save it to Parquet and read it back, the sort order is lost.
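The following minimal sketch reproduces the behavior. The DataFrame, the column names part and value, and the path /tmp/sorted-demo are illustrative assumptions, not from the original report.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data: rows to be ordered by value within each part group.
df = spark.createDataFrame(
    [("a", 3), ("a", 1), ("b", 2), ("a", 2), ("b", 1)],
    ["part", "value"],
)

sorted_df = df.repartition("part").sortWithinPartitions("value")
sorted_df.show()  # displays correctly: rows ordered by value within partitions

# Write partitioned by part; the planned write adds its own local sort.
sorted_df.write.mode("overwrite").partitionBy("part").parquet("/tmp/sorted-demo")

# On affected runtimes, the per-partition order by value is lost here.
spark.read.parquet("/tmp/sorted-demo").show()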

Cause 

There is an issue in which the local sort added by the planned write comes after the local sort from sortWithinPartitions, so the optimizer rule EliminateSorts drops the first sort as unnecessary. This behavior occurs with or without Photon.

Solution

This issue is fixed in Databricks Runtime 15.4 LTS.  

If upgrading is not an option, set the following Apache Spark configuration as a workaround.

spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "false")
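
As a sketch, reusing the illustrative DataFrame and path from the example above, set the configuration before the write:

spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "false")

# With planned writes disabled, EliminateSorts no longer drops the
# sortWithinPartitions sort, so the per-partition order is preserved.
sorted_df.write.mode("overwrite").partitionBy("part").parquet("/tmp/sorted-demo")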