Using collect_list after transformations such as JOIN returns inconsistent counts even though the underlying data doesn’t change

Sort the collect_list output using array_sort before performing joins.

Written by manikandan.ganesan

Last published at: April 29th, 2025

Problem

When using collect_list after a transformation such as a JOIN or GROUP BY, you observe fewer resultant records than you expected, and the count varies with each execution even though the underlying data remains unchanged. 


Example code

%python
from pyspark.sql import functions as F

df2 = df1.groupBy('column_1').agg(F.collect_list('column_2').alias('collect_list_output'))


Cause

The Apache Spark collect_list function is non-deterministic. Its results depend on the order of rows, which may also be non-deterministic after a shuffle. This can lead to unexpected results in subsequent joins or transformations.
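To see why row order matters, here is a plain-Python illustration (not Spark code; the variable names are hypothetical). The same elements collected in two different shuffle orders produce unequal list values, so a join keyed on the collected column can match on one run and miss on the next.

```python
# Plain-Python illustration: the same elements collected in two different
# shuffle orders yield unequal list values.
run_1 = ["b", "a", "c"]  # element order after one execution
run_2 = ["a", "c", "b"]  # element order after another execution

print(run_1 == run_2)                  # False: same data, different lists
print(sorted(run_1) == sorted(run_2))  # True: sorting makes them comparable
```

Sorting normalizes both lists to the same value, which is the behavior `array_sort` provides in Spark.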


Solution

Modify your code to sort the list using the array_sort function after collect_list. This ensures a consistent element order in the derived column, which helps prevent unexpected results in subsequent joins or transformations.


Example code - modified

%python
from pyspark.sql import functions as F

df2 = df1.groupBy('column_1').agg(F.array_sort(F.collect_list('column_2')).alias('sorted_collect_list_output'))


Preventative measures

Ensure your rows and transformations are deterministic by making code changes suited to your specific use case, such as explicitly ordering rows. This is especially important if you also use other non-deterministic functions, such as collect_set(), first(), and last(), or window functions like row_number() with duplicate ordering keys.
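The duplicate-ordering-key case can be sketched in plain Python (not Spark; the row tuples and names are hypothetical). When two rows tie on the ordering column, the rank each one receives depends on the order the rows arrive in, which Spark does not guarantee after a shuffle.

```python
# Plain-Python sketch: rows as (key, order_col, payload). Both rows share
# order_col=10, so a rank assigned after sorting on order_col depends on
# the incoming row order, which is not guaranteed after a shuffle.
rows_run_1 = [("a", 10, "x"), ("a", 10, "y")]  # arrival order in one run
rows_run_2 = [("a", 10, "y"), ("a", 10, "x")]  # arrival order in another run

# Python's sort is stable, so ties keep arrival order, just as row_number()
# over duplicate ordering keys depends on partition row order.
rank_1 = {payload: i + 1 for i, (_, _, payload)
          in enumerate(sorted(rows_run_1, key=lambda r: r[1]))}
rank_2 = {payload: i + 1 for i, (_, _, payload)
          in enumerate(sorted(rows_run_2, key=lambda r: r[1]))}

print(rank_1)  # {'x': 1, 'y': 2}
print(rank_2)  # {'y': 1, 'x': 2} - same data, different rank assignment
```

Adding a unique tiebreaker column to the ordering makes the rank deterministic regardless of arrival order.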


The solution provided in this article is scoped to collect_list.