Behavior of the randomSplit method

Learn about inconsistent behaviors when using the randomSplit method in Databricks.

Written by Adam Pavlacka

Last published at: May 31st, 2022

When using randomSplit on a DataFrame, you may observe inconsistent behavior. Here is an example:

%python

from pyspark.sql.functions import broadcast

df = spark.read.format('inconsistent_data_source').load()
a, b = df.randomSplit([0.5, 0.5])

# The two splits should be disjoint, so this count should be 0
a.join(broadcast(b), on='id', how='inner').count()

Typically this query returns 0, because the two splits are expected to be disjoint. However, depending on the underlying data source or input DataFrame, in some cases the query can return more than 0 records, indicating that some rows were assigned to both splits.

This unexpected behavior is explained by the fact that the data distribution across RDD partitions is not idempotent: it can be rearranged or updated during query execution. randomSplit samples each partition independently, and each resulting split re-evaluates the source when an action runs against it, so if the partition contents change between evaluations, rows can land in both splits or in neither.
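A quick way to surface the issue is to check both failure modes directly. This is a minimal diagnostic sketch, assuming df has a unique id column as in the example above; note that each count() is a separate action that re-evaluates the source, so the numbers can themselves vary between runs:

%python

a, b = df.randomSplit([0.5, 0.5])

# Rows assigned to both splits; should be 0
overlap = a.join(b, on='id', how='inner').count()

# Rows assigned to neither split; should be 0 (a negative value
# indicates duplicated rows instead)
missing = df.count() - a.count() - b.count()

print(overlap, missing)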

Info

Spark DataFrames and RDDs preserve partitioning order; this problem only exists when the query output depends on the actual data distribution across partitions, for example, when values from files 1, 2, and 3 always appear in partition 1.

The issue can also be observed when using the Delta cache (AWS | Azure | GCP). All of the solutions listed below still apply in this case.

Solution

Do one of the following:

  • Use explicit Apache Spark RDD caching
    %python
    
    # Materialize the input once so both splits sample the same data
    df = inputDF.cache()
    a, b = df.randomSplit([0.5, 0.5])
  • Repartition by a column or a set of columns
    %python
    
    # Shuffle the data into a deterministic layout keyed on 'col1'
    df = inputDF.repartition(100, 'col1')
    a, b = df.randomSplit([0.5, 0.5])
  • Apply an aggregate function
    %python
    
    # The aggregation forces a shuffle that pins the row distribution
    df = inputDF.groupBy('col1').count()
    a, b = df.randomSplit([0.5, 0.5])

These operations persist or shuffle the data, resulting in a consistent data distribution across partitions in Spark jobs. A consolidated sketch of the caching approach, combined with the disjointness check from the first example, follows.
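This is a minimal end-to-end sketch, assuming an input DataFrame inputDF with a unique id column (inputDF is a placeholder name, as in the examples above):

%python

from pyspark.sql.functions import broadcast

# Materialize the input once; the count() action populates the cache
df = inputDF.cache()
df.count()

a, b = df.randomSplit([0.5, 0.5])

# With a cached input the splits are disjoint, so this count
# should now reliably be 0
print(a.join(broadcast(b), on='id', how='inner').count())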