Job failing during batch processing with PartitioningCollection error

Remove unnecessary DataFrame caches and circular dependencies between DataFrames.

Written by Vidhi Khaitan

Last published at: August 18th, 2025

Problem

During batch processing, your job fails with the following error.

"PartitioningCollection requires all of its partitionings have the same numPartitions"


Cause

There is a mismatch in the number of partitions across different operations or stages within the Apache Spark application.


In these workloads, batch processing involves repeatedly reading, transforming, and caching intermediate DataFrames for reuse across multiple micro-batches.


When a DataFrame is cached and later reused in operations that eventually lead back to the same transformation path, Spark encounters cyclic references and inconsistent partition plans. This leads to a mismatch between the cached DataFrame's partitioning and the expected partitioning in subsequent operations.


Differences in the number of partitions between joined or unioned DataFrames can cause execution plan conflicts.
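
You can confirm the mismatch before fixing it by comparing the partition counts of the inputs directly. A minimal diagnostic sketch, assuming df1 and df2 are the DataFrames being joined or unioned:

# A mismatch between these two counts is the usual precursor
# to the PartitioningCollection error.
print(df1.rdd.getNumPartitions())
print(df2.rdd.getNumPartitions())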


Solution

First, do not cache DataFrames that are part of circular transformations. For example, the following code contains a problematic df.cache() call.

df = df.transform(...)
df.cache()  # caching mid-lineage pins a partitioning that can conflict with later reuse
df = df.join(df2, ...)


Remove the df.cache() line.

df = df.transform(...)
df = df.join(df2, ...)
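
For context, here is a minimal end-to-end sketch of the corrected pattern. The sample data, column names, and the add_flag helper are illustrative assumptions, not taken from the original job:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-fix").getOrCreate()

# Hypothetical inputs standing in for the job's real DataFrames.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, 10), (2, 20)], ["id", "score"])

def add_flag(input_df):
    # Placeholder transformation; the job's real logic goes here.
    return input_df.withColumn("flag", F.lit(True))

# No cache between the transform and the join, so Spark plans the
# whole lineage in one pass with a consistent partitioning.
df = df.transform(add_flag)
result = df.join(df2, "id")
result.show()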


Use .repartition(n) or .coalesce(n) explicitly before key operations such as join or union so that both sides have the same number of partitions. Note that .coalesce(n) can only reduce the partition count, so use .repartition(n) when you need to increase it.

num_partitions = 200
# Align both sides on the same partition count before the join.
df1 = df1.repartition(num_partitions)
df2 = df2.repartition(num_partitions)
joined_df = df1.join(df2, "id")
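
The same alignment applies to union. A short sketch, again assuming df1 and df2 share a schema:

num_partitions = 200
# Repartition both inputs identically so the unioned plan has a
# single, consistent partitioning.
unioned_df = df1.repartition(num_partitions).union(df2.repartition(num_partitions))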