All rows from all partitions of large table being scanned when a JOIN has been performed using the partition columns in the join key, and execution times are longer

Add static filters in the query to use Apache Spark features Dynamic Partition Pruning (DPP) and Dynamic File pruning (DFP).

Written by jayant.sharma

Last published at: April 30th, 2025

Problem

When you attempt to join a larger fact table Table1 (on the left) partitioned by <partition-column> with a dimension table Table2 (on the right), you expect only the relevant partition of the large table to be read. 

 

However, you notice the large table is scanned with all the rows from all partitions when a JOIN has been performed using the partition columns in the join key, leading to longer execution time.

 

Example code

%sql
SELECT 
  t1.id, MIN(t2.<partition-column>) AS <col-name>
FROM
  <catalog>.<schema>.Table1 t1
JOIN
  <catalog>.<schema>.Table2 t2
ON
  t1.<partition-column> = t2.<partition-column> AND
  t1.id = t2.id 
GROUP BY
  t1.id

 

Cause

The primary reason for this behavior is Dynamic Partition Pruning (DPP) and Dynamic File pruning (DFP) are not triggering.

 

Additional context

Apache Spark’s SQL optimizer includes intelligent cost-based logic to determine whether applying dynamic pruning techniques, such as Dynamic Partition Pruning (DPP) or Dynamic File Pruning (DFP), will result in performance benefits. In some cases, these optimizations are deliberately skipped if they are expected to introduce more overhead than benefit.

 

For instance, when performing a join between a large fact table and a dimension table with a high number of partitions, the optimizer may determine that executing an additional subquery to identify filter values for pruning on the build side of the join is more expensive than executing the join without pruning. If this cost-benefit analysis shows minimal or negative performance gain, Spark avoids applying dynamic pruning altogether.

 

DPP and DFP not triggering often occurs when the dimension table contains a large number of partitions and the query does not include filter conditions such as a WHERE clause. Dynamic pruning features like DPP and DFP are designed to optimize queries by reducing the amount of data read at runtime. However, their effectiveness is limited to queries that contain filter predicates. 

 

For more details, refer to the Dynamic file pruning (AWSAzureGCP) documentation.

 

Example code

%sql
SELECT * FROM
  <catalog>.<schema>.Table1 t1
WHERE 
  t1.<partition-column> >= “<value>”  —- Static filter in the query

 

Solution

Take advantage of DPP and DFP by adding static filters in the query.

%sql
SELECT 
  t1.id, MIN(

t2.<partition-column>) AS <col-name>
FROM
  <catalog>.<schema>.Table1 t1
JOIN
  <catalog>.<schema>.Table2 t2
ON
  t1.<partition-column> IN (“value1”, “value2”, …., “valueN”)
AND
  t1.id = t2.id 
GROUP BY
  t1.id

 

To read more about how dynamic file pruning works and how to analyze it from Spark UI, refer to the Databricks blog post, Faster SQL Queries on Delta Lake with Dynamic File Pruning.