Different tables with same data generate different plans when used in same query

Ensure that tables with the same data generate the same physical plans with Spark SQL.

Written by deepak.bhutada

Last published at: October 14th, 2022

Problem

Assume you have two Delta tables test_table_1 and test_table_2. Both tables have the same schema, same data volume, same partitions, and contain the same number of files. You are doing a join transformation with another Delta table, test_table_join, which has a million records.

When you run the below join queries using test_table_1 and test_table_2, different physical plans are generated, even if both tables are identical.

Query 1:

%sql

SELECT COUNT(A.*) FROM TEST_TABLE_1 A INNER JOIN TEST_TABLE_JOIN B ON 
A.ID=B.ID

Query 2:

%sql

SELECT COUNT(A.*) FROM TEST_TABLE_2 A INNER JOIN TEST_TABLE_JOIN B ON 
A.ID=B.ID

When different physical plans are generated for identical tables it can result in a delay as compared to other queries.

Cause

Apache Spark generates the physical plan for the transformation based on the table statistics. If best practices for Delta tables are not followed, table statistics could be different, even if the tables are otherwise identical. If the table statistics are different, Spark can generate a different plan than it might have done if both tables had the same statistics.

Solution

If you notice different physical plans are being generated for identical Delta tables there are two different solutions you can use to mitigate the issue. The solution you choose depends on the specific Databricks Runtime version you are running on your cluster.

Databricks Runtime 7.3 LTS

Use the Delta table path instead of the table name in the query. This directly reads the data from the path without checking the table statistics.

%sql

SELECT COUNT(A.*) FROM delta.`path` A INNER JOIN TEST_TABLE_JOIN B ON 
A.ID=B.ID

Databricks Runtime 9.1 LTS and above

Run the ANALYZE TABLE (AWS | Azure | GCP) and COMPUTE STATISTICS commands on the table. This calculates the table statistics and stores them in the metadata.

%sql

ANALYZE TABLE test_table_1 COMPUTE STATISTICS;


After implementing the solution, re-run both queries and use a diff checker to compare the resulting physical plans for the two Delta tables.

The plans are now identical.


Was this article helpful?