Assume you have two Delta tables test_table_1 and test_table_2. Both tables have the same schema, same data volume, same partitions, and contain the same number of files. You are doing a join transformation with another Delta table, test_table_join, which has a million records.
When you run the below join queries using test_table_1 and test_table_2, different physical plans are generated, even if both tables are identical.
%sql SELECT COUNT(A.*) FROM TEST_TABLE_1 A INNER JOIN TEST_TABLE_JOIN B ON A.ID=B.ID
%sql SELECT COUNT(A.*) FROM TEST_TABLE_2 A INNER JOIN TEST_TABLE_JOIN B ON A.ID=B.ID
When different physical plans are generated for identical tables it can result in a delay as compared to other queries.
Apache Spark generates the physical plan for the transformation based on the table statistics. If best practices for Delta tables are not followed, table statistics could be different, even if the tables are otherwise identical. If the table statistics are different, Spark can generate a different plan than it might have done if both tables had the same statistics.
If you notice different physical plans are being generated for identical Delta tables there are two different solutions you can use to mitigate the issue. The solution you choose depends on the specific Databricks Runtime version you are running on your cluster.
Databricks Runtime 7.3 LTS
Use the Delta table path instead of the table name in the query. This directly reads the data from the path without checking the table statistics.
%sql SELECT COUNT(A.*) FROM delta.`path` A INNER JOIN TEST_TABLE_JOIN B ON A.ID=B.ID
Databricks Runtime 9.1 LTS and above
Run the ANALYZE TABLE (AWS | Azure | GCP) and COMPUTE STATISTICS commands on the table. This calculates the table statistics and stores them in the metadata.
%sql ANALYZE TABLE test_table_1 COMPUTE STATISTICS;
After implementing the solution, re-run both queries and use a diff checker to compare the resulting physical plans for the two Delta tables.
The plans are now identical.