Problem
Complex SQL queries against Snowflake federated tables take longer than expected to run in Databricks.
Cause
Certain query components, such as joins or aggregations, are performed in Databricks instead of being pushed down to Snowflake for processing. This increases data transfer over the network and reduces query efficiency.
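One way to confirm where the work is happening is to inspect the physical plan of the slow query from a notebook. The sketch below assumes a federated catalog named snowflake_catalog and hypothetical tables orders and customers; the exact plan node names depend on the connector and runtime version, but Spark-side aggregate or join operators indicate that those steps did not reach Snowflake.

# Hypothetical federated tables; replace the catalog, schema, and table names
# with your own. The query shape (join + aggregation) mirrors the slow case.
df = spark.sql("""
    SELECT c.region, COUNT(*) AS order_count
    FROM snowflake_catalog.sales.orders o
    JOIN snowflake_catalog.sales.customers c
      ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

# If the physical plan shows Spark aggregate/join operators instead of a single
# pushed-down query, the join and aggregation are running in Databricks.
df.explain(True)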
Context
By default, the Apache Spark configuration flag spark.databricks.optimizer.aggregatePushdown.enabled is set to true. When enabled, Spark tries to push down partial aggregations below join nodes. However, the Snowflake connector’s pushdown strategy does not always support partial aggregations.
As a result, if the node below the join cannot be pushed down, the join itself is also not pushed down, and both are executed in Databricks instead of Snowflake.
This leads to multiple queries being sent to Snowflake, with the join and aggregation performed on the Databricks side, significantly slowing down query performance.
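To see how the flag is currently set, you can read it from a notebook session before changing anything. This is a minimal sketch; the second argument to spark.conf.get is only a fallback, used here because true is the documented default.

# Read the current value of the aggregate pushdown flag for this session.
# "true" is the fallback when the flag has not been set explicitly.
print(spark.conf.get("spark.databricks.optimizer.aggregatePushdown.enabled", "true"))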
Solution
Set spark.databricks.optimizer.aggregatePushdown.enabled to false in your cluster settings. This prevents Spark from generating partial aggregate nodes, allowing both the join and the aggregation to be pushed down entirely to Snowflake. Snowflake can then process the query more efficiently and return only the final aggregated results to Databricks, speeding up query execution. A session-level check is also sketched after the steps below.
1. In the Databricks UI, click Compute in the left navigation menu.
2. Select the cluster you are using.
3. In the Cluster Configuration tab, click the Edit button in the top right.
4. Scroll down to the Advanced Options section and click to expand.
5. Enter the configuration below in the Spark > Spark Config field.
spark.databricks.optimizer.aggregatePushdown.enabled false
6. Save the changes and restart the cluster for the new configuration to take effect.
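If you want to verify the effect before editing the cluster, a session-level override from a notebook is a quick check. This is a sketch under the assumption that the optimizer picks up the runtime value for subsequent queries in the session; the cluster-level Spark config set in the steps above remains the persistent fix.

# Disable partial aggregate pushdown for the current session only.
# This does not survive a cluster restart; use the cluster Spark config
# from the steps above for a permanent change.
spark.conf.set("spark.databricks.optimizer.aggregatePushdown.enabled", "false")

# Re-run the slow query and compare the physical plan and runtime.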