Problem
When you submit queries through a workbench or local environment, they typically take less than ten minutes. However, when you execute the same Redshift queries through Lakehouse Federation in Databricks, they take approximately 30 minutes.
Cause
The query involves projecting timestamps and performing joins on timestamp columns.
Databricks uses the v2 optimization rule (v2 rule), which does not support projecting timestamps or joining on timestamp columns.
When the v2 rule can’t handle the query, Databricks attempts to push the operation down to the Redshift optimization, which also does not support the timestamp type. Because the query structure is incompatible with the available optimization rules, the optimization fails and query execution falls back to the original, unoptimized plan, causing the performance issue.
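For illustration, a federated query shaped like the following sketch hits this limitation because it both projects timestamp columns and joins on one. The catalog, schema, table, and column names are hypothetical placeholders, not from your environment.

# Hypothetical federated query: projects timestamp columns and joins on them.
# Catalog, schema, table, and column names are placeholders.
df = spark.sql("""
    SELECT o.order_id,
           o.order_ts,                 -- projected timestamp column
           e.event_ts                  -- projected timestamp column
    FROM redshift_catalog.sales.orders o
    JOIN redshift_catalog.sales.events e
        ON o.order_ts = e.event_ts     -- join on timestamp columns
""")
df.explain()  # inspect the physical plan to see how much of the query is pushed down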
Solution
Adjust the JDBC fetchsize parameter
Adjusting the fetchsize parameter helps mitigate the performance impact caused by the fallback to an unoptimized plan. Processing records in smaller batches based on the fetchsize parameter improves performance compared to loading all data at once.
Experiment with the fetchsize parameter to find the optimal value, which may require trial and error. Configure fetchsize within your spark.read.option setting. In the following example, fetchsize is set to 1000 rows.
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://<your-cluster-endpoint>/<your-db-name>") \
.option("dbtable", "<your-table-name>”) \
.option("user", "<your-username>") \
.option("password", "<your-password>") \
.option("fetchsize", "1000") \
.load()
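As a quick check, you can trigger the read and confirm the data returns; the JDBC driver fetches rows from Redshift in batches of the configured fetch size. The actions below are only an example.

# Trigger the read; rows are fetched in batches of 1000 per the fetchsize option above.
df.show(10)
# A full action such as count() streams the entire table in those batches.
print(df.count())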
For more information on JDBC fetch size, review Amazon’s Troubleshooting connection issues in Amazon Redshift documentation.
Use the COPY UNLOAD method
The COPY UNLOAD approach leverages Redshift's efficient UNLOAD command to export data to S3 and then reads it into Apache Spark, which optimizes query performance for large datasets.
First, unload the data from Redshift to S3 in AWS.
UNLOAD ('SELECT * FROM <your-table-name>')
TO 's3://<your-bucket-name>/unload/'
IAM_ROLE '<your-IAM-role-arn>';
Then read the unloaded data from S3 into Spark in Databricks.
df = spark.read.csv("s3://<your-bucket-name>/unload/")
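If you include the FORMAT AS CSV and HEADER options in the UNLOAD statement (both supported by Redshift), you can read the exported files back with explicit CSV options instead of relying on defaults. The following is a minimal sketch under that assumption; otherwise, set the delimiter to match whatever you chose in UNLOAD.

# Minimal sketch, assuming the UNLOAD above was run with FORMAT AS CSV and HEADER,
# so the exported files are comma-delimited with a header row.
df = (
    spark.read
    .option("header", "true")       # first row holds column names
    .option("inferSchema", "true")  # let Spark infer column types
    .csv("s3://<your-bucket-name>/unload/")
)
df.printSchema()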
For more information on the UNLOAD method, review Amazon’s UNLOAD examples documentation.