Problem
When you submit queries through a workbench or local environment, they typically take less than ten minutes. However, when you execute the same Redshift queries through Lakehouse Federation in Databricks, they take approximately 30 minutes.
Cause
The query involves projecting timestamps and performing joins on timestamp columns.
Databricks uses the v2 optimization rule (v2 rule), which does not support projecting timestamps or joining on timestamp columns.
When the v2 rule can’t handle the query, Databricks attempts to push the operation down to the Redshift optimization, which also does not support the timestamp type. Because the query structure is incompatible with the available optimization rules, the optimization fails and query execution falls back to the original, unoptimized plan, causing the performance issue.
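For illustration, a federated query shaped like the following sketch hits this limitation because it both projects timestamp columns and joins on one. The catalog, schema, table, and column names are hypothetical placeholders, not from your environment.

# Hypothetical federated query: projects timestamp columns and joins on them.
# Catalog, schema, table, and column names are placeholders.
df = spark.sql("""
    SELECT o.order_id,
           o.order_ts,                 -- projected timestamp column
           e.event_ts                  -- projected timestamp column
    FROM redshift_catalog.sales.orders o
    JOIN redshift_catalog.sales.events e
        ON o.order_ts = e.event_ts     -- join on timestamp columns
""")
df.explain()  # inspect the physical plan to see how much of the query is pushed down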
Solution
Adjust the JDBC fetchsize parameter
Adjusting the fetchsize parameter helps mitigate the performance impact caused by the fallback to an unoptimized plan. Processing records in smaller batches based on the fetchsize parameter improves performance compared to loading all data at once.
Experiment with the fetchsize parameter to find the optimal value, which may require trial and error. Configure fetchsize within your spark.read.option setting. In the following example, fetchsize is set to 1000 rows.
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://<your-cluster-endpoint>/<your-db-name>") \
.option("dbtable", "<your-table-name>”) \
.option("user", "<your-username>") \
.option("password", "<your-password>") \
.option("fetchsize", "1000") \
.load()
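As a quick check, you can trigger the read and confirm the data returns; the JDBC driver fetches rows from Redshift in batches of the configured fetch size. The actions below are only an example.

# Trigger the read; rows are fetched in batches of 1000 per the fetchsize option above.
df.show(10)
# A full action such as count() streams the entire table in those batches.
print(df.count())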
For more information on JDBC fetch size, review Amazon’s Troubleshooting connection issues in Amazon Redshift documentation.
Use the COPY UNLOAD method
The COPY UNLOAD approach leverages Redshift's efficient UNLOAD command to export data to S3 and then reads it into Apache Spark, which optimizes query performance for large datasets.
First, unload the data from Redshift to S3 in AWS.
UNLOAD ('SELECT * FROM <your-table-name>')
TO 's3://<your-bucket-name>/unload/'
IAM_ROLE '<your-IAM-role-arn>';
Then read the unloaded data from S3 into Spark in Databricks.
df = spark.read.csv("s3://<your-bucket-name>/unload/")
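If you include the FORMAT AS CSV and HEADER options in the UNLOAD statement (both supported by Redshift), you can read the exported files back with explicit CSV options instead of relying on defaults. The following is a minimal sketch under that assumption; otherwise, set the delimiter to match whatever you chose in UNLOAD.

# Minimal sketch, assuming the UNLOAD above was run with FORMAT AS CSV and HEADER,
# so the exported files are comma-delimited with a header row.
df = (
    spark.read
    .option("header", "true")       # first row holds column names
    .option("inferSchema", "true")  # let Spark infer column types
    .csv("s3://<your-bucket-name>/unload/")
)
df.printSchema()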
For more information on the UNLOAD method, review Amazon’s UNLOAD examples documentation.