Decreased performance when using DELETE with a subquery on Databricks Runtime 10.4 LTS

Auto optimize should be disabled when you have a DELETE with a subquery where one side is small enough to be broadcast.

Written by sergios.lalas

Last published at: April 21st, 2023

Problem

Auto optimize on Databricks (AWS | Azure | GCP) is an optional set of features that automatically compact small files during individual writes to a Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively.

Although auto optimize can be beneficial in many situations, you can see decreased performance on Databricks Runtime 10.4 LTS when you have a DELETE with a subquery where one side is small enough to be broadcast.

For instance, the query may look like the following:

DELETE FROM <tableToDelete>
WHERE Date = '<SampleDate>'
  AND SampleID IN (
    SELECT MatchId FROM <OtherTable> WHERE MatchId = '<Value>'
  )

Cause

Optimized writes are enabled by default for DELETE with a subquery on Databricks Runtime 10.4 LTS, on the assumption that the data will be shuffled. When one side is small enough to be broadcast, no shuffle takes place, and the overhead of the optimized write can result in a performance hit.

Solution

If you encounter this issue and do not want to upgrade to a newer Databricks Runtime, you should disable auto optimize on your Delta table by setting delta.autoOptimize.optimizeWrite = false in the table properties.
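
For example, reusing the <tableToDelete> placeholder from the query above, the property can be set with an ALTER TABLE statement:

-- Disable optimized writes for this table
ALTER TABLE <tableToDelete>
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = false)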

You should also set this value in your cluster's Spark config (AWS | Azure | GCP):

spark.databricks.delta.delete.forceOptimizedWrites = false
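
If you prefer to change this behavior for the current session only, rather than at the cluster level, the same configuration can also be set with a SQL SET statement in a notebook. This is a session-scoped alternative to the cluster Spark config shown above:

-- Session-scoped equivalent of the cluster Spark config
SET spark.databricks.delta.delete.forceOptimizedWrites = false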

Info

In Databricks Runtime 11.2 and above, optimized writes for DELETE with subqueries are disabled by default.