Problem
Auto optimize on Databricks (AWS | Azure | GCP) is an optional set of features that automatically compact small files during individual writes to a Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively.
Although auto optimize is beneficial in many situations, you may see decreased performance on Databricks Runtime 10.4 LTS when you run a DELETE with a subquery in which one side of the join is small enough to be broadcast.
For instance, the query may look like the following:
DELETE FROM <tableToDelete>
WHERE Date = <'SampleDate'>
  AND SampleID IN (SELECT MatchId FROM <OtherTable> WHERE MatchId = 'Value')
Cause
Optimized writes are enabled by default for DELETE with a subquery on Databricks Runtime 10.4 LTS, on the assumption that the data will be shuffled. When one side of the subquery join is small enough to be broadcast instead, no shuffle occurs, and the optimized write adds overhead without providing any benefit.
Solution
If you encounter this issue, and you do not want to upgrade to a newer Databricks Runtime, you should disable optimized writes on your Delta table by setting delta.autoOptimize.optimizeWrite = false in the table properties.
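On an existing table, for example, this can be done with an ALTER TABLE statement. The table name below is a placeholder; substitute your own:

```sql
-- Disable optimized writes for this table only.
-- <tableToDelete> is a placeholder table name.
ALTER TABLE <tableToDelete> SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = false);
```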
You should also disable forced optimized writes for DELETE in your cluster's Spark config (AWS | Azure | GCP):
spark.databricks.delta.delete.forceOptimizedWrites = false
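If editing the cluster configuration is not an option, one alternative (a sketch, assuming you are running in a Databricks notebook) is to set the flag for the current session only, before running the DELETE:

```sql
-- Session-level alternative: applies only to the current notebook session,
-- leaving the cluster-wide default unchanged.
SET spark.databricks.delta.delete.forceOptimizedWrites = false;
```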