How to improve performance with bucketing

Learn how to improve Databricks performance by using bucketing.

Written by Adam Pavlacka

Last published at: February 29th, 2024

Bucketing is an optimization technique in Apache Spark SQL. Rows are allocated among a specified number of buckets according to a hash of one or more bucketing columns. Bucketing improves performance by shuffling and sorting the data once, when the table is written, so that downstream operations such as joins on the bucketing columns can skip those steps. The tradeoff is the initial overhead of shuffling and sorting at write time, but for transformations that repeatedly join or aggregate on the same columns, this technique can improve performance by avoiding later shuffling and sorting.
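The allocation step can be illustrated without Spark. The following is a conceptual sketch only: it uses `zlib.crc32` as a stand-in hash (Spark uses its own Murmur3-based hash), and every name in it (`bucket_id`, `bucketize`, `bucketed_join`, the sample tables) is hypothetical. It shows why two tables bucketed the same way on the join key can be joined one bucket at a time, with no rows moving between buckets.

```python
# Conceptual illustration only. Spark uses a Murmur3-based hash, not zlib.crc32,
# and all function and table names here are hypothetical.
import zlib
from collections import defaultdict

NUM_BUCKETS = 4


def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Map a bucketing-column value to a bucket index in [0, num_buckets)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Allocate rows to buckets by their first field, then sort each bucket by it."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    return {b: sorted(rs, key=lambda r: r[0]) for b, rs in buckets.items()}


def bucketed_join(left, right):
    """Join two identically bucketed tables bucket by bucket.

    Equal keys always hash to the same bucket index, so each bucket of
    `left` only needs to be compared with the same bucket of `right`.
    """
    joined = []
    for b, left_rows in left.items():
        right_by_key = defaultdict(list)
        for r in right.get(b, []):
            right_by_key[r[0]].append(r)
        for l in left_rows:
            for r in right_by_key.get(l[0], []):
                joined.append(l + r[1:])
    return joined


orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
customers = [(1, "Ana"), (2, "Ben"), (4, "Cy")]

result = bucketed_join(bucketize(orders), bucketize(customers))
print(sorted(result))
```

The up-front work happens in `bucketize`, which mirrors the write-time shuffle and sort. The join itself then touches only matching buckets, which is the saving that bucketing aims for.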

This technique is useful for dimension tables, which are frequently used tables containing primary keys. It is also useful when large and small tables are frequently joined on the same columns.
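As a rough sketch of how this might look in PySpark, the function below writes plain and bucketed copies of a large and a small table and prints the physical plan of each join. It assumes a Databricks notebook where `spark` is already defined. The table names, the `key` column, the row counts, and the bucket count are all hypothetical. It also assumes a file-based format such as Parquet for the bucketed tables, since Delta tables do not accept `bucketBy`.

```python
# Sketch under stated assumptions: `spark` is an existing SparkSession
# (as in a Databricks notebook). Table names, column names, and sizes
# are hypothetical and chosen only for illustration.


def compare_join_plans(spark, num_buckets=16):
    """Write plain and bucketed copies of two tables and print both join plans."""
    # Disable broadcast joins so the small table does not hide the shuffle.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

    facts = spark.range(1_000_000).withColumnRenamed("id", "key")
    dims = spark.range(1_000).withColumnRenamed("id", "key")

    # Unbucketed copies.
    facts.write.mode("overwrite").saveAsTable("facts_plain")
    dims.write.mode("overwrite").saveAsTable("dims_plain")

    # Bucketed copies: same bucket count and same bucketing column on both sides.
    for df, name in ((facts, "facts_bucketed"), (dims, "dims_bucketed")):
        (
            df.write.format("parquet")
            .bucketBy(num_buckets, "key")
            .sortBy("key")
            .mode("overwrite")
            .saveAsTable(name)
        )

    plain = spark.table("facts_plain").join(spark.table("dims_plain"), "key")
    bucketed = spark.table("facts_bucketed").join(spark.table("dims_bucketed"), "key")

    # Print the physical plans for comparison.
    plain.explain()
    bucketed.explain()
    return plain, bucketed
```

Calling `compare_join_plans(spark)` in a notebook should print two plans. One would typically expect the plan for the plain tables to contain `Exchange` (shuffle) operators that are absent from the plan for the bucketed tables, although the exact output depends on the Spark version and configuration.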

The example notebook below shows the differences in physical plans when performing joins of bucketed and unbucketed tables.

Bucketing example notebook
