How to specify skew hints in Dataset and DataFrame-based join commands

Learn how to specify skew hints in Dataset and DataFrame-based join commands in Databricks.

Last published at: May 31st, 2022

When you join DataFrame or Dataset objects and the query stalls while finishing a small number of tasks because of data skew, you can specify the skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization is performed on the DataFrame for which you specify the skew hint.

In addition to the basic hint, you can pass the hint method one of the following parameter combinations: a column name, a list of column names, or a column name and a skew value.

DataFrame and column name. The skew join optimization is performed on the specified column of the DataFrame.
```
%python

df.hint("skew", "col1")
```
DataFrame and multiple columns. The skew join optimization is performed for multiple columns in the DataFrame.
```
%python

df.hint("skew", ["col1","col2"])
```
DataFrame, column name, and skew value. The skew join optimization is performed on the rows of the specified column that hold the given skew value.
```
%python

df.hint("skew", "col1", "value")
```
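The skew-value form requires knowing which value dominates the column. The following is a minimal sketch of one way to find candidates, assuming df is a PySpark DataFrame. The helper name top_key_counts and its parameters are hypothetical and not part of any Databricks API; it uses only the standard DataFrame methods groupBy, count, orderBy, and limit.

```python
def top_key_counts(df, column, n=5):
    """Return the n most frequent values of `column` with their row counts.

    Assumes `df` is a PySpark DataFrame. The helper name and its
    parameters are illustrative only.
    """
    return (
        df.groupBy(column)                    # one group per distinct key value
        .count()                              # adds a `count` column: rows per key
        .orderBy("count", ascending=False)    # most frequent keys first
        .limit(n)                             # keep only the top n keys
    )
```

For example, top_key_counts(df, "col1").show() lists the most frequent values of col1. A value that accounts for a disproportionate share of the rows is a reasonable candidate for the skew value.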

Example

This example shows how to specify the skew hint for multiple DataFrame objects involved in a join operation:

```
%scala

val joinResults = ds1.hint("skew").as("L").join(ds2.hint("skew").as("R"), $"L.col1" === $"R.col1")
```
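
In a Python notebook, the same join might look like the sketch below. It assumes ds1 and ds2 are two distinct PySpark DataFrames that each have a col1 column. The wrapper function skew_hinted_join is a hypothetical name used only to keep the example self-contained.

```python
def skew_hinted_join(ds1, ds2, key="col1"):
    """Join two DataFrames on `key`, specifying the skew hint on both sides.

    Assumes `ds1` and `ds2` are distinct PySpark DataFrames that both
    contain `key`. The function name is illustrative only.
    """
    left = ds1.hint("skew").alias("L")
    right = ds2.hint("skew").alias("R")
    # Same join condition as $"L.col1" === $"R.col1" in the Scala example.
    return left.join(right, left[key] == right[key])
```

Calling skew_hinted_join(ds1, ds2) returns the joined DataFrame. To restrict the hint to specific columns or values, replace hint("skew") with one of the parameterized forms described above.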
