How to specify skew hints in dataset and DataFrame-based join commands

Learn how to specify skew hints in Dataset and DataFrame-based join commands in Databricks.

Written by Adam Pavlacka

Last published at: May 31st, 2022

When you perform a join command with DataFrame or Dataset objects, if you find that the query is stuck on finishing a small number of tasks due to data skew, you can specify the skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization (AWS | Azure | GCP) is performed on the DataFrame for which you specify the skew hint.

In addition to the basic hint, you can specify the hint method with the following combinations of parameters: column name, list of column names, and column name and skew value.

  • DataFrame and column name. The skew join optimization is performed on the specified column of the DataFrame.
    %python
    
    df.hint("skew", "col1")
  • DataFrame and multiple columns. The skew join optimization is performed for multiple columns in the DataFrame.
    %python
    
    df.hint("skew", ["col1","col2"])
  • DataFrame, column name, and skew value. The skew join optimization is performed on the data in the column with the skew value.
    %python
    
    df.hint("skew", "col1", "value")

Example

This example shows how to specify the skew hint for multiple DataFrame objects involved in a join operation:

%scala

val joinResults = ds1.hint("skew").as("L").join(ds2.hint("skew").as("R"), $"L.col1" === $"R.col1")