When you join DataFrame or Dataset objects and the query stalls while finishing a small number of tasks because of data skew, you can specify the skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization (AWS | Azure | GCP) is performed on the DataFrame for which you specify the skew hint.
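To see why skew leaves a few tasks running long after the rest have finished, the following pure-Python sketch counts how many rows land in each partition when rows are distributed by join key. It is an illustration only: the function name rows_per_partition, the sample keys, and the modulo used in place of a real hash function are all assumptions and do not reflect how the engine partitions data internally.

```python
from collections import Counter


def rows_per_partition(keys, num_partitions):
    """Count the rows sent to each partition when rows are distributed by key."""
    counts = Counter()
    for key in keys:
        # Rows with the same key always go to the same partition.
        # A plain modulo stands in for the engine's hash function (assumption).
        counts[key % num_partitions] += 1
    return [counts[p] for p in range(num_partitions)]


# Hypothetical data: key 7 appears 9,000 times, keys 0..999 appear once each.
keys = [7] * 9000 + list(range(1000))
sizes = rows_per_partition(keys, num_partitions=8)
print(sizes)  # [125, 125, 125, 125, 125, 125, 125, 9125]
```

Seven partitions hold 125 rows each, while the partition that receives key 7 holds 9,125. The task that processes that partition is the one the query waits on.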
In addition to the basic hint, you can call the hint method with the following combinations of parameters: a column name, a list of column names, or a column name and a skew value.
- DataFrame and column name. The skew join optimization is performed on the specified column of the DataFrame.
%python
df.hint("skew", "col1")
- DataFrame and multiple columns. The skew join optimization is performed on the specified columns of the DataFrame.
%python
df.hint("skew", ["col1","col2"])
- DataFrame, column name, and skew value. The skew join optimization is performed on the rows in the specified column that contain the skew value.
%python
df.hint("skew", "col1", "value")
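If you choose the form of the hint at run time, a small wrapper can keep the call sites uniform. The sketch below is one possible way to do that. The helper with_skew_hint and its parameters columns and skew_value are hypothetical names and not part of any Spark or Databricks API. It assumes df is a DataFrame on a runtime that recognizes the skew hint, and it only forwards to the df.hint forms described above.

```python
def with_skew_hint(df, columns=None, skew_value=None):
    """Apply the skew hint to df (hypothetical helper, not a Spark API).

    columns:    a column name, a list of column names, or None for the basic hint
    skew_value: an optional skew value, used with a single column name
    """
    if columns is None:
        if skew_value is not None:
            raise ValueError("skew_value requires a column name")
        # Basic form: df.hint("skew")
        return df.hint("skew")
    if skew_value is None:
        # Column form: a single name or a list of names
        return df.hint("skew", columns)
    if not isinstance(columns, str):
        raise ValueError("skew_value is shown here with a single column name only")
    # Column name plus skew value
    return df.hint("skew", columns, skew_value)
```

For example, with_skew_hint(df, "col1", "value") forwards to df.hint("skew", "col1", "value"), and with_skew_hint(df) forwards to df.hint("skew").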
Example
This example shows how to specify the skew hint for multiple DataFrame objects involved in a join operation:
%scala
val joinResults = ds1.hint("skew").as("L")
  .join(ds2.hint("skew").as("R"), $"L.col1" === $"R.col1")