Prevent duplicated columns when joining two DataFrames

Learn how to prevent duplicated columns when joining two DataFrames in Databricks.

Last published at: October 13th, 2022

If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don’t have duplicated columns.

Join on columns

If you join on columns, you get duplicated columns.

Scala

%scala

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")

val df = left.join(right, left.col("name") === right.col("name"))

Python

%python

llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)]
left = spark.createDataFrame(llist, ['name','date','duration'])
right = spark.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])

df = left.join(right, left.name == right.name)

Solution

Specify the join column as an array type or string.

Scala

%scala

val df = left.join(right, Seq("name"))

%scala

val df = left.join(right, "name")

Python

%python

df = left.join(right, ["name"])

%python

df = left.join(right, "name")

R

First register the DataFrames as tables.

%python

left.createOrReplaceTempView("left_test_table")
right.createOrReplaceTempView("right_test_table")

%r

library(SparkR)
sparkR.session()
left <- sql("SELECT * FROM left_test_table")
right <- sql("SELECT * FROM right_test_table")

The above code results in duplicate columns. The following code does not.

%r

head(drop(join(left, right, left$name == right$name), left$name))

Join DataFrames with duplicated columns notebook

Review the Join DataFrames with duplicated columns example notebook.

Databricks Knowledge Base

Contact Us

Join on columns

Solution

Scala

Python

R

Join DataFrames with duplicated columns notebook