Prevent duplicated columns when joining two DataFrames
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don’t have duplicated columns.
Join on columns
If you join on columns, you get duplicated columns.
Scala
%scala
val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")
val df = left.join(right, left.col("name") === right.col("name"))
Python
%python
llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)]
left = spark.createDataFrame(llist, ['name','date','duration'])
right = spark.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])
df = left.join(right, left.name == right.name)
Solution
Specify the join column as an array type or string.
Scala
%scala
val df = left.join(right, Seq("name"))
%scala
val df = left.join(right, "name")
Python
%python
df = left.join(right, ["name"])
%python
df = left.join(right, "name")
R
First register the DataFrames as tables.
%python
left.createOrReplaceTempView("left_test_table")
right.createOrReplaceTempView("right_test_table")
%r
library(SparkR)
sparkR.session()
left <- sql("SELECT * FROM left_test_table")
right <- sql("SELECT * FROM right_test_table")
The above code results in duplicate columns. The following code does not.
%r
head(drop(join(left, right, left$name == right$name), left$name))