If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don’t have duplicated columns.
Join on columns
If you join on columns, you get duplicated columns.
Scala
%scala val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10)) val left = llist.toDF("name","date","duration") val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload") val df = left.join(right, left.col("name") === right.col("name"))
Python
%python llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)] left = spark.createDataFrame(llist, ['name','date','duration']) right = spark.createDataFrame([('alice', 100),('bob', 23)],['name','upload']) df = left.join(right, left.name == right.name)
Solution
Specify the join column as an array type or string.
Scala
%scala val df = left.join(right, Seq("name"))
%scala val df = left.join(right, "name")
Python
%python df = left.join(right, ["name"])
%python df = left.join(right, "name")
R
First register the DataFrames as tables.
%python left.createOrReplaceTempView("left_test_table") right.createOrReplaceTempView("right_test_table")
%r library(SparkR) sparkR.session() left <- sql("SELECT * FROM left_test_table") right <- sql("SELECT * FROM right_test_table")
The above code results in duplicate columns. The following code does not.
%r head(drop(join(left, right, left$name == right$name), left$name))
Join DataFrames with duplicated columns notebook
Review the Join DataFrames with duplicated columns example notebook.