Error when downloading full results after join

Problem

You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error.

Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/

Reproduce error

  1. Create two tables.

    from pyspark.sql.functions import *
    
    df = spark.range(12000)
    df = df.withColumn("col2",lit("test"))
    df.createOrReplaceTempView("table1")
    
    df1 = spark.range(5)
    df1.createOrReplaceTempView("table2")
    
  2. Perform left outer join on the tables.

    select * from table1 t1 left join table2 t2 on t1.id = t2.id
    
  3. Click Download preview. A CSV file downloads.

  4. Click Download full results. An error is generated.

Cause

  • Download preview works because this is a frontend only operation that runs in the browser. No constraints are checked and only 1000 rows are included in the CSV file.
  • Download full results re-executes the query in Apache Spark and writes the CSV file internally. The error occurs when duplicate columns are found after a join operation.

Solution

Option 1

If you select all the required columns, and avoid duplicate columns after the join operation, you will not get the error and can download the full result.

select t1.id, t1.col2 from table1 t1 left join table2 t2 on t1.id = t2.id

Option 2

You can use DataFrames to prevent duplicated columns. If there are no duplicated columns after the join operation, you will not get the error and can download the full result.

result_df = df.join(df1, ["id"],"left")
display(result_df)