Error when downloading full results after join

If you have duplicate columns after a join, you will get an error when trying to download the full results.

Written by manjunath.swamy

Last published at: May 23rd, 2022

Problem

You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error.

Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/

Reproduce error

  1. Create two tables.
    %python
    
    from pyspark.sql.functions import *
    
    df = spark.range(12000)
    df = df.withColumn("col2",lit("test"))
    df.createOrReplaceTempView("table1")
    
    df1 = spark.range(5)
    df1.createOrReplaceTempView("table2")
  2. Perform left outer join on the tables.
    %sql
    
    select * from table1 t1 left join table2 t2 on t1.id = t2.id
  3. Click Download preview. A CSV file downloads.
  4. Click Download full results. An error is generated.

Cause

  • Download preview works because this is a frontend only operation that runs in the browser. No constraints are checked and only 1000 rows are included in the CSV file.
  • Download full results re-executes the query in Apache Spark and writes the CSV file internally. The error occurs when duplicate columns are found after a join operation.

Solution

Option 1

If you select all the required columns, and avoid duplicate columns after the join operation, you will not get the error and can download the full result.

%sql

select t1.id, t1.col2 from table1 t1 left join table2 t2 on t1.id = t2.id

Option 2

You can use DataFrames to prevent duplicated columns. If there are no duplicated columns after the join operation, you will not get the error and can download the full result.

%python

result_df = df.join(df1, ["id"],"left")
display(result_df)