Problem
You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error.
Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/
Reproduce error
- Create two tables.
%python from pyspark.sql.functions import * df = spark.range(12000) df = df.withColumn("col2",lit("test")) df.createOrReplaceTempView("table1") df1 = spark.range(5) df1.createOrReplaceTempView("table2")
- Perform left outer join on the tables.
%sql select * from table1 t1 left join table2 t2 on t1.id = t2.id
- Click Download preview. A CSV file downloads.
- Click Download full results. An error is generated.
Cause
- Download preview works because this is a frontend only operation that runs in the browser. No constraints are checked and only 1000 rows are included in the CSV file.
- Download full results re-executes the query in Apache Spark and writes the CSV file internally. The error occurs when duplicate columns are found after a join operation.
Solution
Option 1
If you select all the required columns, and avoid duplicate columns after the join operation, you will not get the error and can download the full result.
%sql select t1.id, t1.col2 from table1 t1 left join table2 t2 on t1.id = t2.id
Option 2
You can use DataFrames to prevent duplicated columns. If there are no duplicated columns after the join operation, you will not get the error and can download the full result.
%python result_df = df.join(df1, ["id"],"left") display(result_df)