Broadcast join exceeds threshold, returns out of memory error
Problem You are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table. Despite the total size exceeding the limit set by spark.sql.autoBroadcastJoinThreshold, BroadcastHashJoin is used and Apache Spark returns an OutOfMemorySparkException error. org.apache.spark.sql.execution.OutO...
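A hedged sketch of working around the behavior: setting the threshold to -1 disables automatic broadcasting so Spark can fall back to a sort-merge join. The table names below are hypothetical.
%python
# Disable automatic broadcast joins entirely (hypothetical table names below).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
joined = spark.table("large_table_a").join(spark.table("large_table_b"), "id")
joined.explain()  # confirm BroadcastHashJoin no longer appears in the physical plan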
Cannot grow BufferHolder; exceeds size limitation
Problem Your Apache Spark job fails with an IllegalArgumentException: Cannot grow BufferHolder error. java.lang.IllegalArgumentException: Cannot grow BufferHolder by size XXXXXXXXX because the size after growing exceeds size limitation 2147483632 Cause BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB). If a column value exceed...
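A hedged diagnostic sketch: before the job fails, check how large the biggest value in a suspect string column is. The DataFrame and column name are assumptions.
%python
from pyspark.sql.functions import length, max as max_
# df is an assumed DataFrame; "payload" is a hypothetical string column.
df.select(max_(length("payload")).alias("max_chars")).show()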
Date functions only accept int values in Apache Spark 3.0
Problem You are attempting to use the date_add() or date_sub() functions in Spark 3.0, but they return an Error in SQL statement: AnalysisException error message. In Spark 2.4 and below, both functions work normally. %sql select date_add(cast('1964-05-23' as date), '12.34') Cause You are attempting to use a fractional or string value as the ...
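Since Spark 3.0 requires an integer for the second argument, the failing query above works once the fractional string is replaced with an int, for example:
%python
# Whole number of days instead of the string '12.34'.
spark.sql("SELECT date_add(CAST('1964-05-23' AS DATE), 12)").show()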
Disable broadcast when query plan has BroadcastNestedLoopJoin
This article explains how to disable broadcast when the query plan has BroadcastNestedLoopJoin in the physical plan. You expect the broadcast to stop after you disable the broadcast threshold by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark tries to broadcast the bigger table and fails with a broadcast error. This behavior is...
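A minimal sketch for confirming the plan, assuming two hypothetical tables and a non-equi join condition (a typical source of BroadcastNestedLoopJoin):
%python
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
t1 = spark.table("t1")  # hypothetical tables
t2 = spark.table("t2")
# Non-equi conditions cannot use a hash join, so the plan may still broadcast.
t1.join(t2, t1["a"] > t2["b"]).explain()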
Duplicate columns in the metadata error
Problem Your Apache Spark job is processing a Delta table when the job fails with an error message. org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata update: col1, col2... Cause There are duplicate column names in the Delta table. Column names that differ only by case are considered duplicates. Delta Lake is case prese...
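A minimal sketch of the failure mode and one way around it, with hypothetical names; toDF reassigns column names positionally, avoiding the ambiguous reference:
%python
df = spark.createDataFrame([(1, 2)], ["col1", "COL1"])  # names differ only by case
fixed = df.toDF("col1", "col1_upper")                   # make every name unambiguous
fixed.write.format("delta").mode("overwrite").save("dbfs:/tmp/example")  # hypothetical path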
Generate unique increasing numeric values
This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three different methods; select the one that works best for your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD) The zipWithIndex() function is only available within RDDs. You cannot...
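A minimal sketch of the RDD round-trip that zipWithIndex() requires (the input DataFrame here is a stand-in):
%python
df = spark.range(5).toDF("value")  # stand-in input DataFrame
indexed = (df.rdd
             .zipWithIndex()                    # pairs of (Row, index)
             .map(lambda p: p[0] + (p[1],))     # append the index to each row
             .toDF(df.columns + ["index"]))
indexed.show()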
Error in SQL statement: AnalysisException: Table or view not found
Problem When you try to query a table or view, you get this error: AnalysisException:Table or view not found when trying to query a global temp view Cause You typically create global temp views so they can be accessed from different sessions and kept alive until the application ends. You can create a global temp view with the following statement: %s...
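Global temp views are registered in the global_temp database, so queries must use the qualified name; a minimal sketch:
%python
df = spark.range(3)
df.createOrReplaceGlobalTempView("my_view")
# Omitting the global_temp prefix raises "Table or view not found".
spark.sql("SELECT * FROM global_temp.my_view").show()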
Error when downloading full results after join
Problem You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error. Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/ Reproduce error Create two tables. %python from pyspark.sql.functions impo...
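A common source of the duplicate columns is joining on an expression, which keeps the key column from both sides; joining on a list of column names keeps a single copy. Here df1, df2, and the key are assumptions:
%python
# df1 and df2 are assumed DataFrames that share an "id" column.
# df1.join(df2, df1["id"] == df2["id"])  # keeps two "id" columns; download fails
deduped = df1.join(df2, ["id"])          # keeps a single "id" column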
Error when running MSCK REPAIR TABLE in parallel
Problem You are trying to run MSCK REPAIR TABLE <table-name> commands for the same table in parallel and are getting java.net.SocketTimeoutException: Read timed out or out of memory error messages. Cause When you try to add a large number of new partitions to a table with MSCK REPAIR in parallel, the Hive metastore becomes a limiting factor, a...
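One hedged mitigation is to run the repairs sequentially from a single job instead of in parallel; the table names below are placeholders:
%python
for table in ["db.sales", "db.events"]:      # placeholder table names
    spark.sql(f"MSCK REPAIR TABLE {table}")  # one command at a time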
Find the size of a table
This article explains how to find the size of a table. The command used depends on whether you are trying to find the size of a Delta table or a non-Delta table. Size of a Delta table To find the size of a Delta table, you can use an Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...
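For a Delta table, the same number is also available from Python via DESCRIBE DETAIL, which returns a sizeInBytes column (the path below is a placeholder):
%python
detail = spark.sql("DESCRIBE DETAIL delta.`dbfs:/path/to/table`")  # placeholder path
detail.select("sizeInBytes").show()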
Inner join drops records in result
Problem You perform an inner join, but the resulting joined table is missing data. For example, assume you have two tables, orders and models. %python df_orders = spark.createDataFrame([('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima',''), ('Nissan','Altima', None)], ["Company", "Model", "Info"]...
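Rows whose join key is NULL never satisfy an ordinary equality condition, so they disappear from an inner join; if they must be kept, a null-safe comparison such as eqNullSafe is one option. Here df_models is an assumed second table:
%python
# df_models is an assumed table with a matching "Info" column.
joined = df_orders.join(df_models, df_orders["Info"].eqNullSafe(df_models["Info"]), "inner")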
Data is incorrect when read from Snowflake
Problem You have a job that is using Apache Spark to read from a Snowflake table, but the time data that appears in the DataFrame is incorrect. If you run the same query directly on Snowflake, the correct time data is returned. Cause The time zone value was not correctly set. A mismatch between the time zone value of the Databricks cluster and Snowf...
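A hedged sketch of pinning both sides to the same zone; sfTimezone is a documented Snowflake connector option, while the connection details are placeholders:
%python
spark.conf.set("spark.sql.session.timeZone", "UTC")  # pin the Spark session zone
df = (spark.read
        .format("snowflake")
        .options(**sf_options)          # placeholder connection options
        .option("sfTimezone", "UTC")    # keep Snowflake on the same zone
        .option("dbtable", "MY_TABLE")  # placeholder table
        .load())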
JDBC write fails with a PrimaryKeyViolation error
Problem You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error. Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables. Cause When Apache Spark performs a JDBC write, one par...
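Because each partition (and any retried task) writes independently, one hedged pattern is to de-duplicate on the key and land the rows in a staging table, merging into the constrained table on the database side; every name below is a placeholder:
%python
(df.dropDuplicates(["id"])              # placeholder primary-key column
   .write
   .format("jdbc")
   .option("url", jdbc_url)             # placeholder connection URL
   .option("dbtable", "staging_table")  # merge from here into the target table
   .mode("append")
   .save())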
Query does not skip header row on external table
Problem You are attempting to query an external Hive table, but it keeps failing to skip the header row, even though TBLPROPERTIES ('skip.header.line.count'='1') is set in the HiveContext. You can reproduce the issue by creating a table with this sample code. %sql CREATE EXTERNAL TABLE school_test_score ( `school` varchar(254), `student_id` varc...
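As a hedged workaround sketch, the underlying files can be read with the Spark CSV reader, which consumes the header row itself (the path is a placeholder):
%python
df = (spark.read
        .option("header", True)        # skip the header row
        .option("inferSchema", True)
        .csv("dbfs:/path/to/school_test_score"))  # placeholder file location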
SHOW DATABASES command returns unexpected column name
Problem You are using the SHOW DATABASES command and it returns an unexpected column name. Cause The column name returned by the SHOW DATABASES command changed in Databricks Runtime 7.0. Databricks Runtime 6.4 Extended Support and below: SHOW DATABASES returns databaseName as the column name. Databricks Runtime 7.0 and above: SHOW DATABASES returns namespace as the column name...
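Code that must run on both runtimes can resolve the column name at runtime instead of hard-coding it; a minimal sketch:
%python
dbs = spark.sql("SHOW DATABASES")
name_col = "databaseName" if "databaseName" in dbs.columns else "namespace"
names = [row[name_col] for row in dbs.collect()]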
Cannot view table SerDe properties
Problem You are trying to view the SerDe properties on an Apache Hive table, but SHOW CREATE TABLE just returns the Apache Spark DDL. It does not show the SerDe properties. For example, given this sample code: %sql SHOW CREATE TABLE <table-identifier> You get a result that does not show the SerDe properties. Cause You are using Databricks Runt...
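On Spark 3.1 and above, SHOW CREATE TABLE accepts an AS SERDE clause that returns the Hive DDL, including the SerDe properties; a minimal sketch with a placeholder table name:
%python
spark.sql("SHOW CREATE TABLE my_table AS SERDE").show(truncate=False)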