Databricks Knowledge Base


SQL with Apache Spark (AWS)

These articles can help you use SQL with Apache Spark.

15 Articles in this category


Broadcast join exceeds threshold, returns out of memory error

Problem You are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table. Despite the total size exceeding the limit set by spark.sql.autoBroadcastJoinThreshold, BroadcastHashJoin is used and Apache Spark returns an OutOfMemorySparkException error. org.apache.spark.sql.execution.OutO...
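When the broadcast attempt runs out of memory, a common workaround (sketched below with hypothetical table names) is to disable automatic broadcasting so Spark falls back to a shuffle-based join, optionally steering the planner with a join hint:

```sql
-- Disable automatic broadcast joins; Spark then uses a
-- sort-merge join instead of broadcasting a large table.
SET spark.sql.autoBroadcastJoinThreshold = -1;

-- Hypothetical tables; project only the columns you need.
SELECT /*+ MERGE(t2) */ t1.id, t1.name, t2.*
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id;
```

Setting the threshold to -1 affects the whole session; the `MERGE` hint limits the choice to a single query.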

Last updated: May 23rd, 2022 by sandeep.chandran

Cannot grow BufferHolder; exceeds size limitation

Problem Your Apache Spark job fails with an IllegalArgumentException: Cannot grow BufferHolder error. java.lang.IllegalArgumentException: Cannot grow BufferHolder by size XXXXXXXXX because the size after growing exceeds size limitation 2147483632 Cause BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB). If a column value exceed...

Last updated: May 23rd, 2022 by Adam Pavlacka

Date functions only accept int values in Apache Spark 3.0

Problem You are attempting to use the date_add() or date_sub() functions in Spark 3.0, but they are returning an Error in SQL statement: AnalysisException error message. In Spark 2.4 and below, both functions work as normal. %sql select date_add(cast('1964-05-23' as date), '12.34') Cause You are attempting to use a fractional or string value as the ...
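The contrast can be seen directly in the example from the article: the Spark 3.0 versions of these functions reject non-integer second arguments, so pass a whole number of days instead.

```sql
-- Fails in Spark 3.0: the second argument must be an integer.
-- SELECT date_add(CAST('1964-05-23' AS DATE), '12.34');

-- Works: pass an integer number of days.
SELECT date_add(CAST('1964-05-23' AS DATE), 12);
```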

Last updated: May 23rd, 2022 by Adam Pavlacka

Disable broadcast when query plan has BroadcastNestedLoopJoin

This article explains how to disable broadcast when the query plan has BroadcastNestedLoopJoin in the physical plan. You expect the broadcast to stop after you disable the broadcast threshold, by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark tries to broadcast the bigger table and fails with a broadcast error. This behavior is...

Last updated: May 23rd, 2022 by Adam Pavlacka

Duplicate columns in the metadata error

Problem Your Apache Spark job is processing a Delta table when the job fails with an error message. org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata update: col1, col2... Cause There are duplicate column names in the Delta table. Column names that differ only by case are considered duplicate. Delta Lake is case prese...
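Because Delta Lake treats names that differ only by case as duplicates, a schema like the hypothetical one below triggers the error; use distinct, consistently cased names instead.

```sql
-- Fails: col1 and COL1 differ only by case, which Delta
-- treats as a duplicate column in the metadata.
-- CREATE TABLE events (col1 STRING, COL1 STRING) USING DELTA;

-- Works: distinct, consistently cased column names.
CREATE TABLE events (col1 STRING, col2 STRING) USING DELTA;
```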

Last updated: May 23rd, 2022 by vikas.yadav

Generate unique increasing numeric values

This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three different methods to use. You should select the method that works best with your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD) The zipWithIndex() function is only available within RDDs. You cannot...
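Alongside the RDD-based zipWithIndex() approach the article covers, two of the options are available directly in Spark SQL; the sketch below assumes a hypothetical table `some_table`.

```sql
-- monotonically_increasing_id(): unique and increasing,
-- but the values are not consecutive across partitions.
SELECT monotonically_increasing_id() AS id, * FROM some_table;

-- row_number(): consecutive values, but without PARTITION BY
-- all rows are pulled through a single partition.
SELECT row_number() OVER (ORDER BY some_column) AS id, * FROM some_table;
```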

Last updated: May 23rd, 2022 by ram.sankarasubramanian

Error in SQL statement: AnalysisException: Table or view not found

Problem When you try to query a table or view, you get this error: AnalysisException: Table or view not found when trying to query a global temp view Cause You typically create global temp views so they can be accessed from different sessions and kept alive until the application ends. You can create a global temp view with the following statement: %s...
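Global temp views are registered in the reserved `global_temp` database, so querying one without that qualifier raises exactly this error. A minimal sketch:

```sql
-- Create a global temp view (hypothetical name).
CREATE GLOBAL TEMPORARY VIEW my_view AS SELECT 1 AS id;

-- Fails with "Table or view not found":
-- SELECT * FROM my_view;

-- Works: qualify with the global_temp database.
SELECT * FROM global_temp.my_view;
```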

Last updated: May 23rd, 2022 by Adam Pavlacka

Error when downloading full results after join

Problem You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error. Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/ Reproduce error Create two tables.%python from pyspark.sql.functions impo...
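When both sides of the join carry a column with the same name, the result contains duplicate columns and cannot be written out for download. Selecting or aliasing the overlapping columns explicitly avoids this; the tables and columns below are hypothetical.

```sql
-- Alias the duplicated column so the result set has unique names.
SELECT o.id, o.amount, c.amount AS credit_amount
FROM orders o
JOIN credits c ON o.id = c.id;
```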

Last updated: May 23rd, 2022 by manjunath.swamy

Error when running MSCK REPAIR TABLE in parallel

Problem You are trying to run MSCK REPAIR TABLE <table-name> commands for the same table in parallel and are getting java.net.SocketTimeoutException: Read timed out or out of memory error messages. Cause When you try to add a large number of new partitions to a table with MSCK REPAIR in parallel, the Hive metastore becomes a limiting factor, a...
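When the new partitions are known in advance, one way to reduce pressure on the Hive metastore is to add them explicitly instead of running MSCK REPAIR TABLE concurrently; the table name and partition values below are hypothetical.

```sql
-- Register known partitions directly, without a full repair scan.
ALTER TABLE sales ADD IF NOT EXISTS
  PARTITION (dt = '2022-05-01')
  PARTITION (dt = '2022-05-02');
```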

Last updated: May 23rd, 2022 by ashritha.laxminarayana

Find the size of a table

This article explains how to find the size of a table. The command used depends on whether you are trying to find the size of a Delta table or a non-Delta table. Size of a Delta table To find the size of a Delta table, you can use an Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...
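Besides the Scala DeltaLog approach in the article, the size can also be inspected from SQL; the table names below are hypothetical.

```sql
-- Delta table: DESCRIBE DETAIL reports sizeInBytes directly.
DESCRIBE DETAIL my_delta_table;

-- Non-Delta table: compute statistics, then inspect them.
ANALYZE TABLE my_table COMPUTE STATISTICS NOSCAN;
DESCRIBE EXTENDED my_table;
```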

Last updated: May 23rd, 2022 by mathan.pillai

Inner join drops records in result

Problem You perform an inner join, but the resulting joined table is missing data. For example, assume you have two tables, orders and models. %python df_orders = spark.createDataFrame([('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima',''), ('Nissan','Altima', None)], ["Company", "Model", "Info"]...
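Rows with NULL join keys are a typical cause: an ordinary equality join drops them because `NULL = NULL` does not evaluate to true in SQL. The null-safe equality operator keeps those rows; the sketch reuses the table names from the example above.

```sql
-- <=> is null-safe equality: NULL <=> NULL is true,
-- so rows with NULL keys are not silently dropped.
SELECT *
FROM df_orders o
JOIN df_models m
  ON o.Company <=> m.Company AND o.Info <=> m.Info;
```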

Last updated: May 23rd, 2022 by siddharth.panchal

Data is incorrect when read from Snowflake

Problem You have a job that is using Apache Spark to read from a Snowflake table, but the time data that appears in the DataFrame is incorrect. If you run the same query directly on Snowflake, the correct time data is returned. Cause The time zone value was not correctly set. A mismatch between the time zone value of the Databricks cluster and Snowf...
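One way to avoid the mismatch is to pin the Spark session time zone so the cluster and Snowflake agree on how TIMESTAMP values are interpreted:

```sql
-- Fix the session time zone for timestamp interpretation.
SET spark.sql.session.timeZone = UTC;
```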

Last updated: May 24th, 2022 by DD Sharma

JDBC write fails with a PrimaryKeyViolation error

Problem You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error. Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables. Cause When Apache Spark performs a JDBC write, one par...

Last updated: May 24th, 2022 by harikrishnan.kunhumveettil

Query does not skip header row on external table

Problem You are attempting to query an external Hive table, but it keeps failing to skip the header row, even though TBLPROPERTIES ('skip.header.line.count'='1') is set in the HiveContext. You can reproduce the issue by creating a table with this sample code. %sql CREATE EXTERNAL TABLE school_test_score (   `school` varchar(254),   `student_id` varc...

Last updated: May 24th, 2022 by manisha.jena

SHOW DATABASES command returns unexpected column name

Problem You are using the SHOW DATABASES command and it returns an unexpected column name. Cause The column name returned by the SHOW DATABASES command changed in Databricks Runtime 7.0. Databricks Runtime 6.4 Extended Support and below: SHOW DATABASES returns namespace as the column name. Databricks Runtime 7.0 and above: SHOW DATABASES returns dat...

Last updated: May 24th, 2022 by Jose Gonzalez


© Databricks 2022. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.

