Broadcast join exceeds threshold, returns out of memory error
Problem You are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table. Despite the total size exceeding the limit set by spark.sql.autoBroadcastJoinThreshold, BroadcastHashJoin is used and Apache Spark returns an OutOfMemorySparkException error. org.apache.spark.sql.execution.OutO...
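A hedged sketch of working around the behavior: setting the threshold to -1 disables automatic broadcasting so Spark can fall back to a sort-merge join. The table names below are hypothetical.
%python
# Disable automatic broadcast joins entirely (hypothetical table names below).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
joined = spark.table("large_table_a").join(spark.table("large_table_b"), "id")
joined.explain()  # confirm BroadcastHashJoin no longer appears in the physical plan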
Cannot grow BufferHolder; exceeds size limitation
Problem Your Apache Spark job fails with an IllegalArgumentException: Cannot grow BufferHolder error. java.lang.IllegalArgumentException: Cannot grow BufferHolder by size XXXXXXXXX because the size after growing exceeds size limitation 2147483632 Cause BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB). If a column value exceed...
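A hedged diagnostic sketch: before the job fails, check how large the biggest value in a suspect string column is. The DataFrame and column name are assumptions.
%python
from pyspark.sql.functions import length, max as max_
# df is an assumed DataFrame; "payload" is a hypothetical string column.
df.select(max_(length("payload")).alias("max_chars")).show()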
Date functions only accept int values in Apache Spark 3.0
Problem You are attempting to use the date_add() or date_sub() functions in Spark 3.0, but they return an Error in SQL statement: AnalysisException error message. In Spark 2.4 and below, both functions work normally. %sql select date_add(cast('1964-05-23' as date), '12.34') Cause You are attempting to use a fractional or string value as the ...
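Since Spark 3.0 requires an integer for the second argument, the failing query above works once the fractional string is replaced with an int, for example:
%python
# Whole number of days instead of the string '12.34'.
spark.sql("SELECT date_add(CAST('1964-05-23' AS DATE), 12)").show()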
Disable broadcast when query plan has BroadcastNestedLoopJoin
This article explains how to disable broadcast when the query plan has BroadcastNestedLoopJoin in the physical plan. You expect the broadcast to stop after you disable the broadcast threshold by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark tries to broadcast the bigger table and fails with a broadcast error. This behavior is...
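A minimal sketch for confirming the plan, assuming two hypothetical tables and a non-equi join condition (a typical source of BroadcastNestedLoopJoin):
%python
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
t1 = spark.table("t1")  # hypothetical tables
t2 = spark.table("t2")
# Non-equi conditions cannot use a hash join, so the plan may still broadcast.
t1.join(t2, t1["a"] > t2["b"]).explain()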
Duplicate columns in the metadata error
Problem Your Apache Spark job is processing a Delta table when the job fails with an error message. org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata update: col1, col2... Cause There are duplicate column names in the Delta table. Column names that differ only by case are considered duplicates. Delta Lake is case prese...
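A minimal sketch of the failure mode and one way around it, with hypothetical names; toDF reassigns column names positionally, avoiding the ambiguous reference:
%python
df = spark.createDataFrame([(1, 2)], ["col1", "COL1"])  # names differ only by case
fixed = df.toDF("col1", "col1_upper")                   # make every name unambiguous
fixed.write.format("delta").mode("overwrite").save("dbfs:/tmp/example")  # hypothetical path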
Generate unique increasing numeric values
This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three different methods; select the one that works best for your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD) The zipWithIndex() function is only available within RDDs. You cannot...
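A minimal sketch of the RDD round-trip that zipWithIndex() requires (the input DataFrame here is a stand-in):
%python
df = spark.range(5).toDF("value")  # stand-in input DataFrame
indexed = (df.rdd
             .zipWithIndex()                    # pairs of (Row, index)
             .map(lambda p: p[0] + (p[1],))     # append the index to each row
             .toDF(df.columns + ["index"]))
indexed.show()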
Error in SQL statement: AnalysisException: Table or view not found
Problem When you try to query a table or view, you get this error: AnalysisException:Table or view not found when trying to query a global temp view Cause You typically create global temp views so they can be accessed from different sessions and kept alive until the application ends. You can create a global temp view with the following statement: %s...
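Global temp views are registered in the global_temp database, so queries must use the qualified name; a minimal sketch:
%python
df = spark.range(3)
df.createOrReplaceGlobalTempView("my_view")
# Omitting the global_temp prefix raises "Table or view not found".
spark.sql("SELECT * FROM global_temp.my_view").show()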
Error when downloading full results after join
Problem You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error. Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/ Reproduce error Create two tables. %python from pyspark.sql.functions impo...
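A common source of the duplicate columns is joining on an expression, which keeps the key column from both sides; joining on a list of column names keeps a single copy. Here df1, df2, and the key are assumptions:
%python
# df1 and df2 are assumed DataFrames that share an "id" column.
# df1.join(df2, df1["id"] == df2["id"])  # keeps two "id" columns; download fails
deduped = df1.join(df2, ["id"])          # keeps a single "id" column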
Error when running MSCK REPAIR TABLE in parallel
Problem You are trying to run MSCK REPAIR TABLE <table-name> commands for the same table in parallel and are getting java.net.SocketTimeoutException: Read timed out or out of memory error messages. Cause When you try to add a large number of new partitions to a table with MSCK REPAIR in parallel, the Hive metastore becomes a limiting factor, a...
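One hedged mitigation is to run the repairs sequentially from a single job instead of in parallel; the table names below are placeholders:
%python
for table in ["db.sales", "db.events"]:      # placeholder table names
    spark.sql(f"MSCK REPAIR TABLE {table}")  # one command at a time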
Find the size of a table
This article explains how to find the size of a table. The command used depends on whether you are trying to find the size of a Delta table or a non-Delta table. Size of a Delta table To find the size of a Delta table, you can use an Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...
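For a Delta table, the same number is also available from Python via DESCRIBE DETAIL, which returns a sizeInBytes column (the path below is a placeholder):
%python
detail = spark.sql("DESCRIBE DETAIL delta.`dbfs:/path/to/table`")  # placeholder path
detail.select("sizeInBytes").show()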
Inner join drops records in result
Problem You perform an inner join, but the resulting joined table is missing data. For example, assume you have two tables, orders and models. %python df_orders = spark.createDataFrame([('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima',''), ('Nissan','Altima', None)], ["Company", "Model", "Info"]...
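Rows whose join key is NULL never satisfy an ordinary equality condition, so they disappear from an inner join; if they must be kept, a null-safe comparison such as eqNullSafe is one option. Here df_models is an assumed second table:
%python
# df_models is an assumed table with a matching "Info" column.
joined = df_orders.join(df_models, df_orders["Info"].eqNullSafe(df_models["Info"]), "inner")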
Data is incorrect when read from Snowflake
Problem You have a job that is using Apache Spark to read from a Snowflake table, but the time data that appears in the DataFrame is incorrect. If you run the same query directly on Snowflake, the correct time data is returned. Cause The time zone value was not correctly set. A mismatch between the time zone value of the Databricks cluster and Snowf...
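A hedged sketch of pinning both sides to the same zone; sfTimezone is a documented Snowflake connector option, while the connection details are placeholders:
%python
spark.conf.set("spark.sql.session.timeZone", "UTC")  # pin the Spark session zone
df = (spark.read
        .format("snowflake")
        .options(**sf_options)          # placeholder connection options
        .option("sfTimezone", "UTC")    # keep Snowflake on the same zone
        .option("dbtable", "MY_TABLE")  # placeholder table
        .load())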
JDBC write fails with a PrimaryKeyViolation error
Problem You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error. Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables. Cause When Apache Spark performs a JDBC write, one par...
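Because each partition (and any retried task) writes independently, one hedged pattern is to de-duplicate on the key and land the rows in a staging table, merging into the constrained table on the database side; every name below is a placeholder:
%python
(df.dropDuplicates(["id"])              # placeholder primary-key column
   .write
   .format("jdbc")
   .option("url", jdbc_url)             # placeholder connection URL
   .option("dbtable", "staging_table")  # merge from here into the target table
   .mode("append")
   .save())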
Query does not skip header row on external table
Problem You are attempting to query an external Hive table, but it keeps failing to skip the header row, even though TBLPROPERTIES ('skip.header.line.count'='1') is set in the HiveContext. You can reproduce the issue by creating a table with this sample code. %sql CREATE EXTERNAL TABLE school_test_score ( `school` varchar(254), `student_id` varc...
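As a hedged workaround sketch, the underlying files can be read with the Spark CSV reader, which consumes the header row itself (the path is a placeholder):
%python
df = (spark.read
        .option("header", True)        # skip the header row
        .option("inferSchema", True)
        .csv("dbfs:/path/to/school_test_score"))  # placeholder file location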
SHOW DATABASES command returns unexpected column name
Problem You are using the SHOW DATABASES command and it returns an unexpected column name. Cause The column name returned by the SHOW DATABASES command changed in Databricks Runtime 7.0. Databricks Runtime 6.4 Extended Support and below: SHOW DATABASES returns databaseName as the column name. Databricks Runtime 7.0 and above: SHOW DATABASES returns namespace as the column name...
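Code that must run on both runtimes can resolve the column name at runtime instead of hard-coding it; a minimal sketch:
%python
dbs = spark.sql("SHOW DATABASES")
name_col = "databaseName" if "databaseName" in dbs.columns else "namespace"
names = [row[name_col] for row in dbs.collect()]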
Cannot view table SerDe properties
Problem You are trying to view the SerDe properties on an Apache Hive table, but SHOW CREATE TABLE just returns the Apache Spark DDL. It does not show the SerDe properties. For example, given this sample code: %sql SHOW CREATE TABLE <table-identifier> You get a result that does not show the SerDe properties. Cause You are using Databricks Runt...
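On Spark 3.1 and above, SHOW CREATE TABLE accepts an AS SERDE clause that returns the Hive DDL, including the SerDe properties; a minimal sketch with a placeholder table name:
%python
spark.sql("SHOW CREATE TABLE my_table AS SERDE").show(truncate=False)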