Updated April 20th, 2023 by Jose Gonzalez

Cannot select a Databricks Runtime version when using a Delta Live Tables pipeline

Problem You want to select a specific Databricks Runtime version for use with your Delta Live Tables (DLT) pipeline, but you cannot find an option for it in the UI or the API. Cause Delta Live Tables does not allow you to directly configure the Databricks Runtime version. Delta Live Tables clusters run on a custom version of the Databricks Runtime t...

0 min reading time
Updated October 12th, 2022 by Jose Gonzalez

Explicit path to data or a defined schema required for Auto Loader

Info This article applies to Databricks Runtime 9.1 LTS and above. Problem You are using Auto Loader to ingest data for your ELT pipeline when you get an IllegalArgumentException: Please provide the source directory path with option `path` error message. You get this error when you start an Auto Loader job, if either the path to the data or the data...

1 min reading time
Updated May 10th, 2022 by Jose Gonzalez

Job cluster limits on notebook output

Problem You are running a notebook on a job cluster and you get an error message indicating that the output is too large. The output of the notebook is too large. Cause: rpc response (of 20975548 bytes) exceeds limit of 20971520 bytes Cause This error message can occur in a job cluster whenever the notebook output is greater than 20 MB. If you are u...
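
The numbers in the error message above can be checked with a little arithmetic. The following is a minimal sketch, not code from the article: the two byte counts are copied from the quoted error, and the helper name `exceeds_limit` is a hypothetical one chosen for illustration.

```python
# Both values are copied from the error message quoted in the excerpt.
LIMIT_BYTES = 20971520      # the limit reported in the error
RESPONSE_BYTES = 20975548   # the rpc response size reported in the error


def exceeds_limit(size_bytes: int, limit_bytes: int = LIMIT_BYTES) -> bool:
    """Return True when an output of size_bytes is over the limit (hypothetical helper)."""
    return size_bytes > limit_bytes


# Express the limit in mebibytes and compute how far over the response is.
limit_mib = LIMIT_BYTES / (1024 * 1024)
overflow = RESPONSE_BYTES - LIMIT_BYTES
print(f"limit = {limit_mib:.0f} MiB, over by {overflow} bytes")
```

The limit works out to exactly 20 x 1024 x 1024 bytes, which is the 20 MB figure the article mentions, and the response in this error is over it by 4028 bytes.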

0 min reading time
Updated May 19th, 2022 by Jose Gonzalez

Streaming job gets stuck writing to checkpoint

Problem You are monitoring a streaming job, and notice that it appears to get stuck when processing data. When you review the logs, you discover the job gets stuck when writing data to a checkpoint. INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/st...

0 min reading time
Updated February 21st, 2024 by Jose Gonzalez

Hive-style partitions not found on Delta table after enabling column mapping mode

Problem You want to partition your Delta table on the date value. This creates subfolders for each partition, in the root path of the Delta table. For example, date=2023-01-01, date=2023-01-02, etc. You enable Delta Lake column mapping, but when you try to list the subfolders, the names are not what you expect (date=2023-01-01) because those date pa...

0 min reading time
Updated April 1st, 2022 by Jose Gonzalez

Apache Spark session is null in DBConnect

Problem You are trying to run your code using Databricks Connect ( AWS  |  Azure  |  GCP ) when you get a sparkSession is null error message. java.lang.AssertionError: assertion failed: sparkSession is null while trying to executeCollectResult at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(...

1 min reading time
Updated May 10th, 2023 by Jose Gonzalez

Delta Live Tables job fails when using collect()

Problem You are using collect() in your Delta Live Tables (DLT) pipeline code and you get an error. When you review the stack trace, you see a DataFrame.collect error that says the function is going to be deprecated soon. "message": "Notebook:/path/to/your/notebook used `DataFrame.collect` function that will be deprecated soon. Please fix the notebo...

0 min reading time
Updated May 23rd, 2022 by Jose Gonzalez

Write a DataFrame with missing columns to a Redshift table

Problem When writing to Redshift tables, if the target table has more columns than the source Apache Spark DataFrame you may get a copy error. The COPY failed with error: [Amazon][Amazon Redshift] (1203) Error occurred while trying to execute a query: ERROR: Load into table table-name failed. Check the 'stl_load_errors' system table for details. “12...

0 min reading time
Updated May 10th, 2022 by Jose Gonzalez

Converting from Parquet to Delta Lake fails

Problem You are attempting to convert a Parquet file to a Delta Lake file. The directory containing the Parquet file contains one or more subdirectories. The conversion fails with the error message: Expecting 0 partition column(s): [], but found 1 partition column(s): [<column_name>] from parsing the file name: <path_to_the_file_location>...

0 min reading time
Updated February 22nd, 2024 by Jose Gonzalez

List all available tables and their source formats in Unity Catalog

You may want to get a list of all the Delta tables and non-Delta tables available in your Unity Catalog instance. You can use these sample SQL queries to get the table names and the corresponding data source format. Instructions Info Make sure you have permission to access Unity Catalog. You will not be able to view information on tables if you don't ...

0 min reading time
Updated May 24th, 2022 by Jose Gonzalez

SHOW DATABASES command returns unexpected column name

Problem You are using the SHOW DATABASES command and it returns an unexpected column name. Cause The column name returned by the SHOW DATABASES command changed in Databricks Runtime 7.0. Databricks Runtime 6.4 Extended Support and below: SHOW DATABASES returns namespace as the column name. Databricks Runtime 7.0 and above: SHOW DATABASES returns dat...
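
Because the column name depends on the runtime version, one way to write version-independent code is to read the result by position instead of by name. The following is a rough sketch under that assumption, not the fix the article itself recommends: the helper `first_column` and the sample database names are hypothetical, and plain tuples stand in for the rows a real query would return.

```python
from typing import Iterable, List, Sequence


def first_column(rows: Iterable[Sequence]) -> List[str]:
    """Return the first field of every row, ignoring the column's name (hypothetical helper)."""
    return [row[0] for row in rows]


# Stand-in rows with made-up database names; plain tuples are used so the
# sketch runs without a Spark session.
sample_rows = [("default",), ("sales",)]
print(first_column(sample_rows))
```

In PySpark, the rows returned by `spark.sql("SHOW DATABASES").collect()` are `Row` objects, which support the same positional indexing, so the helper could be applied to them unchanged.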

0 min reading time
Updated May 23rd, 2022 by Jose Gonzalez

Manage the size of Delta tables

Delta tables differ from traditional tables. Delta tables include ACID transactions and time travel features, which means they maintain transaction logs and stale data files. These additional features require storage space. In this article we discuss recommendations that can help you manage the size of your Delta tables. Enable file system ve...

1 min reading time
Updated May 31st, 2022 by Jose Gonzalez

Delete table when underlying S3 bucket is deleted

Problem You are trying to drop or alter a table when you get an error. Error in SQL statement: IOException: Bucket_name … does not exist You can reproduce the error with a DROP TABLE or ALTER TABLE command. %sql DROP TABLE <database-name.table-name>; %sql ALTER TABLE <database-name.table-name> SET LOCATION "<file-system-location>";...

0 min reading time