Manage the size of Delta tables
Delta tables are different from traditional tables. Delta tables include ACID transactions and time travel features, which means they maintain transaction logs and retain stale data files. These additional features require storage space. In this article we discuss recommendations that can help you manage the size of your Delta tables. Enable file system ve...
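A minimal sketch of the kind of cleanup the article covers, assuming a hypothetical table name and the default 7-day retention window:

```python
# Hedged sketch: reclaim space on a Delta table. The table name is hypothetical and
# the retention window shown is the 7-day default.
spark.sql("VACUUM main.default.events RETAIN 168 HOURS")  # remove stale data files

# Optionally shorten how long transaction log entries are kept
# (assumption: 30 days covers your time travel needs).
spark.sql("""
    ALTER TABLE main.default.events
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')
""")
```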
Explicit path to data or a defined schema required for Auto Loader
Info This article applies to Databricks Runtime 9.1 LTS and above. Problem You are using Auto Loader to ingest data for your ELT pipeline when you get an IllegalArgumentException error message: Please provide the source directory path with option `path`. You get this error when you start an Auto Loader job if either the path to the data or the data...
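A minimal Auto Loader sketch supplying the two pieces the error asks for; the bucket, format, and schema location below are placeholders:

```python
# Hedged sketch: an Auto Loader read that provides both a source path and a schema
# location. Bucket and paths are hypothetical.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/landing")
    .load("s3://my-bucket/landing/")  # the `path` the exception refers to
)
```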
Hive-style partitions not found on Delta table after enabling column mapping mode
Problem You want to partition your Delta table on the date value. This creates subfolders for each partition in the root path of the Delta table. For example, date=2023-01-01, date=2023-01-02, etc. You enable Delta Lake column mapping, but when you try to list the subfolders, the names are not what you expect (date=2023-01-01) because those date pa...
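One way to keep working with partition values without relying on the physical directory names (a suggestion here, not necessarily the article's fix) is to go through the table itself; a sketch with a hypothetical table name:

```python
# Hedged sketch: query partition values through the table instead of listing folders,
# since column mapping changes the physical directory names. Table name is hypothetical.
spark.sql("SHOW PARTITIONS main.default.sales").show()

# Partition pruning still works on the logical column name.
spark.table("main.default.sales").where("date = '2023-01-01'").count()
```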
Delete table when underlying S3 bucket is deleted
Problem You are trying to drop or alter a table when you get an error. Error in SQL statement: IOException: Bucket_name … does not exist You can reproduce the error with a DROP TABLE or ALTER TABLE command.
%sql DROP TABLE <database-name.table-name>;
%sql ALTER TABLE <database-name.table-name> SET LOCATION "<file-system-location>";...
List all available tables and their source formats in Unity Catalog
You may want to get a list of all the Delta tables and non-Delta tables available in your Unity Catalog instance. You can use these sample SQL queries to get table names and the corresponding data source format. Instructions Info Make sure you have permission to access Unity Catalog. You will not be able to view information on tables if you don't ...
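A sketch of the kind of query the article describes, assuming Unity Catalog's built-in information schema and a hypothetical catalog name:

```python
# Hedged sketch: list each table and its data source format from Unity Catalog's
# information schema. Replace my_catalog with a catalog you can read.
spark.sql("""
    SELECT table_catalog, table_schema, table_name, data_source_format
    FROM my_catalog.information_schema.tables
    ORDER BY table_schema, table_name
""").show(truncate=False)
```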
Streaming job gets stuck writing to checkpoint
Problem You are monitoring a streaming job, and notice that it appears to get stuck when processing data. When you review the logs, you discover the job gets stuck when writing data to a checkpoint. INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/st...
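For context, a minimal streaming write showing where the checkpoint directory in the log above comes from; the source, sink table, and path are placeholders, not the article's pipeline:

```python
# Hedged sketch: the checkpointLocation option names the directory that the state
# store and offsets in the log above are written to. Source, sink, and paths are
# placeholders.
events = spark.readStream.format("rate").load()

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "dbfs:/FileStore/R_CHECKPOINT5/")
    .toTable("main.default.rate_sink")
)
```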
Job cluster limits on notebook output
Problem You are running a notebook on a job cluster and you get an error message indicating that the output is too large. The output of the notebook is too large. Cause: rpc response (of 20975548 bytes) exceeds limit of 20971520 bytes Cause This error message can occur in a job cluster whenever the notebook output is greater than 20 MB. If you are u...
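A hedged workaround sketch (not necessarily the article's fix): persist the full result to a table and return only a small preview to the notebook output; the table name is a placeholder:

```python
# Hedged sketch: keep the notebook output small by writing the full result to a table
# and only displaying a sample. Table name is hypothetical.
big_df = spark.range(0, 10_000_000)
big_df.write.mode("overwrite").saveAsTable("main.default.big_results")
display(big_df.limit(10))  # stays well under the 20 MB output limit
```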
Converting from Parquet to Delta Lake fails
Problem You are attempting to convert a Parquet file to a Delta Lake file. The directory containing the Parquet file contains one or more subdirectories. The conversion fails with the error message: Expecting 0 partition column(s): [], but found 1 partition column(s): [<column_name>] from parsing the file name: <path_to_the_file_location>...
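A sketch of the conversion with the partition schema spelled out, which is what the error is asking for; the path and column name are hypothetical:

```python
# Hedged sketch: tell CONVERT TO DELTA about the Hive-style partition column so it
# doesn't expect an unpartitioned directory. Path and column name are placeholders.
spark.sql("""
    CONVERT TO DELTA parquet.`s3://my-bucket/events/`
    PARTITIONED BY (event_date DATE)
""")
```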
Write a DataFrame with missing columns to a Redshift table
Problem When writing to Redshift tables, if the target table has more columns than the source Apache Spark DataFrame, you may get a COPY error. The COPY failed with error: [Amazon][Amazon Redshift] (1203) Error occurred while trying to execute a query: ERROR: Load into table table-name failed. Check the 'stl_load_errors' system table for details. “12...
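One way to avoid the mismatch is to pad the DataFrame with typed nulls for the columns it is missing before the write; a sketch with hypothetical column names and types:

```python
from pyspark.sql import functions as F

# Hedged sketch: add the target table's missing columns as typed nulls so the COPY
# column counts line up. Columns and types below are placeholders.
df = spark.createDataFrame([(1, "a")], ["id", "name"])

for col_name, col_type in [("created_at", "timestamp"), ("score", "double")]:
    if col_name not in df.columns:
        df = df.withColumn(col_name, F.lit(None).cast(col_type))

# df now carries every column the Redshift table expects; write it with your usual
# Redshift connector options.
```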
Cannot select a Databricks Runtime version when using a Delta Live Tables pipeline
Problem You want to select a specific Databricks Runtime version for use with your Delta Live Tables (DLT) pipeline, but you cannot find an option for it in the UI or the API. Cause Delta Live Tables do not allow you to directly configure the Databricks Runtime version. Delta Live Tables clusters run on a custom version of the Databricks Runtime t...
SHOW DATABASES command returns unexpected column name
Problem You are using the SHOW DATABASES command and it returns an unexpected column name. Cause The column name returned by the SHOW DATABASES command changed in Databricks Runtime 7.0. Databricks Runtime 6.4 Extended Support and below: SHOW DATABASES returns namespace as the column name. Databricks Runtime 7.0 and above: SHOW DATABASES returns dat...
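A small sketch that sidesteps the rename by reading the column positionally rather than by name:

```python
# Hedged sketch: read the first column by position so the code works whichever
# column name (namespace or databaseName) the runtime returns.
dbs = spark.sql("SHOW DATABASES")
database_names = [row[0] for row in dbs.collect()]
print(database_names)
```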
Delta Live Tables job fails when using collect()
Problem You are using collect() in your Delta Live Tables (DLT) pipeline code and you get an error. When you review the stack trace, you see a DataFrame.collect error that says the function is going to be deprecated soon. "message": "Notebook:/path/to/your/notebook used `DataFrame.collect` function that will be deprecated soon. Please fix the notebo...
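A sketch of keeping the work on the DataFrame instead of calling collect() inside the pipeline; the source table and column names are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

# Hedged sketch: aggregate on the DataFrame rather than pulling rows to the driver
# with collect(). Source table and columns are placeholders.
@dlt.table
def daily_counts():
    src = dlt.read("raw_events")
    return src.groupBy("event_date").agg(F.count("*").alias("n_events"))
```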
Apache Spark session is null in DBConnect
Problem You are trying to run your code using Databricks Connect ( AWS | Azure | GCP ) when you get a sparkSession is null error message. java.lang.AssertionError: assertion failed: sparkSession is null while trying to executeCollectResult at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(...
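For reference, the usual way to obtain the session with Databricks Connect is through the builder rather than constructing contexts by hand; a minimal sketch, not necessarily the article's resolution:

```python
from pyspark.sql import SparkSession

# Hedged sketch: get (or create) the remote session via the builder so sparkSession
# is initialized before actions such as collect() run.
spark = SparkSession.builder.getOrCreate()
print(spark.range(5).collect())
```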