A file referenced in the transaction log cannot be found
Problem Your job fails with an error message: A file referenced in the transaction log cannot be found. Example stack trace: Error in SQL statement: SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 106, XXX.XXX.XXX.XXX, executor 0): com.databricks.sql.io.FileRe...
Cannot delete data using JDBC in Eclipse
Problem You cannot delete data from a Delta table using JDBC from your local Eclipse environment. The same delete operation works when run in a notebook. You can connect to Databricks from your local environment. A select query works. Integer count = namedlocalJdbcTemplate.queryForObject("SELECT COUNT(*) FROM <table-name> ", new MapSqlParamete...
Compare two versions of a Delta table
Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table. One common use case is to compare two versions of a Delta table in order to identify what changed. For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP). Identify all differences You can use a SQL SELEC...
Converting from Parquet to Delta Lake fails
Problem You are attempting to convert a Parquet file to a Delta Lake file. The directory containing the Parquet file contains one or more subdirectories. The conversion fails with the error message: Expecting 0 partition column(s): [], but found 1 partition column(s): [<column_name>] from parsing the file name: <path_to_the_file_location>...
Delta Merge cannot resolve nested field
Problem You are attempting a Delta Merge with automatic schema evolution, but it fails with a Delta Merge: cannot resolve 'field' due to data type mismatch error message. Cause This can happen if you have made changes to the nested column fields. For example, assume we have a column called Address with the fields streetName, houseNumber, and city ne...
Delete your streaming query checkpoint and restart
Problem Your job fails with a Delta table <value> doesn't exist. Please delete your streaming query checkpoint and restart. error message. Cause Two different streaming sources are configured to use the same checkpoint directory. This is not supported. For example, assume streaming query A streams data from Delta table A, and uses the director...
How Delta cache behaves on an autoscaling cluster
This article is about how Delta cache (AWS | Azure | GCP) behaves on an auto-scaling cluster, which removes or adds nodes as needed. When a cluster downscales and terminates nodes: A Delta cache behaves in the same way as an RDD cache. Whenever a node goes down, all of the cached data in that particular node is lost. Delta cache data is not moved fr...
How to improve performance of Delta Lake MERGE INTO queries using partition pruning
This article explains how to trigger partition pruning in Delta Lake MERGE INTO (AWS | Azure | GCP) queries from Databricks. Partition pruning is an optimization technique to limit the number of partitions that are inspected by a query. Discussion MERGE INTO is an expensive operation when used with Delta tables. If you don’t partition the underlying...
Best practices for dropping a managed Delta Lake table
Regardless of how you drop a managed table, it can take a significant amount of time, depending on the data size. Delta Lake managed tables in particular contain a lot of metadata in the form of transaction logs, and they can contain duplicate data files. If a Delta table has been in use for a long time, it can accumulate a very large amount of data...
HIVE_CURSOR_ERROR when reading a table in Athena
Problem You create an external table in Athena and integrate it with Delta Lake using the instructions in the Presto and Athena to Delta Lake integration documentation. The external table in Athena is defined in the Apache Hive metastore. You run a select query on external table from the Athena Query Editor and it returns a HIVE_CURSOR_ERROR. HIVE_C...
Access denied when writing Delta Lake tables to S3
Problem Writing DataFrame contents in Delta Lake format to an S3 location can cause an error: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C827672D85516BA9; S3 Extended Request ID: Cause A write operation involving the Delta Lake format requires permissions...
Delta Lake write job fails with java.lang.UnsupportedOperationException
Problem Delta Lake write jobs sometimes fail with the following exception: java.lang.UnsupportedOperationException: com.databricks.backend.daemon.data.client.DBFSV1.putIfAbsent(path: Path, content: InputStream). DBFS v1 doesn't support transactional writes from multiple clusters. Please upgrade to DBFS v2. Or you can disable multi-cluster writes by ...
How to populate or update columns in an existing Delta table
Problem You have an existing Delta table, with a few empty columns. You need to populate or update those columns with data from a raw Parquet file. Solution In this example, there is a customers table, which is an existing Delta table. It has an address column with missing values. The updated data exists in Parquet format. Create a DataFrame from th...
Identify duplicate data on append operations
A common issue when performing append operations on Delta tables is duplicate data. For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table. In this article, we review basic troubleshooting steps that you can use to...
Object lock error when writing Delta Lake tables to S3
Problem You are trying to perform a Delta write operation to a S3 bucket and get an error message. com.amazonaws.services.s3.model.AmazonS3Exception: Content-MD5 HTTP header is required for Put Part requests with Object Lock parameters Cause Delta Lake does not support S3 buckets with object lock enabled. Solution You should use an S3 bucket that do...
Optimize a Delta sink in a structured streaming application
You are using a Delta table as the sink for your structured streaming application and you want to optimize the Delta table so that queries are faster. If your structured streaming application has a very frequent trigger interval, it may not create sufficient files that are eligible for compaction in each microbatch. The autoOptimize operation compac...
Delta Lake UPDATE query fails with IllegalState exception
Problem When you execute a Delta Lake UPDATE, DELETE, or MERGE query that uses Python UDFs in any of its transformations, it fails with the following exception: AWS java.lang.UnsupportedOperationException: Error in SQL statement: IllegalStateException: File (s3a://xxx/table1) to be rewritten not found among candidate files: s3a://xxx/table1/part-000...
Unable to cast string to varchar
Problem You are trying to cast a string type column to varchar but it isn’t working. Info The varchar data type (AWS | Azure | GCP) is available in Databricks Runtime 8.0 and above. Create a simple Delta table, with one column as type string.%sql CREATE OR REPLACE TABLE delta_table1 (`col1` string) USING DELTA; Use SHOW TABLE on the newly created ta...
Vaccuming with zero retention results in data loss
Problem You add data to a Delta table, but the data disappears without warning. There is no obvious error message. Cause This can happen when spark.databricks.delta.retentionDurationCheck.enabled is set to false and VACUUM is configured to retain 0 hours. %sql VACUUM <name-of-delta-table> RETAIN 0 HOURS When VACUUM is configured to retain 0 ho...
Z-Ordering will be ineffective, not collecting stats
Problem You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns. AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns. Info Please review Z-Ordering (multi-dimensional clustering) (AWS | Azure | GCP) for more infor...
Change cluster config for Delta Live Table pipeline
Problem You are using Delta Live Tables and want to change the cluster configuration. You create a pipeline, but only have options to enable or disable Photon and select the number of workers. Cause When you create a Delta Live Table pipeline, most parameters are configured with default values. These values cannot be configured before the pipeline i...