Optimize a Delta sink in a structured streaming application
You are using a Delta table as the sink for your structured streaming application and you want to optimize the Delta table so that queries are faster. If your structured streaming application has a very frequent trigger interval, it may not create sufficient files that are eligible for compaction in each microbatch. The autoOptimize operation compac...
0 min reading timeOPTIMIZE is only supported for Delta tables error on Delta Lake
Problem You run OPTIMIZE on a Delta table and get an error message saying it is only supported on Delta tables. Error: `<database-name>`.`<table-name>`is not a Delta table. OPTIMIZE is only supported for Delta tables. Cause This can happen if the target table's storage location was modified and the table was recreated with a new storage ...
0 min reading timeIntermittent NullPointerException when AQE is enabled
Problem You get an intermittent NullPointerException error when saving your data. Py4JJavaError: An error occurred while calling o2892.save. : java.lang.NullPointerException at org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.$anonfun$getMapSizesForReduceId$1(OptimizeSkewedJoin.scala:167) at org.apache.spark.sql.execution.adaptive....
0 min reading timeConda fails to download packages from Anaconda
Problem You are attempting to download packages from the Anaconda repository and get a PackagesNotFoundError error message. This error can occur when using %conda, or %sh conda in notebooks, and when using Conda in an init script. Cause Anaconda Inc. updated the terms of service for repo.anaconda.com and anaconda.org/anaconda. Based on the Anaconda ...
0 min reading timereadStream() is not whitelisted error when running a query
Problem You have table access control (AWS | Azure | GCP) enabled on your cluster. You are trying to run a structured streaming query and get and error message. py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.s...
0 min reading timeFind the size of a table
This article explains how to find the size of a table. The command used depends on if you are trying to find the size of a delta table or a non-delta table. Size of a delta table To find the size of a delta table, you can use a Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...
0 min reading timeJob fails with ExecutorLostFailure due to “Out of memory” error
Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...
2 min reading timeSelect files using a pattern match
When selecting files, a common requirement is to only read specific files from a folder. For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression. This article uses examp...
1 min reading timeJob fails with ExecutorLostFailure because executor is busy
Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...
1 min reading timeVACUUM best practices on Delta Lake
Why use VACUUM on Delta Lake? VACUUM is used to clean up unused and stale data files that are taking up unnecessary storage space. Removing these files can help reduce storage costs. When you run VACUUM on a Delta table it removes the following files from the underlying file system: Any data files that are not maintained by Delta Lake Removes stale...
5 min reading timeGet and set Apache Spark configuration properties in a notebook
In most cases, you set the Spark config (AWS | Azure ) at the cluster level. However, there may be instances when you need to check (or set) the values of specific Spark configuration properties in a notebook. This article shows you how to display the current value of a Spark configuration property in a notebook. It also shows you how to set a new v...
0 min reading timeZ-Ordering will be ineffective, not collecting stats
Problem You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns. AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns. Info Please review Z-Ordering (multi-dimensional clustering) (AWS | Azure | GCP) for more infor...
0 min reading timeCompare two versions of a Delta table
Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table. One common use case is to compare two versions of a Delta table in order to identify what changed. For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP). Identify all differences You can use a SQL SELEC...
0 min reading time