Updated May 10th, 2022 by mathan.pillai

Optimize a Delta sink in a structured streaming application

You are using a Delta table as the sink for your structured streaming application and you want to optimize the Delta table so that queries are faster. If your structured streaming application has a very frequent trigger interval, it may not create sufficient files that are eligible for compaction in each microbatch. The autoOptimize operation compac...

0 min reading time
Updated February 3rd, 2023 by mathan.pillai

OPTIMIZE is only supported for Delta tables error on Delta Lake

Problem You run OPTIMIZE on a Delta table and get an error message saying it is only supported on Delta tables. Error: `<database-name>`.`<table-name>`is not a Delta table. OPTIMIZE is only supported for Delta tables. Cause This can happen if the target table's storage location was modified and the table was recreated with a new storage ...

0 min reading time
Updated May 23rd, 2022 by mathan.pillai

Intermittent NullPointerException when AQE is enabled

Problem You get an intermittent NullPointerException error when saving your data. Py4JJavaError: An error occurred while calling o2892.save. : java.lang.NullPointerException     at org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.$anonfun$getMapSizesForReduceId$1(OptimizeSkewedJoin.scala:167)     at org.apache.spark.sql.execution.adaptive....

0 min reading time
Updated May 16th, 2022 by mathan.pillai

Conda fails to download packages from Anaconda

Problem You are attempting to download packages from the Anaconda repository and get a PackagesNotFoundError error message. This error can occur when using %conda, or %sh conda in notebooks, and when using Conda in an init script. Cause Anaconda Inc. updated the terms of service for repo.anaconda.com and anaconda.org/anaconda. Based on the Anaconda ...

0 min reading time
Updated May 19th, 2022 by mathan.pillai

readStream() is not whitelisted error when running a query

Problem You have table access control (AWS | Azure | GCP) enabled on your cluster. You are trying to run a structured streaming query and get and error message. py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.s...

0 min reading time
Updated September 28th, 2022 by mathan.pillai

Find the size of a table

This article explains how to find the size of a table. The command used depends on if you are trying to find the size of a delta table or a non-delta table. Size of a delta table To find the size of a delta table, you can use a Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...

0 min reading time
Updated November 7th, 2022 by mathan.pillai

Job fails with ExecutorLostFailure due to “Out of memory” error

Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...

2 min reading time
Updated May 23rd, 2022 by mathan.pillai

Select files using a pattern match

When selecting files, a common requirement is to only read specific files from a folder. For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression. This article uses examp...

1 min reading time
Updated November 7th, 2022 by mathan.pillai

Job fails with ExecutorLostFailure because executor is busy

Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...

1 min reading time
Updated February 3rd, 2023 by mathan.pillai

VACUUM best practices on Delta Lake

Why use VACUUM on Delta Lake? VACUUM is used to clean up unused and stale data files that are taking up unnecessary storage space. Removing these files can help reduce storage costs.  When you run VACUUM on a Delta table it removes the following files from the underlying file system: Any data files that are not maintained by Delta Lake Removes stale...

5 min reading time
Updated December 1st, 2023 by mathan.pillai

Get and set Apache Spark configuration properties in a notebook

In most cases, you set the Spark config (AWS | Azure ) at the cluster level. However, there may be instances when you need to check (or set) the values of specific Spark configuration properties in a notebook. This article shows you how to display the current value of a Spark configuration property in a notebook. It also shows you how to set a new v...

0 min reading time
Updated May 10th, 2022 by mathan.pillai

Z-Ordering will be ineffective, not collecting stats

Problem You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns. AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns. Info Please review Z-Ordering (multi-dimensional clustering) (AWS | Azure | GCP) for more infor...

0 min reading time
Updated May 10th, 2022 by mathan.pillai

Compare two versions of a Delta table

Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table. One common use case is to compare two versions of a Delta table in order to identify what changed. For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP). Identify all differences You can use a SQL SELEC...

0 min reading time
Load More