Updated May 26th, 2022 by mathan.pillai

Get and set Apache Spark configuration properties in a notebook

In most cases, you set the Spark config (AWS | Azure ) at the cluster level. However, there may be instances when you need to check (or set) the values of specific Spark configuration properties in a notebook. This article shows you how to display the current value of a Spark configuration property in a notebook. It also shows you how to set a new v...

0 min reading time
Updated May 23rd, 2022 by mathan.pillai

Find the size of a table

This article explains how to find the size of a table. The command used depends on if you are trying to find the size of a delta table or a non-delta table. Size of a delta table To find the size of a delta table, you can use a Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...

0 min reading time
Updated May 10th, 2022 by mathan.pillai

Z-Ordering will be ineffective, not collecting stats

Problem You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns. AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns. Info Please review Z-Ordering (multi-dimensional clustering) (AWS | Azure | GCP) for more infor...

0 min reading time
Updated May 10th, 2022 by mathan.pillai

Optimize a Delta sink in a structured streaming application

You are using a Delta table as the sink for your structured streaming application and you want to optimize the Delta table so that queries are faster. If your structured streaming application has a very frequent trigger interval, it may not create sufficient files that are eligible for compaction in each microbatch. The autoOptimize operation compac...

0 min reading time
Updated May 23rd, 2022 by mathan.pillai

Select files using a pattern match

When selecting files, a common requirement is to only read specific files from a folder. For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression. This article uses examp...

1 min reading time
Updated May 23rd, 2022 by mathan.pillai

Intermittent NullPointerException when AQE is enabled

Problem You get an intermittent NullPointerException error when saving your data. Py4JJavaError: An error occurred while calling o2892.save. : java.lang.NullPointerException     at org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.$anonfun$getMapSizesForReduceId$1(OptimizeSkewedJoin.scala:167)     at org.apache.spark.sql.execution.adaptive....

0 min reading time
Updated May 10th, 2022 by mathan.pillai

Compare two versions of a Delta table

Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table. One common use case is to compare two versions of a Delta table in order to identify what changed. For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP). Identify all differences You can use a SQL SELEC...

0 min reading time
Updated May 16th, 2022 by mathan.pillai

Conda fails to download packages from Anaconda

Problem You are attempting to download packages from the Anaconda repository and get a PackagesNotFoundError error message. This error can occur when using %conda, or %sh conda in notebooks, and when using Conda in an init script. Cause Anaconda Inc. updated the terms of service for repo.anaconda.com and anaconda.org/anaconda. Based on the Anaconda ...

0 min reading time
Updated May 19th, 2022 by mathan.pillai

readStream() is not whitelisted error when running a query

Problem You have table access control (AWS | Azure | GCP) enabled on your cluster. You are trying to run a structured streaming query and get and error message. py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.s...

0 min reading time
Load More