Apache Spark UI is not in sync with job
Problem: The status of your Spark jobs is not correctly shown in the Spark UI. Some jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank. When you review the driver logs, you see an AsyncEventQueue warning. Logs: 20/12/23 21:20:26 ...
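When the AsyncEventQueue warning reports dropped events, the UI falls out of sync because listener events never reach it. One common mitigation, offered here as a hedged sketch rather than the article's own resolution (the excerpt is truncated before the fix), is to enlarge the listener bus queue in the cluster's Spark config before startup; the value below is an illustrative assumption (the Spark default is 10000):

spark.scheduler.listenerbus.eventqueue.capacity 20000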
Apache Spark job fails with Parquet column cannot be converted error
Problem: You are reading data in Parquet format and writing it to a Delta table when you get a Parquet column cannot be converted error message. The cluster is running Databricks Runtime 7.3 LTS or above. org.apache.spark.SparkException: Task failed while writing rows. Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://buc...
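This error typically surfaces when a Parquet column's physical type does not match the type Spark expects (for example, a decimal stored where a double is expected). A hedged workaround sketch, assuming the mismatch is triggered by the vectorized Parquet reader; the path and table name are hypothetical:

%scala
// Disable the vectorized Parquet reader for this session only (slower,
// but tolerant of type mismatches in existing Parquet files).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val df = spark.read.parquet("s3://my-bucket/path/to/data")           // hypothetical path
df.write.format("delta").mode("append").saveAsTable("my_delta_table") // hypothetical table

Re-enable the vectorized reader afterwards, since it is significantly faster for healthy data.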
Best practice for cache(), count(), and take()
cache() is a lazily evaluated Apache Spark operation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or RDD in the memory of your cluster’s workers. Because cache() is lazy, the caching operation takes place only when a Spark action (for example, count(),...
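Because the cache is only materialized by an action, the action you choose matters: count() scans every partition and therefore caches the whole dataset, while take(n) may read, and cache, only the partitions needed to produce n rows. A minimal sketch with a hypothetical table name:

%scala
val df = spark.table("events") // hypothetical table
df.cache()                     // lazy: nothing is cached yet
df.count()                     // action: scans and caches all partitions
df.take(1)                     // by contrast, this alone may cache only part of the data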
Cannot import timestamp_millis or unix_millis
Problem: You are trying to import timestamp_millis or unix_millis into a Scala notebook, but get an error message. %scala import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} error: value timestamp_millis is not a member of object org.apache.spark.sql.functions import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} Cau...
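If these functions are not exposed as Scala wrappers in org.apache.spark.sql.functions on your runtime, one workaround sketch is to call the equivalent Spark SQL functions through selectExpr instead of importing them. This assumes the SQL functions are available on your Spark version; the table and column names are hypothetical:

%scala
// Call the SQL functions directly rather than importing Scala wrappers.
val df = spark.table("events") // hypothetical table with a long column "millis"
df.selectExpr(
  "timestamp_millis(millis) AS ts",
  "unix_millis(current_timestamp()) AS now_millis"
).show()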
Cannot modify the value of an Apache Spark config
Problem: You are trying to SET the value of a Spark config in a notebook and get a Cannot modify the value of a Spark config error. For example: %sql SET spark.serializer=org.apache.spark.serializer.KryoSerializer Error in SQL statement: AnalysisException: Cannot modify the value of a Spark config: spark.serializer; Cause: The SET command does not wor...
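The distinction at work is between runtime SQL configurations, which you can change from a running notebook, and core Spark configurations such as spark.serializer, which must be set in the cluster's Spark config before startup. A minimal sketch; the shuffle-partitions value is just an illustrative assumption:

%scala
// Runtime SQL configs can be modified from a running notebook:
spark.conf.set("spark.sql.shuffle.partitions", "64")
// Core configs cannot; set them in the cluster's Spark config instead, e.g.:
//   spark.serializer org.apache.spark.serializer.KryoSerializer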
Convert nested JSON to a flattened DataFrame
This article shows you how to flatten nested JSON using only the $"column.*" syntax and the explode method. Sample JSON file: Pass the sample JSON string to the reader. %scala val json = """ { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": ...
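As a sketch of the technique the article builds up to (the column names follow the sample JSON above, assuming the full string defines "batters" as a struct containing a "batter" array; the full article's exact steps may differ):

%scala
import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = spark.read.json(Seq(json).toDS)
// $"batters.*" promotes the nested struct's fields to top-level columns...
val flat = df.select($"id", $"type", $"name", $"ppu", $"batters.*")
// ...and explode() turns each element of the "batter" array into its own row.
val rows = flat.select($"id", explode($"batter").as("batter"))
rows.show(false)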
Create a DataFrame from a JSON string or Python dictionary
In this article we review how you can create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary. Create a Spark DataFrame from a JSON string: Add the JSON content from the variable to a list. %scala import scala.collection.mutable.ListBuffer val json_content1 = "{'json_col1': 'hello', 'json_col2': 32...
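A compact alternative sketch that skips the intermediate list: convert the string into a one-element Dataset and let the JSON reader infer the schema. The variable contents mirror the excerpt, but with double quotes so the string is valid JSON:

%scala
import spark.implicits._

val jsonContent = """{"json_col1": "hello", "json_col2": 32}"""
// spark.read.json accepts a Dataset[String]; the schema is inferred per record.
val df = spark.read.json(Seq(jsonContent).toDS)
df.show()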
Decimal$DecimalIsFractional assertion error
Problem: You are running a job on Databricks Runtime 7.x or above when you get a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional error message. Example stack trace: java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional while compiling: <notebook> during phase: globalPhase=terminal, enteringPhase=j...
from_json returns null in Apache Spark 3.0
Problem: The from_json function is used to parse a JSON string and return a struct of values. For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return. %python from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(c...
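In Spark 3.0, parsing of top-level JSON arrays is stricter than some 2.x code assumed: an array input needs an ArrayType schema, and passing a bare StructType yields null. A sketch of the distinction, using the sample string from the excerpt (written in Scala to match the other sketches in this section):

%scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq("""[{"id":"001","name":"peter"}]""").toDF("value")
val struct = new StructType().add("id", StringType).add("name", StringType)
// Returns null in Spark 3.0: the input is an array, but the schema is a bare struct.
df.select(from_json($"value", struct).as("parsed")).show(false)
// Parses correctly: wrap the struct in an ArrayType to match the input.
df.select(from_json($"value", ArrayType(struct)).as("parsed")).show(false)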
Manage the size of Delta tables
Delta tables are different from traditional tables. Delta tables include ACID transactions and time travel features, which means they maintain transaction logs and stale data files. These additional features require storage space. In this article we discuss recommendations that can help you manage the size of your Delta tables. Enable file system ve...
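One standard lever for reclaiming that space (the excerpt is truncated before the article's own recommendations) is VACUUM, which removes data files no longer referenced by the transaction log once they are older than the retention window. A hedged sketch with a hypothetical table name; shortening retention below the 7-day default limits time travel and should be done with care:

%scala
import io.delta.tables.DeltaTable

// Remove unreferenced files older than the default retention period (7 days).
DeltaTable.forName(spark, "my_delta_table").vacuum()
// Or specify the retention window in hours explicitly:
DeltaTable.forName(spark, "my_delta_table").vacuum(168)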
Select files using a pattern match
When selecting files, a common requirement is to only read specific files from a folder. For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression. This article uses examp...
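A sketch of the idea with a hypothetical directory layout, where one glob matches every day in March without enumerating the files:

%scala
// Layout assumed (hypothetical): /mnt/logs/<year>/<month>/<day>/*.json
val march = spark.read.format("json").load("/mnt/logs/2021/03/*/*.json")
march.count()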