Databricks Knowledge Base


Scala with Apache Spark (GCP)

These articles can help you use Scala with Apache Spark.

11 Articles in this category


Apache Spark UI is not in sync with job

Problem The status of your Spark jobs is not correctly shown in the Spark UI (AWS | Azure | GCP). Some of the jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank. When you review the driver logs, you see an AsyncEventQueue warning. Logs ===== 20/12/23 21:20:26 ...
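
The AsyncEventQueue warning generally means the Spark listener bus dropped events because its queue filled up, which leaves the UI with a stale picture of the job. As a hedged sketch of one common mitigation (not necessarily this article's confirmed fix), the standard Spark setting spark.scheduler.listenerbus.eventqueue.capacity can be raised in the cluster's Spark config at cluster creation; the value below is illustrative.

spark.scheduler.listenerbus.eventqueue.capacity 20000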

Last updated: May 11th, 2022 by chetan.kardekar

Apache Spark job fails with Parquet column cannot be converted error

Problem You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted error message. The cluster is running Databricks Runtime 7.3 LTS or above. org.apache.spark.SparkException: Task failed while writing rows. Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://buc...
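
This error usually points to a mismatch between a column's physical type in the Parquet files and the type Spark expects (decimal columns are a frequent culprit). A sketch of one commonly used workaround, not necessarily this article's resolution, is to fall back to the row-based Parquet reader:

%scala
// spark.sql.parquet.enableVectorizedReader is a standard Spark SQL setting;
// disabling it routes reads through the non-vectorized reader, which tolerates
// more type conversions at the cost of throughput.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")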

Last updated: May 20th, 2022 by shanmugavel.chandrakasu

Best practice for cache(), count(), and take()

cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or RDD in the memory of your cluster’s workers. Since cache() is a transformation, the caching operation takes place only when a Spark action (for example, count(),...
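
A minimal sketch of that lazy behavior; the DataFrame and source path are placeholders:

%scala
val df = spark.read.parquet("/tmp/events")  // hypothetical source

df.cache()   // transformation: registers the plan for caching, nothing is stored yet
df.count()   // action: scans every partition, materializing the full cache
df.take(10)  // later actions are served from the cached data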

Last updated: May 20th, 2022 by ram.sankarasubramanian

Cannot import timestamp_millis or unix_millis

Problem You are trying to import timestamp_millis or unix_millis into a Scala notebook, but get an error message. %scala import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} error: value timestamp_millis is not a member of object org.apache.spark.sql.functions import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} Cau...
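
Assuming a runtime where the SQL functions themselves are registered but not yet exposed on the Scala functions object, one workaround (a sketch, not necessarily the article's fix) is to call them through expr():

%scala
import org.apache.spark.sql.functions.expr

// ts_ms is a hypothetical column holding milliseconds since the epoch.
val withTs = df.select(expr("timestamp_millis(ts_ms)").alias("ts"))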

Last updated: May 20th, 2022 by saritha.shivakumar

Cannot modify the value of an Apache Spark config

Problem You are trying to SET the value of a Spark config in a notebook and get a Cannot modify the value of a Spark config error. For example: %sql SET spark.serializer=org.apache.spark.serializer.KryoSerializer Error in SQL statement: AnalysisException: Cannot modify the value of a Spark config: spark.serializer; Cause The SET command does not wor...
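
Settings like spark.serializer must be in place before the SparkSession starts, so they cannot be changed with SET at runtime. A sketch of the alternative: add the key-value pair to the cluster's Spark config (Advanced options > Spark) when the cluster is created, for example:

spark.serializer org.apache.spark.serializer.KryoSerializer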

Last updated: May 20th, 2022 by Adam Pavlacka

Convert nested JSON to a flattened DataFrame

This article shows you how to flatten nested JSON, using only $"column.*" and explode methods. Sample JSON file Pass the sample JSON string to the reader. %scala val json = """ { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": ...
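
A short sketch of the two techniques the article names, assuming df was parsed from the sample JSON above (column names follow that sample):

%scala
import org.apache.spark.sql.functions.explode
import spark.implicits._

val flattened = df
  .select($"id", $"type", $"name", $"ppu", $"batters.batter")  // drill into the nested struct
  .withColumn("batter", explode($"batter"))                    // one row per array element
  .select($"id", $"type", $"name", $"ppu", $"batter.*")        // star-expand the struct fields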

Last updated: May 20th, 2022 by Adam Pavlacka

Create a DataFrame from a JSON string or Python dictionary

In this article we review how to create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary. Create a Spark DataFrame from a JSON string Add the JSON content from the variable to a list. %scala import scala.collection.mutable.ListBuffer val json_content1 = "{'json_col1': 'hello', 'json_col2': 32...
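
A condensed variant of the Scala approach (spark.read.json has accepted a Dataset[String] since Spark 2.2; the Seq here stands in for the article's ListBuffer):

%scala
import spark.implicits._

val json_content1 = """{"json_col1": "hello", "json_col2": 32}"""
val df = spark.read.json(Seq(json_content1).toDS)
display(df)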

Last updated: July 1st, 2022 by ram.sankarasubramanian

Decimal$DecimalIsFractional assertion error

Problem You are running a job on Databricks Runtime 7.x or above when you get a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional error message. Example stack trace: java.lang.AssertionError: assertion failed:  Decimal$DecimalIsFractional   while compiling: <notebook>    during phase: globalPhase=terminal, enteringPhase=j...

Last updated: May 23rd, 2022 by saikrishna.pujari

from_json returns null in Apache Spark 3.0

Problem The from_json function is used to parse a JSON string and return a struct of values. For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return. %python from pyspark.sql.functions import col, from_json display(   df.select(col('value'), from_json(c...
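
In Spark 3.0, a JSON array string no longer parses against a bare struct schema, which is one documented way to get null back. A sketch of the array-schema form, in Scala for consistency with this category (column and field names follow the example above):

%scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Wrapping the struct in ArrayType lets the array string parse instead of returning null.
val schema = ArrayType(new StructType().add("id", StringType).add("name", StringType))
val parsed = df.select(col("value"), from_json(col("value"), schema).alias("parsed"))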

Last updated: May 23rd, 2022 by shanmugavel.chandrakasu

Manage the size of Delta tables

Delta tables are different from traditional tables. Delta tables include ACID transactions and time travel features, which means they maintain transaction logs and stale data files. These additional features require storage space. In this article we discuss recommendations that can help you manage the size of your Delta tables. Enable file system ve...
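
One routine lever for reclaiming that space is VACUUM, which deletes data files that have aged out of the Delta log's retention window. A sketch with an illustrative table name; 168 hours (7 days) is the Delta default retention:

%scala
// Removes data files no longer referenced by table versions within the retention window.
spark.sql("VACUUM my_delta_table RETAIN 168 HOURS")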

Last updated: May 23rd, 2022 by Jose Gonzalez

Select files using a pattern match

When selecting files, a common requirement is to only read specific files from a folder. For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression. This article uses examp...
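
A minimal sketch of the idea, with illustrative paths: the glob in the load path selects only the matching files.

%scala
// Reads every daily log file for May 2022 in one expression.
val logs = spark.read
  .format("json")
  .load("/mnt/logs/2022-05-*.json")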

Last updated: May 23rd, 2022 by mathan.pillai


© Databricks 2022. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
