Updated June 1st, 2022 by Adam Pavlacka

Failure to detect encoding in JSON

Problem Spark job fails with an exception containing the message: Invalid UTF-32 character 0x1414141(above 10ffff)  at char #1, byte #7) At org.apache.spark.sql.catalyst.json.JacksonParser.parse Cause The JSON data source reader is able to automatically detect encoding of input JSON files using BOM at the beginning of the files. However, BOM is not ...
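
BOM-based detection, as described in the Cause above, can be illustrated with a short Python sketch. This is not the Spark or Jackson implementation; the detect_bom helper and its return labels are hypothetical and only show how a reader can infer an encoding from the first bytes of a file, and why a file without a BOM gives it nothing to work with.

```python
import codecs

# UTF-32 marks are checked first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


# A file with a BOM is identified; a file without one is not.
with_bom = codecs.BOM_UTF16_LE + '{"a": 1}'.encode("utf-16-le")
without_bom = '{"a": 1}'.encode("utf-16-le")
print(detect_bom(with_bom))
print(detect_bom(without_bom))
```

When no BOM is present, the encoding has to be supplied some other way, because the bytes alone do not identify it.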

0 min reading time
Updated August 15th, 2022 by Adam Pavlacka

Troubleshooting JDBC and ODBC connections

Cloud Version: AWS, Azure, GCP Last reviewed date: May 05, 2021 This article provides information to help you troubleshoot the connection between ...

2 min reading time
Updated March 2nd, 2022 by Adam Pavlacka

Enable OpenJSSE and TLS 1.3

Queries and transformations are encrypted before being sent to your clusters. By default, the data exchanged between worker nodes in a cluster is not encrypted. If you require that data is encrypted at all times, you can encrypt traffic between cluster worker nodes using AES 128 over a TLS 1.2 connection. In some cases, you may want to use TLS 1.3 i...

0 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Convert flattened DataFrame to nested JSON

This article explains how to convert a flattened DataFrame to a nested structure by nesting a case class within another case class. You can use this technique to build a JSON file that can then be sent to an external API. Define nested schema We’ll start with a flattened DataFrame. Using this example DataFrame, we define a custom nested schema usi...

0 min reading time
Updated May 9th, 2022 by Adam Pavlacka

How to Sort S3 files By Modification Time in Databricks Notebooks

Problem When you use the dbutils utility to list the files in an S3 location, the S3 files are listed in random order. However, dbutils doesn’t provide any method to sort the files based on their modification time. dbutils doesn’t list a modification time either. Solution Use the Hadoop filesystem API to sort the S3 files, as shown here: %scala import org....

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

How to ensure idempotency for jobs

When you submit jobs through the Databricks Jobs REST API, idempotency is not guaranteed. If the client request is timed out and the client resubmits the same request, you may end up with duplicate jobs running. To ensure job idempotency when you submit jobs through the Jobs API, you can use an idempotency token to define a unique value for a specif...
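
The retry pattern described above can be sketched in Python as follows. This is a minimal illustration, not a complete client: the build_submit_payload helper and the run name are hypothetical, the request is never sent, and the cluster and task settings that a real request body needs are omitted. It assumes the request body accepts an idempotency_token field.

```python
import uuid


def build_submit_payload(run_name, idempotency_token=None):
    """Return a request body that carries an idempotency token.

    run_name and this helper are hypothetical; a real request also needs
    cluster and task settings, which are omitted here.
    """
    token = idempotency_token or str(uuid.uuid4())
    return {"run_name": run_name, "idempotency_token": token}


# Generate the token once, before the first attempt.
first_attempt = build_submit_payload("nightly-report")

# After a client-side timeout, resend the same token instead of a new one.
retry = build_submit_payload("nightly-report", first_attempt["idempotency_token"])

print(first_attempt["idempotency_token"] == retry["idempotency_token"])
```

The important design choice is that the token is created before the first attempt and stored by the client, so every resubmission of the same logical request carries the same value.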

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Persist Apache Spark CSV metrics to a DBFS location

Spark has a configurable metrics system that supports a number of sinks, including CSV files. In this article, we are going to show you how to configure a Databricks cluster to use a CSV sink and persist those metrics to a DBFS location. Create an init script All of the configuration is done in an init script. The init script does the following thre...

1 min reading time
Updated May 25th, 2022 by Adam Pavlacka

Simplify chained transformations

Sometimes you may need to perform multiple transformations on your DataFrame: %scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.DataFrame val testDf = (1 to 10).toDF("col") def func0(x: Int => Int, y: Int)(in: DataFrame): DataFrame = {   in.filter('col > x(y)) } def func1(x: Int)(in: DataFrame): DataFrame = {   in.sele...

1 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Invalid timestamp when loading data into Amazon Redshift

Problem When you use a spark-redshift write operation to save timestamp data to Amazon Redshift, the following error can occur if that timestamp data includes timezone information. Error (code 1206) while loading data into Redshift: "Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SSOF]" Cause The Redshift table is using the Timestamp data typ...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

How to improve performance of Delta Lake MERGE INTO queries using partition pruning

This article explains how to trigger partition pruning in Delta Lake MERGE INTO (AWS | Azure | GCP) queries from Databricks. Partition pruning is an optimization technique to limit the number of partitions that are inspected by a query. Discussion MERGE INTO is an expensive operation when used with Delta tables. If you don’t partition the underlying...

3 min reading time
Updated July 22nd, 2022 by Adam Pavlacka

Cannot read audit logs due to duplicate columns

Problem You are trying to read audit logs and get an AnalysisException: Found duplicate column(s) error. spark.read.format("json").load("dbfs://mnt/logs/<path-to-logs>/date=2021-12-07") //  AnalysisException: Found duplicate column(s) in the data schema: `<some_column>` Cause From November 2021 to December 2021, a limited number of Data...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

How to configure single-core executors to run JNI libraries

When you create a cluster, Databricks launches one Apache Spark executor instance per worker node, and the executor uses all of the cores on the node. In certain situations, such as if you want to run non-thread-safe JNI libraries, you might need an executor that has only one core or task slot, and does not attempt to run concurrent tasks. In this c...

1 min reading time
Updated July 22nd, 2022 by Adam Pavlacka

Apache Spark UI shows less than total node memory

Problem The Executors tab in the Spark UI shows less memory than is actually available on the node: AWS An m4.xlarge instance (16 GB RAM, 4 cores) for the driver node shows 4.5 GB memory on the Executors tab. An m4.large instance (8 GB RAM, 2 cores) for the driver node shows 710 MB memory on the Executors tab: Azure An F8s instance (16 GB, 4 core) f...

1 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Delete your streaming query checkpoint and restart

Problem Your job fails with a Delta table <value> doesn't exist. Please delete your streaming query checkpoint and restart. error message. Cause Two different streaming sources are configured to use the same checkpoint directory. This is not supported. For example, assume streaming query A streams data from Delta table A, and uses the director...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Cluster cancels Python command execution after installing Bokeh

Problem The cluster returns Cancelled in a Python notebook. Inspect the driver log (std.err) in the Cluster Configuration page for a stack trace and error message similar to the following: log4j:WARN No appenders could be found for logger (com.databricks.conf.trusted.ProjectConf$). log4j:WARN Please initialize the log4j system properly. log4j:WARN S...

1 min reading time
Updated May 25th, 2022 by Adam Pavlacka

How to dump tables in CSV, JSON, XML, text, or HTML format

You want to send results of your computations in Databricks outside Databricks. You can use BI tools to connect to your cluster via JDBC and export results from the BI tools, or save your tables in DBFS or blob storage and copy the data via REST API. This article introduces JSpark, a simple console tool for executing SQL queries using JDBC on Spark ...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to explore Apache Spark metrics with Spark listeners

Apache Spark provides several useful internal listeners that track metrics about tasks and jobs. During the development cycle, for example, these metrics can help you to understand when and why a task takes a long time to finish. Of course, you can leverage the Spark UI or History UI to see information for each task and stage, but there are some dow...

2 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Change version of R (r-base)

These instructions describe how to install a different version of R (r-base) on a cluster. You can check the default r-base version that each Databricks Runtime version is installed with in the System environment section of each Databricks Runtime release note (AWS | Azure | GCP). List available r-base-core versions To list the versions of r-base-co...

1 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Cluster failed to launch

This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs. Cluster timeout Error messages: Driver failed to start in time INTERNAL_ERROR: The Spark driver failed to start within 300 seconds Cluster failed to be healthy within 200 seconds Cau...

2 min reading time
Updated May 10th, 2022 by Adam Pavlacka

How Delta cache behaves on an autoscaling cluster

This article is about how Delta cache (AWS | Azure | GCP) behaves on an auto-scaling cluster, which removes or adds nodes as needed. When a cluster downscales and terminates nodes: A Delta cache behaves in the same way as an RDD cache. Whenever a node goes down, all of the cached data in that particular node is lost. Delta cache data is not moved fr...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Apache Spark executor memory allocation

By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM) memory heap. This is controlled by the spark.executor.memory property. However, some unexpected behaviors were observed on instances with a large amount of memory allocated. As JVMs scale up in memory size, issues with the garbage collecto...

0 min reading time
Updated February 25th, 2022 by Adam Pavlacka

Troubleshooting Amazon Redshift connection problems

Problem You created a VPC peering connection and configured an Amazon Redshift cluster in the peer network. When you attempt to access the Redshift cluster, you get the following error: Error message: OperationalError: could not connect to server: Connection timed out Cause This problem can occur if: VPC peering is misconfigured. The corresponding p...

2 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Python 2 sunset status

Python.org officially moved Python 2 into EoL (end-of-life) status on January 1, 2020. What does this mean for you? Databricks Runtime 6.0 and above Databricks Runtime 6.0 and above support only Python 3. You cannot create a cluster with Python 2 using these runtimes. Any clusters created with these runtimes use Python 3 by definition. Databricks Ru...

1 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Incompatible schema in some files

Problem The Spark job fails with an exception like the following while reading Parquet files: Error in SQL statement: SparkException: Job aborted due to stage failure: Task 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0 (TID 868031, 10.111.245.219, executor 31): java.lang.UnsupportedOperationException: org.a...

1 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Rendering an R markdown file containing sparklyr code fails

Problem After you install and configure RStudio in the Databricks environment, when you launch RStudio and click the Knit button to knit a Markdown file that contains code to initialize a sparklyr context, rendering fails with the following error: failed to start sparklyr backend:object 'DATABRICKS_GUID' not found Calls: <Anonymous>… tryCatch ...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

PERMISSION_DENIED error when accessing MLflow experiment artifact

Problem You get a PERMISSION_DENIED error when trying to access an MLflow artifact using the MLflow client. RestException: PERMISSION_DENIED: User <user> does not have permission to 'View' experiment with id <experiment-id> or RestException: PERMISSION_DENIED: User <user> does not have permission to 'Edit' experiment with id <ex...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Notebook autosave fails due to file size limits

Problem Notebook autosaving fails with the following error message: Failed to save revision: Notebook size exceeds limit. This is most commonly caused by cells with large results. Remove some cells or split the notebook. Cause The maximum notebook size allowed for autosaving is 8 MB. Solution First, check the size of your notebook file using your br...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

RocksDB fails to acquire a lock

Problem You are trying to use RocksDB as a state store for your structured streaming application, when you get an error message saying that the instance could not be acquired. Caused by: java.lang.IllegalStateException: RocksDB instance could not be acquired by [ThreadId: 742, task: 140.3 in stage 3152, TID 553193] as it was not released by [ThreadI...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to extract feature information for tree-based Apache SparkML pipeline models

When you are fitting a tree-based model, such as a decision tree, random forest, or gradient boosted tree, it is helpful to be able to review the feature importance levels along with the feature names. Typically models in SparkML are fit as the last stage of the pipeline. To extract the relevant feature information from the pipeline with the tree mo...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

A file referenced in the transaction log cannot be found

Problem Your job fails with an error message: A file referenced in the transaction log cannot be found. Example stack trace: Error in SQL statement: SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 106, XXX.XXX.XXX.XXX, executor 0): com.databricks.sql.io.FileRe...

1 min reading time
Updated June 1st, 2022 by Adam Pavlacka

Unable to read files and list directories in a WASB filesystem

Problem When you try reading a file on WASB with Spark, you get the following exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 19, 10.139.64.5, executor 0): shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.a...

1 min reading time
Updated June 1st, 2022 by Adam Pavlacka

Accessing Redshift fails with NullPointerException

Problem Sometimes when you read a Redshift table: %scala val original_df = spark.read.       format("com.databricks.spark.redshift").       option("url", url).       option("user", user).       option("password", password).       option("query", query).       option("forward_spark_s3_credentials", true).       option("tempdir", "path").       load()...

1 min reading time
Updated May 20th, 2022 by Adam Pavlacka

How to parallelize R code with spark.lapply

Parallelization of R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, there is existing R code that is run locally and that is converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed com...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

AnalysisException when dropping table on Azure-backed metastore

Problem When you try to drop a table in an external Hive version 2.0 or 2.1 metastore that is deployed on Azure SQL Database, Databricks throws the following exception: com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(...

0 min reading time
Updated May 23rd, 2022 by Adam Pavlacka

Cannot grow BufferHolder; exceeds size limitation

Problem Your Apache Spark job fails with an IllegalArgumentException: Cannot grow BufferHolder error. java.lang.IllegalArgumentException: Cannot grow BufferHolder by size XXXXXXXXX because the size after growing exceeds size limitation 2147483632 Cause BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB). If a column value exceed...

0 min reading time
Updated June 1st, 2022 by Adam Pavlacka

Optimize read performance from JDBC data sources

Problem Reading data from an external JDBC database is slow. How can I improve read performance? Solution See the detailed discussion in the Databricks documentation on how to optimize performance when reading data (AWS | Azure | GCP) from an external JDBC database....

0 min reading time
Updated May 18th, 2022 by Adam Pavlacka

How to set up Apache Kafka on Databricks

This article explains how to set up Apache Kafka on AWS EC2 machines and connect them with Databricks. The following high-level steps are required to create a Kafka cluster and connect to it from Databricks notebooks. Step 1: Create a new VPC in AWS When creating the new VPC, set its CIDR range to be different from the Databricks VPC CIDR range...

1 min reading time
Updated June 1st, 2022 by Adam Pavlacka

CosmosDB-Spark connector library conflict

This article explains how to resolve an issue running applications that use the CosmosDB-Spark connector in the Databricks environment. Problem Normally if you add a Maven dependency to your Spark cluster, your app should be able to use the required connector libraries. But currently, if you simply specify the CosmosDB-Spark connector’s Maven co-ord...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Convert Python datetime object to string

There are multiple ways to display date and time values with Python; however, not all of them are easy to read. For example, when you collect a timestamp column from a DataFrame and save it as a Python variable, the value is stored as a datetime object. If you are not familiar with the datetime object format, it is not as easy to read as the common Y...
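
The difference between the two forms can be seen in a short sketch. The timestamp below is a hypothetical stand-in for a value collected from a DataFrame, and the format string is one possible choice, not necessarily the one the full article uses.

```python
from datetime import datetime

# Hypothetical value standing in for a timestamp collected from a DataFrame.
collected = datetime(2022, 5, 19, 14, 30, 5)

# repr() shows the constructor form, which is harder to scan.
raw_form = repr(collected)

# strftime() renders the same value using an explicit format string.
readable = collected.strftime("%Y-%m-%d %H:%M:%S")

print(raw_form)   # datetime.datetime(2022, 5, 19, 14, 30, 5)
print(readable)   # 2022-05-19 14:30:05
```

Any other strftime directives can be substituted in the format string to change the layout of the output.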

1 min reading time
Updated May 23rd, 2022 by Adam Pavlacka

Date functions only accept int values in Apache Spark 3.0

Problem You are attempting to use the date_add() or date_sub() functions in Spark 3.0, but they are returning an Error in SQL statement: AnalysisException error message. In Spark 2.4 and below, both functions work as normal. %sql select date_add(cast('1964-05-23' as date), '12.34') Cause You are attempting to use a fractional or string value as the ...

0 min reading time
Updated May 18th, 2022 by Adam Pavlacka

How to switch an SNS streaming job to a new SQS queue

Problem You have a Structured Streaming job running via the S3-SQS connector. Suppose you want to recreate the source SQS, backed by SNS data, and you want to proceed with a new queue to be processed in the same job and in the same output directory. Solution Use the following procedure: Create new SQS queues and subscribe to s3-events (from SNS). At...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Multi-part upload failure

Problem You observe a job failure with the exception: com.amazonaws.SdkClientException: Unable to complete multi-part upload. Individual part upload failed : Unable to execute HTTP request: Timeout waiting for connection from pool org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool ... com.amazonaws.http.Ama...

1 min reading time
Updated May 17th, 2022 by Adam Pavlacka

Cannot run notebook commands after canceling streaming cell

Problem After you cancel a running streaming cell in a notebook attached to a Databricks Runtime 5.0 cluster, you cannot run any subsequent commands in the notebook. The commands are left in the “waiting to run” state, and you must clear the notebook’s state or detach and reattach the cluster before you can successfully run commands on the notebook....

1 min reading time
Updated May 20th, 2022 by Adam Pavlacka

How to parallelize R code with gapply

Parallelization of R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, there is existing R code that is run locally and that is converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed com...

1 min reading time
Updated February 25th, 2022 by Adam Pavlacka

Unable to mount Azure Data Lake Storage Gen1 account

Problem When you try to mount an Azure Data Lake Storage (ADLS) Gen1 account on Databricks, it fails with the error: com.microsoft.azure.datalake.store.ADLException: Error creating directory / Error fetching access token Operation null failed with exception java.io.IOException : Server returned HTTP response code: 401 for URL: https://login.windows....

0 min reading time
Updated February 25th, 2022 by Adam Pavlacka

How to analyze user interface performance issues

Problem The Databricks user interface seems to be running slowly. Cause User interface performance issues typically occur due to network latency or a database query taking more time than expected. In order to troubleshoot this type of problem, you need to collect network logs and analyze them to see which network traffic is affected. In most cases, ...

1 min reading time
Updated July 22nd, 2022 by Adam Pavlacka

S3 part number must be between 1 and 10000 inclusive

Problem When you copy a large file from the local file system to DBFS on S3, the following exception can occur: Amazon.S3.AmazonS3Exception: Part number must be an integer between 1 and 10000, inclusive Cause This is an S3 limit on segment count. Part files can only be numbered from 1 to 10000, inclusive. Solution To prevent this exception from occu...
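
The arithmetic behind this limit can be sketched in Python. The file size and part size below are hypothetical values chosen for illustration, and the helpers are not part of any Databricks or AWS library; the sketch only shows when a given part size pushes a file past 10000 parts.

```python
MAX_PARTS = 10_000  # limit quoted in the error message above
MIB = 1024 ** 2
GIB = 1024 ** 3


def parts_needed(file_size_bytes, part_size_bytes):
    """Number of parts an upload needs (integer ceiling division)."""
    return -(-file_size_bytes // part_size_bytes)


def min_part_size(file_size_bytes):
    """Smallest part size, in bytes, that keeps the upload within MAX_PARTS."""
    return -(-file_size_bytes // MAX_PARTS)


# Hypothetical example: a 200 GiB file uploaded in 10 MiB parts.
file_size = 200 * GIB
print(parts_needed(file_size, 10 * MIB))                 # 20480, over the limit
print(min_part_size(file_size))                          # 21474837 bytes
print(parts_needed(file_size, min_part_size(file_size))) # 10000, within the limit
```

In other words, the exception appears whenever the file size divided by the part size exceeds 10000, so larger files need proportionally larger parts.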

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

How to populate or update columns in an existing Delta table

Problem You have an existing Delta table, with a few empty columns. You need to populate or update those columns with data from a raw Parquet file. Solution In this example, there is a customers table, which is an existing Delta table. It has an address column with missing values. The updated data exists in Parquet format. Create a DataFrame from th...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Create a cluster with Conda

Conda is a popular open source package management system for the Anaconda repo. Databricks Runtime for Machine Learning (Databricks Runtime ML) uses Conda to manage Python library dependencies. If you want to use Conda, you should use Databricks Runtime ML. Attempting to install Anaconda or Conda for use with Databricks Runtime is not supported. Fol...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Library unavailability causing job failures

Problem You are launching jobs that import external libraries and get an Import Error. When a job causes a node to restart, the job fails with the following error message: ImportError: No module named XXX Cause The Cluster Manager is part of the Databricks service that manages customer Apache Spark clusters. It sends commands to install Python and R...

1 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Enable GCM cipher suites

Databricks clusters do not have GCM (Galois/Counter Mode) cipher suites enabled by default. You must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher suites. Verify required cipher suites Use the nmap utility to verify which cipher suites are required by the external server. %sh nmap --script ssl-enu...

1 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to use Apache Spark metrics

This article gives an example of how to monitor Apache Spark components using the Spark configurable metrics system. Specifically, it shows how to set a new source and enable a sink. For detailed information about the Spark components available for metrics collection, including sinks supported out of the box, follow the documentation link above. Inf...

0 min reading time
Updated May 23rd, 2022 by Adam Pavlacka

Running C++ code in Scala

Run C++ from Scala notebook Review the Run C++ from Scala notebook....

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Listing table names

Problem To fetch all the table names from the metastore, you can use either spark.catalog.listTables() or %sql show tables. If you compare how long each takes, you can see that spark.catalog.listTables() usually takes longer than %sql show tables. Cause spark.catalog.listTables() tries to fetch every table’s metadata first and then show the reque...

0 min reading time
Updated August 23rd, 2022 by Adam Pavlacka

Configure a cluster to use a custom NTP server

By default, Databricks clusters use public NTP servers. This is sufficient for most use cases; however, you can configure a cluster to use a custom NTP server. This does not have to be a public NTP server. It can be a private NTP server under your control. A common use case is to minimize the amount of Internet traffic from your cluster. Update the NT...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Set executor log level

Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Cluster cancels Python command execution due to library conflict

Problem The cluster returns Cancelled in a Python notebook. Notebooks in all other languages execute successfully on the same cluster. Cause When you install a conflicting version of a library, such as ipython, ipywidgets, numpy, scipy, or pandas to the PYTHONPATH, then the Python REPL can break, causing all commands to return Cancelled after 30 sec...

1 min reading time
Updated June 1st, 2022 by Adam Pavlacka

Troubleshooting JDBC/ODBC access to Azure Data Lake Storage Gen2

Problem Info In general, you should use Databricks Runtime 5.2 and above, which include a built-in Azure Blob File System (ABFS) driver, when you want to access Azure Data Lake Storage Gen2 (ADLS Gen2). This article applies to users who are accessing ADLS Gen2 storage using JDBC/ODBC instead. When you run a SQL query from a JDBC or ODBC client to ac...

1 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Cannot use IAM roles with table ACL

Problem You want to use IAM roles when table ACLs are enabled, but you get an error saying credentials cannot be located. NoCredentialsError: Unable to locate credentials Cause When a table ACL is enabled, access to the EC2 instance metadata service is blocked. This is a security measure that prevents users from obtaining IAM access credentials. Sol...

0 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Install rJava and RJDBC libraries

This article explains how to install rJava and RJDBC libraries. Problem When you install rJava and RJDBC libraries with the following command in a notebook cell: %r install.packages(c("rJava", "RJDBC")) You observe the following error: ERROR: configuration failed for package 'rJava' Cause The rJava and RJDBC packages check for Java dependencies and ...

0 min reading time
Updated February 25th, 2022 by Adam Pavlacka

Access denied when writing logs to an S3 bucket

Problem When you try to write log files to an S3 bucket, you get the error: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 2F8D8A07CD8817EA), S3 Extended Request ID: Cause The DBFS mount is in an S3 bucket that assumes roles and uses sse-kms encryption. Th...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Job fails due to job rate limit

Problem A Databricks notebook or Jobs API request returns the following error: Error : {"error_code":"INVALID_STATE","message":"There were already 1000 jobs created in past 3600 seconds, exceeding rate limit: 1000 job creations per 3600 seconds."} Cause This error occurs because the number of jobs per hour exceeds the limit of 1000 established by Da...
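
One way to stay under a limit like this is to track job creations on the client side. The following Python sketch is an illustration only: the JobCreationWindow class is hypothetical, and it assumes a simple sliding window, which may not match how the service counts requests.

```python
from collections import deque


class JobCreationWindow:
    """Client-side sliding-window counter (hypothetical helper)."""

    def __init__(self, limit=1000, window_seconds=3600):
        self.limit = limit
        self.window_seconds = window_seconds
        self._stamps = deque()  # creation times, oldest first

    def try_create(self, now):
        """Record a creation at time `now` (seconds) if the window has room."""
        # Forget creations that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True


# Small numbers make the behavior easy to follow: 3 creations per 10 seconds.
window = JobCreationWindow(limit=3, window_seconds=10)
print([window.try_create(t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(window.try_create(10))                         # True
```

A caller would check try_create before each job creation request and wait, or batch the work into fewer jobs, whenever it returns False.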

0 min reading time
Updated May 17th, 2022 by Adam Pavlacka

Table creation fails with security exception

Problem You attempt to create a table using a cluster that has Table ACLs enabled, but the following error occurs: Error in SQL statement: SecurityException: User does not have permission SELECT on any file. Cause This error occurs on a Table ACL-enabled cluster if you are not an administrator and you do not have sufficient privileges to create a ta...

1 min reading time
Updated February 25th, 2022 by Adam Pavlacka

SSO server redirects to original URL, not to vanity Databricks URL

Problem When you log into Databricks using a vanity URL (such as mycompany.cloud.databricks.com ), you are redirected to a single sign-on (SSO) server for authentication. When that server redirects you back to the Databricks website, the URL changes from the vanity URL to the original deployment URL (such as dbc-XXXX.cloud.databricks.com ). This can...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Append to a DataFrame

To append to a DataFrame, use the union method. %scala val firstDF = spark.range(3).toDF("myCol") val newRow = Seq(20) val appended = firstDF.union(newRow.toDF()) display(appended) %python firstDF = spark.range(3).toDF("myCol") newRow = spark.createDataFrame([[20]]) appended = firstDF.union(newRow) display(appended)...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to perform group K-fold cross validation with Apache Spark

Cross validation randomly splits the training data into a specified number of folds. To prevent data leakage where the same data shows up in multiple folds you can use groups. scikit-learn supports group K-fold cross validation to ensure that the folds are distinct and non-overlapping. On Spark you can use the spark-sklearn library, which distribute...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Databricks job fails because library is not installed

Problem A Databricks job fails because the job requires a library that is not yet installed, causing Import errors. Cause The error occurs because the job starts running before required libraries install. If you run a job on a cluster in either of the following situations, the cluster can experience a delay in installing libraries: When you start an...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Error when installing pyodbc on a cluster

Problem One of the following errors occurs when you use pip to install the pyodbc library. java.lang.RuntimeException: Installation failed with message: Collecting pyodbc "Library installation is failing due to missing dependencies. sasl and thrift_sasl are optional dependencies for SASL or Kerberos support" Cause Although sasl and thrift_sasl are o...

1 min reading time
Updated May 23rd, 2022 by Adam Pavlacka

Error in SQL statement: AnalysisException: Table or view not found

Problem When you try to query a table or view, you get this error: AnalysisException:Table or view not found when trying to query a global temp view Cause You typically create global temp views so they can be accessed from different sessions and kept alive until the application ends. You can create a global temp view with the following statement: %s...

0 min reading time
Updated February 25th, 2022 by Adam Pavlacka

How to discover who deleted a workspace in Azure portal

If your workspace has disappeared or been deleted, you can identify which user deleted it by checking the Activity log in the Azure portal. Go to the Activity log in the Azure portal. Expand the timeline to focus on when the workspace was deleted. Filter the log for a record of the specific event. Click on the event to display information about the ...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

How to correctly update a Maven library in Databricks

Problem You make a minor update to a library in the repository, but you don’t want to change the version number because it is a small change for testing purposes. When you attach the library to your cluster again, your code changes are not included in the library. Cause One strength of Databricks is the ability to install third-party or custom libra...

0 min reading time
Updated May 20th, 2022 by Adam Pavlacka

How to persist and share code in RStudio

Problem Unlike a Databricks notebook, which has version control built in, code developed in RStudio is lost when the high concurrency cluster hosting RStudio is shut down. Solution To persist and share code in RStudio, do one of the following: From RStudio, save the code to a folder on DBFS which is accessible from both Databricks notebooks and RStudi...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Task deserialization time is high

Problem Your tasks are running slower than expected. You review the stage details in the Spark UI on your cluster and see that task deserialization time is high. Cause Cluster-installed libraries (AWS | Azure | GCP) are only installed on the driver when the cluster is started. These libraries are only installed on the executors when the first tasks ...

0 min reading time
Updated February 25th, 2022 by Adam Pavlacka

Vulnerability scan shows vulnerabilities in Databricks EC2 instances

Problem The Corporate Information Security (CIS) Vulnerability Management team identifies vulnerabilities in AWS instances that are traced to EC2 instances created by Databricks (worker AMI). Cause The Databricks security team addresses all critical vulnerabilities and updates the core and worker AMIs on a regular basis. However, if there are long-r...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Job fails when using Spark-Avro to write decimal values to AWS Redshift

Problem In Databricks Runtime versions 5.x and above, when writing decimals to Amazon Redshift using Spark-Avro as the default temp file format, either the write operation fails with the exception: Error (code 1207) while loading data into Redshift: "Invalid digit, Value '"', Pos 0, Type: Decimal" or the write operation writes nulls in place of the ...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Job fails due to cluster manager core instance request limit

Problem A Databricks Notebook or Job API returns the following error: Unexpected failure while creating the cluster for the job. Cause REQUEST_LIMIT_EXCEEDED: Your request was rejected due to API rate limit. Please retry your request later, or choose a larger node type instead. Cause The error indicates the Cluster Manager Service core instance requ...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Unexpected cluster termination

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation. Databricks ini...

3 min reading time
Updated May 31st, 2022 by Adam Pavlacka

How to handle corrupted Parquet files with different schema

Problem Let’s say you have a large list of essentially independent Parquet files, with a variety of different schemas. You want to read only those files that match a specific schema and skip the files that don’t match. One solution could be to read the files in sequence, identify the schema, and union the DataFrames together. However, this approach ...

0 min reading time
Updated March 8th, 2022 by Adam Pavlacka

Cannot access objects written by Databricks from outside Databricks

Problem When you attempt to access an object in an S3 location written by Databricks using the AWS CLI, the following error occurs: ubuntu@0213-174944-clean111-10-93-15-150:~$ aws s3 cp s3://<bucket>/<location>/0/delta/sandbox/deileringDemo__m2/_delta_log/00000000000000000000.json . fatal error: An error occurred (403) when calling the H...

1 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Job failure due to Azure Data Lake Storage (ADLS) CREATE limits

Problem When you run a job that involves creating files in Azure Data Lake Storage (ADLS), either Gen1 or Gen2, the following exception occurs: Caused by: java.io.IOException: CREATE failed with error 0x83090c25 (Files and folders are being created at too high a rate). [745c5836-264e-470c-9c90-c605f1c100f5] failed with error 0x83090c25 (Files and fo...

0 min reading time
Updated May 17th, 2022 by Adam Pavlacka

Forbidden error while accessing S3 data

Problem While trying to access S3 data using DBFS mount or directly in Spark APIs, the command fails with an exception similar to the following: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; Request ID: XXXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXXXXX, Cloud Provider: AWS, Instance ID: XXXXXXXXXX (Service: Amazon S3; Status Co...

1 min reading time
Updated February 25th, 2022 by Adam Pavlacka

Configure custom DNS settings using dnsmasq

dnsmasq is a tool for installing and configuring DNS routing rules for cluster nodes. You can use it to set up routing between your Databricks environment and your on-premise network. Warning If you use your own DNS server and it goes down, you will experience an outage and will not be able to create clusters. Use the following cluster-scoped init s...

1 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Failed to expand the EBS volume

Problem Databricks jobs fail, due to a lack of space on the disk, even though storage auto-scaling is enabled. When you review the cluster event log, you see a message stating that the instance failed to expand disk due to an authorization error. Instance i-xxxxxxxxx failed to expand disk because: You are not authorized to perform this operation. En...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

How to improve performance with bucketing

Bucketing is an optimization technique in Apache Spark SQL. Data is allocated among a specified number of buckets, according to values derived from one or more bucketing columns. Bucketing improves performance by shuffling and sorting data prior to downstream operations such as table joins. The tradeoff is the initial overhead due to shuffling and s...
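As a rough illustration of why bucketing helps joins, here is a plain Python sketch. It runs outside Spark. The CRC32 hash is a stand-in, not the hash function Spark uses, and the helper names `bucket_id` and `bucketize` and the sample rows are invented for the example. Rows are assigned to buckets by hashing the key, so two tables bucketed the same way only need to compare matching buckets.

```python
import zlib

def bucket_id(key, num_buckets):
    # Stand-in hash (CRC32) for illustration only; not Spark's hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, key_index, num_buckets):
    # Distribute rows among buckets according to the bucketing column.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_id(row[key_index], num_buckets)].append(row)
    return buckets

orders = [(1, "book"), (2, "pen"), (3, "lamp"), (1, "desk")]
users = [(1, "ana"), (2, "bo"), (3, "cy")]

order_buckets = bucketize(orders, 0, 4)
user_buckets = bucketize(users, 0, 4)

# Join bucket i of one table only against bucket i of the other.
joined = [
    (u[0], u[1], o[1])
    for ob, ub in zip(order_buckets, user_buckets)
    for u in ub
    for o in ob
    if u[0] == o[0]
]
```

Equal keys always hash to the same bucket, so the bucket-by-bucket join finds every match without comparing rows across buckets.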

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Errors when accessing MLflow artifacts without using the MLflow client

MLflow experiment permissions (AWS | Azure) are now enforced on artifacts in MLflow Tracking, enabling you to easily control access to your datasets, models, and other files. Invalid mount exception Problem When trying to access an MLflow run artifact using Databricks File System (DBFS) commands, such as dbutils.fs, you get the following error: com....

0 min reading time
Updated March 2nd, 2022 by Adam Pavlacka

How to calculate the number of cores in a cluster

You can view the number of cores in a Databricks cluster in the Workspace UI using the Metrics tab on the cluster details page. Note Azure Databricks cluster nodes must have a metrics service installed. If the driver and executors are of the same node type, you can also determine the number of cores available in a cluster programmatically, using Sca...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

How to delete all jobs using the REST API

Run the following commands to delete all jobs in a Databricks workspace. Identify the jobs to delete and list them in a text file: %sh curl -X GET -H "Authorization: Bearer <token>" https://<databricks-instance>/api/2.0/jobs/list | grep -o -P 'job_id.{0,6}' | awk -F':' '{print $2}' >> job_id.txt Run the curl command in a loop to delete the identif...
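The fixed-width grep pattern in the listing step may truncate longer job IDs, so parsing the JSON response is a more robust way to collect them. The Python sketch below assumes the list response carries a top-level `jobs` array whose entries each have a `job_id` field. The helper name `extract_job_ids` and the sample payload are invented for illustration. It only extracts IDs and does not call the API.

```python
import json

def extract_job_ids(response_text):
    """Hypothetical helper: return the job IDs found in a jobs/list response body."""
    payload = json.loads(response_text)
    # A workspace with no jobs is assumed to omit the "jobs" key.
    return [job["job_id"] for job in payload.get("jobs", [])]

# Made-up response body, shaped like the assumed jobs/list output.
sample = (
    '{"jobs": ['
    '{"job_id": 101, "settings": {"name": "etl"}}, '
    '{"job_id": 1234567, "settings": {"name": "report"}}'
    ']}'
)
job_ids = extract_job_ids(sample)
```

The resulting list can be written to a file, one ID per line, and fed to the same deletion loop.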

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Fitting an Apache SparkML model throws error

Problem Databricks throws an error when fitting a SparkML model or Pipeline: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 4 times, most recent failure: Lost task 0.3 in stage 162.0 (TID 168, 10.205.250.130, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfu...

0 min reading time
Updated May 17th, 2022 by Adam Pavlacka

Apache Spark DStream is not supported

Problem You are attempting to use a Spark Discretized Stream (DStream) in a Databricks streaming job, but the job is failing. Cause DStreams and the DStream API are not supported by Databricks. Solution Instead of using Spark DStream, you should migrate to Structured Streaming. Review the Databricks Structured Streaming in production (AWS | Azure | ...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Admin user cannot restart cluster to run job

Problem When a user who has permission to start a cluster, such as a Databricks Admin user, submits a job that is owned by a different user, the job fails with the following message: Message: Run executed on existing cluster ID <cluster id> failed because of insufficient permissions. The error received from the cluster manager was: 'You are no...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Failure when mounting or accessing Azure Blob storage

Problem When you try to access an already created mount point or create a new mount point, it fails with the error: WASB: Fails with java.lang.NullPointerException Cause This error can occur when the root mount path (such as /mnt/) is also mounted to blob storage. Run the following command to check if the root path is also mounted: %python dbutils.f...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Generate schema from case class

Spark provides an easy way to generate a schema from a Scala case class. For case class A, use the method ScalaReflection.schemaFor[A].dataType.asInstanceOf[StructType]. For example: %scala import org.apache.spark.sql.types.StructType import org.apache.spark.sql.catalyst.ScalaReflection case class A(key: String, time: java.sql.Timestamp, date: java....

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

EBS leaked volumes

Problem After a cluster is terminated on AWS, some EBS volumes are not deleted automatically. These stray, unattached EBS volumes are often referred to as “leaked” volumes. Cause Databricks always sets DeletionOnTermination=true for the EBS volumes it creates when it launches clusters. Therefore, whenever a cluster instance is terminated, AWS should...

0 min reading time
Updated June 1st, 2022 by Adam Pavlacka

Redshift JDBC driver conflict issue

Problem If you attach multiple Redshift JDBC drivers to a cluster, and use the Redshift connector, the notebook REPL might hang or crash with a SQLDriverWrapper error message. 19/11/14 01:01:44 ERROR SQLDriverWrapper: Fatal non-user error thrown in ReplId-9d455-9b970-b2042 java.lang.NoSuchFieldError: PG_SUBPROTOCOL_NAMES         at com.amazon.redshi...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Delta Lake write job fails with java.lang.UnsupportedOperationException

Problem Delta Lake write jobs sometimes fail with the following exception: java.lang.UnsupportedOperationException: com.databricks.backend.daemon.data.client.DBFSV1.putIfAbsent(path: Path, content: InputStream). DBFS v1 doesn't support transactional writes from multiple clusters. Please upgrade to DBFS v2. Or you can disable multi-cluster writes by ...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Increase the number of tasks per stage

When using the spark-xml package, you can increase the number of tasks per stage by changing the configuration setting spark.hadoop.mapred.max.split.size to a lower value in the cluster’s Spark config (AWS | Azure). This configuration setting controls the input block size. When data is read from DBFS, it is divided into input blocks, which are then...
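The relationship between split size and task count can be sketched with simple arithmetic. This is a simplification that assumes one task per input block and ignores file boundaries and format-specific behavior. The helper name `estimated_tasks` and the sizes are illustrative, not measured values.

```python
def estimated_tasks(total_bytes, max_split_bytes):
    """Rough estimate (assumption): one task per input block, rounded up."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // max_split_bytes)

MIB = 1024 ** 2
one_gib = 1024 * MIB

tasks_large_split = estimated_tasks(one_gib, 128 * MIB)
tasks_small_split = estimated_tasks(one_gib, 32 * MIB)
```

Under these assumptions, lowering the split size from 128 MiB to 32 MiB raises the estimate for a 1 GiB input from 8 tasks to 32.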

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Apache Spark job doesn’t start

Problem No Spark jobs start, and the driver logs contain the following error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources Cause This error can occur when the executor memory and number of executor cores are set explicitly on the Spark Config tab. Here is a samp...

1 min reading time
Updated May 18th, 2022 by Adam Pavlacka

Handling partition column values while using an SQS queue as a streaming source

Problem If data in S3 is stored by partition, the partition column values are used to name folders in the source directory structure. However, if you use an SQS queue as a streaming source, the S3-SQS source cannot detect the partition column values. For example, if you save the following DataFrame to S3 in JSON format: %scala val df = spark.range(1...

0 min reading time
Updated May 17th, 2022 by Adam Pavlacka

Streaming with File Sink: Problems with recovery if you change checkpoint or output directories

When you stream data into a file sink, you should always change both checkpoint and output directories together. Otherwise, you can get failures or unexpected outputs. Apache Spark creates a folder inside the output directory named _spark_metadata. This folder contains write-ahead logs for every batch run. This is how Spark gets exactly-once guarant...

0 min reading time
Updated June 1st, 2022 by Adam Pavlacka

ABFS client hangs if incorrect client ID or wrong path used

Problem You are using Azure Data Lake Storage (ADLS) Gen2. When you try to access an Azure Blob File System (ABFS) path from a Databricks cluster, the command hangs. Enable the debug log and you can see the following stack trace in the driver logs: Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: https://login.microso...

1 min reading time
Updated July 7th, 2022 by Adam Pavlacka

Operation not supported during append

Problem You are attempting to append data to a file saved on an external storage mount point and are getting an error message: OSError: [Errno 95] Operation not supported. The error occurs when trying to append to a file from both Python and R. Cause Direct appends and random writes are not supported in FUSE v2, which is available in Databricks Runt...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

OSError when accessing MLflow experiment artifacts

Problem You get an OSError: No such file or directory error message when trying to download or log artifacts using one of the following: MlflowClient.download_artifacts() mlflow.[flavor].log_model() mlflow.[flavor].load_model() mlflow.log_artifacts() OSError: No such file or directory: '/dbfs/databricks/mlflow-tracking/<experiment-id>/<run-...

0 min reading time
Updated June 1st, 2022 by Adam Pavlacka

Apache Spark JDBC datasource query option doesn’t work for Oracle database

Problem When you use the query option with the Apache Spark JDBC datasource to connect to an Oracle Database, it fails with this error: java.sql.SQLSyntaxErrorException: ORA-00911: invalid character For example, if you run the following to make a JDBC connection: %scala val df = spark.read   .format("jdbc")   .option("url", "<url>")   .option(...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Behavior of the randomSplit method

When using randomSplit on a DataFrame, you could potentially observe inconsistent behavior. Here is an example: %python df = spark.read.format('inconsistent_data_source').load() a,b = df.randomSplit([0.5, 0.5]) a.join(broadcast(b), on='id', how='inner').count() Typically this query returns 0. However, depending on the underlying data source or input...

0 min reading time
Updated May 9th, 2022 by Adam Pavlacka

Invalid Access Token error when running jobs with Airflow

Problem When you run scheduled Airflow Databricks jobs, you get this error: Invalid Access Token : 403 Forbidden Error Cause To run or schedule Databricks jobs through Airflow, you need to configure the Databricks connection using the Airflow web UI. Any of the following incorrect settings can cause the error: Set the host field to the Databricks wo...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Serialized task is too large

If you see the following error message, you may be able to fix this error by changing the Spark config (AWS | Azure) when you start the cluster. Serialized task XXX:XXX was XXX bytes, which exceeds max allowed: spark.rpc.message.maxSize (XXX bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values. To change ...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Identify less used jobs

The workspace has a limit on the number of jobs that can be shown in the UI. The current job limit is 1000. If you exceed the job limit, you receive a QUOTA_EXCEEDED error message. 'error_code':'QUOTA_EXCEEDED','message':'The quota for the number of jobs has been reached. The current quota is 1000. This quota is only applied to jobs created through ...

1 min reading time
Updated March 4th, 2022 by Adam Pavlacka

How to handle blob data contained in an XML file

If you log events in XML format, then every XML event is recorded as a base64 string. In order to run analytics on this data using Apache Spark, you need to use the spark_xml library and the BASE64DECODER API to transform the data for analysis. Problem You need to analyze base64-encoded strings from an XML-formatted log file using Spark. For example...
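The decoding step itself can be illustrated with Python's standard library. This is a sketch of the idea only, not the spark_xml approach the article describes. The element names `events` and `event` and the payload strings are invented, and the code runs on a single machine rather than in Spark.

```python
import base64
import xml.etree.ElementTree as ET

# Build a made-up XML log in which each event body is a base64 string.
payloads = ["hello world", "spark"]
encoded = [base64.b64encode(p.encode("utf-8")).decode("ascii") for p in payloads]
xml_log = "<events>" + "".join(f"<event>{e}</event>" for e in encoded) + "</events>"

# Parse the XML and decode every event body back to readable text.
root = ET.fromstring(xml_log)
decoded = [base64.b64decode(e.text).decode("utf-8") for e in root.iter("event")]
```

Once the strings are decoded, the events can be analyzed like any other text column.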

1 min reading time
Updated July 1st, 2022 by Adam Pavlacka

Apache Spark jobs fail with Environment directory not found error

Problem After you install a Python library (via the cluster UI or by using pip), your Apache Spark jobs fail with an Environment directory not found error message. org.apache.spark.SparkException: Environment directory not found at /local_disk0/.ephemeral_nfs/cluster_libraries/python Cause Libraries are installed on a Network File System (NFS) on th...

0 min reading time
Updated May 23rd, 2022 by Adam Pavlacka

Disable broadcast when query plan has BroadcastNestedLoopJoin

This article explains how to disable broadcast when the query plan has BroadcastNestedLoopJoin in the physical plan. You expect the broadcast to stop after you disable the broadcast threshold, by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark tries to broadcast the bigger table and fails with a broadcast error. This behavior is...

1 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to set up an embedded Apache Hive metastore

You can set up a Databricks cluster to use an embedded metastore. You can use an embedded metastore when you only need to retain table metadata during the life of the cluster. If the cluster is restarted, the metadata is lost. If you need to persist the table metadata or other data after a cluster restart, then you should use the default metastore o...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Stream XML files using an auto-loader

Apache Spark does not include a streaming API for XML files. However, you can combine the auto-loader features of the Spark batch API with the OSS library, Spark-XML, to stream XML files. In this article, we present a Scala-based solution that parses XML data using an auto-loader. Install Spark-XML library You must install the Spark-XML OSS library ...

1 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Best practices for dropping a managed Delta Lake table

Regardless of how you drop a managed table, it can take a significant amount of time, depending on the data size. Delta Lake managed tables in particular contain a lot of metadata in the form of transaction logs, and they can contain duplicate data files. If a Delta table has been in use for a long time, it can accumulate a very large amount of data...

0 min reading time
Updated May 17th, 2022 by Adam Pavlacka

How to send email or SMS messages from Databricks notebooks

You may need to send a notification to a set of recipients from a Databricks notebook. For example, you may want to send email based on matching business rules or based on a command’s success or failure. This article describes two approaches to sending email or SMS messages from a notebook. Both examples use Python notebooks: Send email or SMS messa...

1 min reading time
Updated March 8th, 2022 by Adam Pavlacka

How to calculate the Databricks file system (DBFS) S3 API call cost

The cost of a DBFS S3 bucket is primarily driven by the number of API calls, and secondarily by the cost of storage. You can use the AWS CloudTrail logs to create a table, count the number of API calls, and thereby calculate the exact cost of the API requests. Obtain the following information. You may need to contact your AWS Administrator to get it...

1 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Distinguish active and dead jobs

Problem On clusters where there are too many concurrent jobs, you often see some jobs stuck in the Spark UI without any progress. This complicates identifying which are the active jobs/stages versus the dead jobs/stages. Cause Whenever there are too many concurrent jobs running on a cluster, there is a chance that the Spark internal eventListenerBus...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Job fails with atypical errors message

Problem Your job run fails with a throttled due to observing atypical errors error message. Cluster became unreachable during run Cause: xxx-xxxxxx-xxxxxxx is throttled due to observing atypical errors Cause The jobs on this cluster have returned too many large results to the Apache Spark driver node. As a result, the chauffeur service runs out of m...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Cannot uninstall library from UI

Problem Usually, libraries can be uninstalled in the Clusters UI. If the checkbox to select the library is disabled, then it’s not possible to uninstall the library from the UI. Cause If you create a library using REST API version 1.2 and if auto-attach is enabled, the library is installed on all clusters. In this scenario, the Clusters UI checkbox ...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Experiment warning when custom artifact storage location is used

Problem When you create an MLflow experiment with a custom artifact location, you get the following warning: Cause MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. MLflow cannot guarantee the enforcement of access controls on ar...

0 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Cannot modify the value of an Apache Spark config

Problem You are trying to SET the value of a Spark config in a notebook and get a Cannot modify the value of a Spark config error. For example: %sql SET spark.serializer=org.apache.spark.serializer.KryoSerializer Error in SQL statement: AnalysisException: Cannot modify the value of a Spark config: spark.serializer; Cause The SET command does not wor...

0 min reading time
Updated May 23rd, 2022 by Adam Pavlacka

Multiple Apache Spark JAR jobs fail when run concurrently

Problem If you run multiple Apache Spark JAR jobs concurrently, some of the runs might fail with the error: org.apache.spark.sql.AnalysisException: Table or view not found: xxxxxxx; line 1 pos 48 Cause This error occurs due to a bug in Scala. When an object extends App, its val fields are no longer immutable and they can be changed when the main met...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Japanese character support in external metastore

Problem You are trying to use Japanese characters in your tables, but keep getting errors. Create a table with the OPTIONS keyword OPTIONS provides extra metadata to the table. You try creating a table with OPTIONS and specify the charset as utf8mb4. %sql CREATE TABLE default.JPN_COLUMN_NAMES('作成年月' string ,'計上年月' string ,'所属コード' string ,'生保代理店コード_8...

1 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Remove Log4j 1.x JMSAppender and SocketServer classes from classpath

Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. Databricks also does not use the affected classes from Log4j 1.x with known vul...

2 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Monitor running jobs with a Job Run dashboard

The Job Run dashboard is a notebook that displays information about all of the jobs currently running in your workspace. To configure the dashboard, you must have permission to attach a notebook to an all-purpose cluster in the workspace you want to monitor. If an all-purpose cluster does not exist, you must have permission to create one. Once the d...

1 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Experiment warning when legacy artifact storage location is used

Problem A new icon appears on the MLflow Experiments page with the following open access warning: Cause MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. In MLflow 1.11 and above, new experiments store artifacts in an MLflow-mana...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

How to save Plotly files and display From DBFS

You can save a chart generated with Plotly to the driver node as a jpg or png file. Then, you can display it in a notebook by using the displayHTML() method. By default, you save Plotly charts to the /databricks/driver/ directory on the driver node in your cluster. Use the following procedure to display the charts at a later time. Generate a sample ...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Apache Spark Jobs hang due to non-deterministic custom UDF

Problem Sometimes Apache Spark jobs hang indefinitely due to the non-deterministic behavior of a Spark User-Defined Function (UDF). Here is an example of such a function: %scala val convertorUDF = (commentCol: String) => { /* UDF definition */ } val translateColumn = udf(convertorUDF) If you call this UDF using the withColumn() A...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to create table DDLs to import into an external metastore

Databricks supports using external metastores instead of the default Hive metastore. You can export all table metadata from Hive to the external metastore. Use the Apache Spark Catalog API to list the tables in the databases contained in the metastore. Use the SHOW CREATE TABLE statement to generate the DDLs and store them in a file. Use the file to...

0 min reading time
Updated May 18th, 2022 by Adam Pavlacka

Get the path of files consumed by Auto Loader

When you process streaming files with Auto Loader (AWS | Azure | GCP), events are logged based on the files created in the underlying storage. This article shows you how to add the file path for every filename to a new column in the output DataFrame. One use case for this is auditing. When files are ingested to a partitioned folder structure there i...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Common errors in notebooks

There are some common issues that occur when using notebooks. This section outlines some of the frequently asked questions and best practices that you should follow. Spark job fails with java.lang.NoClassDefFoundError Sometimes you may come across an error like: %scala java.lang.NoClassDefFoundError: Could not initialize class line.....$read$ This c...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Reading large DBFS-mounted files using Python APIs

This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs. Problem If you mount a folder onto dbfs:// and read a file larger than 2 GB with a Python API such as pandas, you will see the following error: /databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextRead...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to check if a spark property is modifiable in a notebook

Problem You can tune applications by setting various configurations. Some configurations must be set at the cluster level, whereas some are set inside notebooks or applications. Solution To check if a particular Spark configuration can be set in a notebook, run the following command in a notebook cell: %scala spark.conf.isModifiable("spark.databrick...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Python command execution fails with AttributeError

This article can help you resolve scenarios in which Python command execution fails with an AttributeError. Problem: 'tuple' object has no attribute 'type' When you run a notebook, Python command execution fails with the following error and stack trace: AttributeError: 'tuple' object has no attribute 'type' Traceback (most recent call last): File "/...

3 min reading time
Updated May 18th, 2022 by Adam Pavlacka

How to restart a structured streaming query from last written offset

Scenario You have a stream, running a windowed aggregation query, that reads from Apache Kafka and writes files in Append mode. You want to upgrade the application and restart the query with the offset equal to the last written offset. You want to discard all state information that hasn’t been written to the sink, start processing from the earliest ...

1 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Resolving package or namespace loading error

This article explains how to resolve a package or namespace loading error. Problem When you install and load some libraries in a notebook cell, like: %r library(BreakoutDetection) You may get a package or namespace error: Loading required package: BreakoutDetection: Error : package or namespace load failed for ‘BreakoutDetection’ in loadNamespace(i,...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Create table in overwrite mode fails when interrupted

Problem When you attempt to rerun an Apache Spark write operation by cancelling the currently running job, the following error occurs: Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.` testtable`'). The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_ testtable) already exists.; Caus...

0 min reading time
Updated February 25th, 2022 by Adam Pavlacka

How to discover who deleted a cluster in Azure portal

If a cluster in your workspace has disappeared or been deleted, you can identify which user deleted it by running a query in the Log Analytics workspaces service in the Azure portal. Note If you do not have an analytics workspace set up, you must configure Diagnostic Logging in Azure Databricks before you continue. Load the Log Analytics workspaces ...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Checkpoint files not being deleted when using foreachBatch()

Problem You have a streaming job using foreachBatch() to process DataFrames. %scala streamingDF.writeStream.outputMode("append").foreachBatch { (batchDF: DataFrame, batchId: Long) =>   batchDF.write.format("parquet").mode("overwrite").save(output_directory) }.start() Checkpoint files are being created, but are not being deleted. You can verify th...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Drop tables with corrupted metadata from the metastore

Problem Sometimes you cannot drop a table from the Databricks UI. Using %sql or spark.sql to drop the table doesn’t work either. Cause The metadata (table schema) stored in the metastore is corrupted. When you run the DROP TABLE command, Spark checks whether the table exists before dropping it. Since the metadata is corrupted for the table, Spark c...

0 min reading time
Updated May 17th, 2022 by Adam Pavlacka

Append output is not supported without a watermark

Problem You are performing an aggregation using append mode and an exception error message is returned. Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark Cause You cannot use append mode on an aggregated DataFrame without a watermark. This is by design. Solution You must apply a...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to speed up cross-validation

Hyperparameter tuning of Apache SparkML models can take a very long time, depending on the size of the parameter grid. You can improve the performance of the cross-validation step in SparkML to speed things up: Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multi...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

How to list and delete files faster in Databricks

Scenario Suppose you need to delete a table that is partitioned by year, month, date, region, and service. However, the table is huge, and there will be around 1000 part files per partition. You can list all the files in each partition and then delete them using an Apache Spark job. For example, suppose you have a table that is partitioned by a, b, ...

3 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Fix the version of R packages

When you use the install.packages() function to install CRAN packages, you cannot specify the version of the package, because the expectation is that you will install the latest version of the package and it should be compatible with the latest version of its dependencies. If you have an outdated dependency installed, it will be updated as well. Som...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

How to troubleshoot several Apache Hive metastore problems

Problem 1: External metastore tables not available When you inspect the driver logs, you see a stack trace that includes the error Required table missing: WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in no possible candidates Required table missing: "DBS" in Catalog "" Schema "". DataNu...

2 min reading time
Updated February 25th, 2022 by Adam Pavlacka

Unable to load AWS credentials

Problem When you try to access AWS resources like S3, SQS or Redshift, the operation fails with the error: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [BasicAWSCredentialsProvider: Access key or secret key is null, com.amazonaws.auth.InstanceProfileCredentialsProvider@a590007a: The requested metad...

0 min reading time
Updated May 11th, 2022 by Adam Pavlacka

Apache Spark job fails with maxResultSize exception

Problem A Spark job fails with a maxResultSize exception: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB) Cause This error occurs because the configured size limit was exceeded. The size limit applies to the total serialized ...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Nulls and empty strings in a partitioned column save as nulls

Problem If you save data containing both empty strings and null values in a column on which the table is partitioned, both values become null after writing and reading the table. To illustrate this, create a simple DataFrame: %scala import org.apache.spark.sql.types._ import org.apache.spark.sql.catalyst.encoders.RowEncoder val data = Seq(Row(1, "")...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Prevent duplicated columns when joining two DataFrames

If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don’t have duplicated columns. Join on columns If you join on columns, you get duplicated columns. Scala %scala val llist...

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Access denied when writing Delta Lake tables to S3

Problem Writing DataFrame contents in Delta Lake format to an S3 location can cause an error: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C827672D85516BA9; S3 Extended Request ID: Cause A write operation involving the Delta Lake format requires permissions...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Checkpoint files not being deleted when using display()

Problem You have a streaming job using display() to display DataFrames. %scala val streamingDF = spark.readStream.schema(schema).parquet(<input_path>) display(streamingDF) Checkpoint files are being created, but are not being deleted. You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder. Ch...

0 min reading time
Updated May 20th, 2022 by Adam Pavlacka

Convert nested JSON to a flattened DataFrame

This article shows you how to flatten nested JSON, using only $"column.*" and explode methods. Sample JSON file Pass the sample JSON string to the reader. %scala val json =""" {         "id": "0001",         "type": "donut",         "name": "Cake",         "ppu": 0.55,         "batters":                 {                         "batter":           ...

1 min reading time
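The flattening idea described in this entry can be illustrated outside Spark. The following is a minimal plain-Python sketch, not the article's Scala code: `flatten` mimics selecting `column.*` on a struct, and `explode` mimics Spark's explode on an array column. The function names and the contents of the `batter` array are hypothetical, chosen only for illustration.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into structs, similar to selecting "column.*".
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def explode(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    out = []
    for row in rows:
        for item in row[column]:
            new_row = {k: v for k, v in row.items() if k != column}
            if isinstance(item, dict):
                new_row.update(flatten(item, prefix=f"{column}."))
            else:
                new_row[column] = item
            out.append(new_row)
    return out


# Hypothetical sample; the batter entries are assumptions, not the article's data.
sample = json.loads("""
{"id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55,
 "batters": {"batter": [{"id": "1001", "type": "Regular"},
                        {"id": "1002", "type": "Chocolate"}]}}
""")

rows = explode([flatten(sample)], "batters.batter")
for row in rows:
    print(row)
```

Each element of the nested array becomes its own row, and the top-level fields are repeated on every row, which is the same shape a flattened DataFrame would have.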
Updated May 11th, 2022 by Adam Pavlacka

Maximum execution context or notebook attachment limit reached

Problem Notebook or job execution stops and returns either of the following errors: Run result unavailable: job failed with error message Context ExecutionContextId(1731742567765160237) is disconnected. Can’t attach this notebook because the cluster has reached the attached notebook limit. Detach a notebook and retry. Cause When you attach a noteboo...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Verify the version of Log4j on your cluster

Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. If you are using Log4j within your cluster (for example, if you are processing ...

2 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Spark job fails with Driver is temporarily unavailable

Problem A Databricks notebook returns the following error: Driver is temporarily unavailable This issue may or may not be intermittent. A related error message is: Lost connection to cluster. The notebook may have been detached. Cause One common cause for this error is that the driver is experiencing a memory bottleneck. When this happens, the driver cras...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

How to overwrite log4j configurations on Databricks clusters

Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...

0 min reading time
Updated May 19th, 2022 by Adam Pavlacka

List all workspace objects

You can use the Databricks Workspace API (AWS | Azure | GCP) to recursively list all workspace objects under a given path. Common use cases for this include: Indexing all notebook names and types for all users in your workspace. Use the output, in conjunction with other API calls, to delete unused workspaces or to manage notebooks. Dynamically get t...

1 min reading time
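The recursive listing described in this entry can be sketched as follows. This is an illustrative outline, not the article's code: `walk_workspace`, `fake_list`, and `FAKE_TREE` are hypothetical names, and the listing function is a stub standing in for a real call to the Workspace API list endpoint. The sketch assumes each listed object carries a `path` and an `object_type`, with directories marked `DIRECTORY`.

```python
from typing import Callable, Dict, Iterator, List

WorkspaceObject = Dict[str, str]


def walk_workspace(
    path: str, list_fn: Callable[[str], List[WorkspaceObject]]
) -> Iterator[WorkspaceObject]:
    """Yield every object under `path`, descending into directories."""
    for obj in list_fn(path):
        yield obj
        if obj.get("object_type") == "DIRECTORY":
            yield from walk_workspace(obj["path"], list_fn)


# Stub data standing in for API responses (hypothetical paths).
FAKE_TREE = {
    "/Users": [{"path": "/Users/a@example.com", "object_type": "DIRECTORY"}],
    "/Users/a@example.com": [
        {"path": "/Users/a@example.com/etl", "object_type": "NOTEBOOK"},
        {"path": "/Users/a@example.com/lib", "object_type": "DIRECTORY"},
    ],
    "/Users/a@example.com/lib": [],
}


def fake_list(path: str) -> List[WorkspaceObject]:
    """Stand-in for an HTTP call that lists one workspace directory."""
    return FAKE_TREE.get(path, [])


objects = list(walk_workspace("/Users", fake_list))
notebooks = [o["path"] for o in objects if o["object_type"] == "NOTEBOOK"]
print(notebooks)
```

Passing the listing function in as a parameter keeps the traversal logic separate from authentication and HTTP details, so the stub can be swapped for a real client without changing the recursion.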
Updated May 17th, 2022 by Adam Pavlacka

Troubleshooting unresponsive Python notebooks or canceled commands

This article provides an overview of troubleshooting steps you can take if a notebook is unresponsive or cancels commands. Check metastore connectivity Problem Simple commands in newly-attached notebooks fail, but succeed in notebooks that were attached to the same cluster earlier. Troubleshooting steps Check metastore connectivity. The inability to...

0 min reading time
Updated May 16th, 2022 by Adam Pavlacka

Data too long for column error

Problem You are trying to insert a struct into a table, but you get a java.sql.SQLException: Data too long for column error. Caused by: java.sql.SQLException: Data too long for column 'TYPE_NAME' at row 1 Query is: INSERT INTO COLUMNS_V2 (CD_ID,COMMENT,`COLUMN_NAME`,TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?) , parameters [103182,<null>,'address...

1 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Delta Merge cannot resolve nested field

Problem You are attempting a Delta Merge with automatic schema evolution, but it fails with a Delta Merge: cannot resolve 'field' due to data type mismatch error message. Cause This can happen if you have made changes to the nested column fields. For example, assume we have a column called Address with the fields streetName, houseNumber, and city ne...

0 min reading time
Updated May 18th, 2022 by Adam Pavlacka

Kafka error: No resolvable bootstrap urls

Problem You are trying to read or write data to a Kafka stream when you get an error message. kafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer Caused by: kafkashaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers If you are running a notebook, the error me...

0 min reading time
Updated March 8th, 2022 by Adam Pavlacka

Cannot read Databricks objects stored in the DBFS root directory

Problem An Access Denied error returns when you attempt to read Databricks objects stored in the DBFS root directory in blob storage from outside a Databricks cluster. Cause This is normal behavior for the DBFS root directory. Databricks stores objects like libraries and other temporary system files in the DBFS root directory. Databricks is the only...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Hive UDFs

This article shows how to create a Hive UDF, register it in Spark, and use it in a Spark SQL query. Here is a Hive UDF that takes a long as an argument and returns its hexadecimal representation. %scala import org.apache.hadoop.hive.ql.exec.UDF import org.apache.hadoop.io.LongWritable // This UDF takes a long integer and converts it to a hexadecimal...

0 min reading time
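The conversion this UDF performs, turning a long into its hexadecimal representation, can be shown in a few lines of plain Python. This sketch is not the Hive UDF itself; `to_hex` is a hypothetical helper, and rendering negative values as 64-bit two's complement (the behavior of Java's `Long.toHexString`) is an assumption about what the UDF does.

```python
def to_hex(value: int) -> str:
    """Return the hex string of a signed 64-bit integer.

    Negative values are rendered in two's complement (an assumption here).
    """
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit long")
    # Masking maps negative values onto their unsigned 64-bit bit pattern.
    return format(value & 0xFFFFFFFFFFFFFFFF, "x")


print(to_hex(255))  # ff
print(to_hex(-1))   # ffffffffffffffff
```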
Updated May 19th, 2022 by Adam Pavlacka

Install and compile Cython

This document explains how to run Spark code with compiled Cython code. The steps are as follows: Create an example Cython module on DBFS (AWS | Azure). Add the file to the Spark session. Create a wrapper method to load the module on the executors. Run the mapper on a sample dataset. Generate a larger dataset and compare the performance with nat...

2 min reading time
Updated May 19th, 2022 by Adam Pavlacka

Run C++ code in Python

Run C++ from Python example notebook Review the Run C++ from Python notebook to learn how to compile C++ code and run it on a cluster....

0 min reading time
Updated May 10th, 2022 by Adam Pavlacka

Delta Lake UPDATE query fails with IllegalState exception

Problem When you execute a Delta Lake UPDATE, DELETE, or MERGE query that uses Python UDFs in any of its transformations, it fails with the following exception: AWS java.lang.UnsupportedOperationException: Error in SQL statement: IllegalStateException: File (s3a://xxx/table1) to be rewritten not found among candidate files: s3a://xxx/table1/part-000...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

How to specify skew hints in dataset and DataFrame-based join commands

When you perform a join command with DataFrame or Dataset objects, if you find that the query is stuck on finishing a small number of tasks due to data skew, you can specify the skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization (AWS | Azure | GCP) is performed on the DataFrame for which you specify the skew hint. In...

0 min reading time
Updated May 31st, 2022 by Adam Pavlacka

Access denied when writing to an S3 bucket using RDD

Problem Writing to an S3 bucket using RDDs fails. The driver node can write, but the worker (executor) node returns an access denied error. Writing with the DataFrame API, however, works fine. For example, let’s say you run the following code: %scala import java.io.File import java.io.Serializable import org.apache.spark.{SparkConf, SparkContext} imp...

1 min reading time
Updated May 31st, 2022 by Adam Pavlacka

How to update nested columns

Spark doesn’t support adding new columns or dropping existing columns in nested structures. In particular, the withColumn and drop methods of the Dataset class don’t allow you to specify a column name different from any top level columns. For example, suppose you have a dataset with the following schema: %scala val schema = (new StructType)       .a...

0 min reading time
Updated March 4th, 2022 by Adam Pavlacka

Null column values display as NaN

Problem You have a table with null values in some columns. When you query the table using a select statement in Databricks, the null values appear as null. When you query the table using the same select statement in Databricks SQL, the null values appear as NaN. %sql select * from default.<table-name> where <column-name> is null Databric...

0 min reading time