Find your workspace ID
Everything you do in Databricks occurs within a workspace. When you use the web UI you are interacting with clusters and notebooks in the workspace. When you run automated jobs or connect to your workspace outside of the web UI you may need to know your workspace ID. This article covers two different ways to easily find your workspace ID. Instructio...
Cannot start Databricks workspace from Google Cloud console
Problem You are not able to launch Databricks workspaces from the Google Cloud console. The button to launch workspaces is grayed out with the message Your subscription is not active. Cause You do not have an active Databricks subscription on this Google account. The most likely cause is the creation of two (or more) trial accounts, with different b...
Failed to add user error due to email or username already existing with a different case
Problem You are trying to add a new user to your Databricks workspace via the Admin Console, but it fails with an error saying that the email or username already exists with a different case. Failed to add user: A user with email <email-address> and username <username> in difference cases already exists in the account. You may also get t...
Setup cross account bucket access in Google Cloud
Introduction When you set up a Databricks workspace on your Google Cloud Platform (GCP) account, you have access to the account's storage buckets by default. More complex use cases are likely to involve access to storage buckets that are owned by different GCP accounts. This article shows you how to use the Google Cloud SDK gsutil commands to set up c...
Configure Simba ODBC driver with a proxy in Windows
In this article you learn how to configure the Databricks ODBC Driver when your local Windows machine is behind a proxy server. Download the Simba driver for Windows Download and install the latest version of the Databricks ODBC Driver for Windows. Add proxy settings to the Windows registry Open the Windows registry and add the proxy settings to the...
Troubleshooting JDBC and ODBC connections
This article provides information to help you troubleshoot the connection between ...
Install a private PyPI repo
Certain use cases may require you to install libraries from private PyPI repositories. If you are installing from a public repository, you should review the library documentation. This article shows you how to configure an example init script that authenticates and downloads a PyPI library from a private repository. Create init script Create (or ver...
Cannot apply updated cluster policy
Problem You are attempting to update an existing cluster policy; however, the update does not apply to the cluster associated with the policy. If you attempt to edit a cluster that is managed by a policy, the changes are not applied or saved. Cause This is a known issue that is being addressed. Solution You can use a workaround until a permanent fix ...
Cluster Apache Spark configuration not applied
Problem Your cluster’s Spark configuration values are not applied. Cause This happens when the Spark config values are declared in the cluster configuration as well as in an init script. When Spark config values are located in more than one place, the configuration in the init script takes precedence and the cluster ignores the configuration setting...
Admin user cannot restart cluster to run job
Problem When a user who has permission to start a cluster, such as a Databricks Admin user, submits a job that is owned by a different user, the job fails with the following message: Message: Run executed on existing cluster ID <cluster id> failed because of insufficient permissions. The error received from the cluster manager was: 'You are no...
Cluster fails to start with dummy does not exist error
Problem You try to start a cluster, but it fails to start. You get an Apache Spark error message. Internal error message: Spark error: Driver down You review the cluster driver and worker logs and see an error message containing java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist. 21/07/14 21:44:06 ERROR DriverDaemon$: X...
Cluster slowdown due to Ganglia metrics filling root partition
Note This article applies to Databricks Runtime 7.3 LTS and below. Problem Clusters start slowing down and may show a combination of the following symptoms: Unhealthy cluster events are reported: Request timed out. Driver is temporarily unavailable. Metastore is down. DBFS is down. You do not see any high GC events or memory utilization associated w...
Failed to create cluster with invalid tag value
Problem You are trying to create a cluster, but it is failing with an invalid tag value error message. System.Exception: Content={"error_code":"INVALID_PARAMETER_VALUE","message":"\nInvalid tag value (<<<<TAG-VALUE>>>>) - the length cannot exceed 256\nUnicode characters in UTF-8.\n "} Cause Limitations on tag Key and Value ar...
Set executor log level
Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...
Auto termination is disabled when starting a job cluster
Problem You are trying to start a job cluster, but the job creation fails with an error message. Error creating job Cluster autotermination is currently disabled. Cause Job clusters auto terminate once the job is completed. As a result, they do not support explicit autotermination policies. If you include autotermination_minutes in your cluster poli...
How to overwrite log4j configurations on Databricks clusters
Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...
Apache Spark executor memory allocation
By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM) memory heap. This is controlled by the spark.executor.memory property. However, some unexpected behaviors were observed on instances with a large amount of memory allocated. As JVMs scale up in memory size, issues with the garbage collecto...
Configure a cluster to use a custom NTP server
By default, Databricks clusters use public NTP servers. This is sufficient for most use cases; however, you can configure a cluster to use a custom NTP server. This does not have to be a public NTP server. It can be a private NTP server under your control. A common use case is to minimize the amount of Internet traffic from your cluster. Update the NT...
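The excerpt above mentions updating the NTP configuration. As a hedged sketch of the usual approach (not the article's exact script), the notebook cell below writes a cluster-scoped init script that points the cluster at a custom server; the DBFS path and the hostname ntp.mycompany.internal are placeholders.
%python
# Hedged sketch: create an init script that points ntpd at a custom NTP server.
# The script path and "ntp.mycompany.internal" are placeholders; adjust for your environment.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/set-custom-ntp.sh",
    """#!/bin/bash
echo "server ntp.mycompany.internal" > /etc/ntp.conf
sudo service ntp restart
""",
    True,  # overwrite if the file already exists
)
Attach the script to the cluster as an init script and restart the cluster for the change to take effect.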
Enable GCM cipher suites
Databricks clusters using Databricks Runtime 9.1 LTS and below do not have GCM (Galois/Counter Mode) cipher suites enabled by default. You must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher suites. Info This article applies to clusters using Databricks Runtime 7.3 LTS and 9.1 LTS. Databricks Runti...
Enable retries in init script
Init scripts are commonly used to configure Databricks clusters. There are some scenarios where you may want to implement retries in an init script. Example init script This sample init script shows you how to implement a retry for a basic copy operation. You can use this sample code as a base for implementing retries in your own init script. %scala...
Cannot set a custom PYTHONPATH
Problem You are trying to set a custom PYTHONPATH environment variable in a cluster-scoped init script, but the values are overridden at driver startup. Cause Setting a custom PYTHONPATH in an init script does not work and is not supported. Additionally, you cannot set a custom PYTHONPATH when using Databricks Container Services. Solution You...
Run a custom Databricks runtime on your cluster
The majority of Databricks customers use production Databricks runtime releases (AWS | Azure | GCP) for their clusters. However, there may be certain times when you are asked to run a custom Databricks runtime after raising a support ticket. Warning Custom Databricks runtime images are created for specific, short-term fixes and edge cases. If a cust...
Cluster init script fails with mirror sync in progress error
Problem You are using a custom init script running at cluster start to install a custom library. It works most of the time, but you encounter intermittent failures when apt-get update runs in the init script. The failures return a Mirror sync in progress error message. Failed to fetch https://repos.<site>.com/zulu/deb/dists/stable/main/binary-a...
Pin cluster configurations using the API
Normally, cluster configurations are automatically deleted 30 days after the cluster was last terminated. If you want to keep specific cluster configurations, you can pin them. Up to 100 clusters can be pinned. Pinned clusters are not automatically deleted; however, they can be manually deleted. Info You must be a Databricks administrator to pin a cl...
Unpin cluster configurations using the API
Normally, cluster configurations are automatically deleted 30 days after the cluster was last terminated. If you want to keep specific cluster configurations, you can pin them. Up to 100 clusters can be pinned. If you no longer need a pinned cluster, you can unpin it. If you have pinned 100 clusters, you must unpin a cluster before you can pin anot...
Append to a DataFrame
To append to a DataFrame, use the union method. %scala val firstDF = spark.range(3).toDF("myCol") val newRow = Seq(20) val appended = firstDF.union(newRow.toDF()) display(appended) %python firstDF = spark.range(3).toDF("myCol") newRow = spark.createDataFrame([[20]]) appended = firstDF.union(newRow) display(appended)...
How to improve performance with bucketing
Bucketing is an optimization technique in Apache Spark SQL. Data is allocated among a specified number of buckets, according to values derived from one or more bucketing columns. Bucketing improves performance by shuffling and sorting data prior to downstream operations such as table joins. The tradeoff is the initial overhead due to shuffling and s...
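As a minimal sketch of the technique (table and column names are hypothetical), the snippet below writes a bucketed, sorted table so that a later join on user_id can avoid a full shuffle, provided both sides of the join are bucketed the same way on the join key.
%python
# Hedged sketch: persist a DataFrame as a bucketed, sorted table.
# "events" and "user_id" are hypothetical names; bucketBy requires saveAsTable.
df = spark.table("events")
(df.write
   .format("parquet")
   .bucketBy(32, "user_id")   # 32 buckets on the join/bucketing column
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("events_bucketed"))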
How to handle blob data contained in an XML file
If you log events in XML format, then every XML event is recorded as a base64 string. In order to run analytics on this data using Apache Spark, you need to use the spark_xml library and the BASE64DECODER API to transform the data for analysis. Problem You need to analyze base64-encoded strings from an XML-formatted log file using Spark. For example...
Simplify chained transformations
Sometimes you may need to perform multiple transformations on your DataFrame: %scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.DataFrame val testDf = (1 to 10).toDF("col") def func0(x: Int => Int, y: Int)(in: DataFrame): DataFrame = { in.filter('col > x(y)) } def func1(x: Int)(in: DataFrame): DataFrame = { in.sele...
Hive UDFs
This article shows how to create a Hive UDF, register it in Spark, and use it in a Spark SQL query. Here is a Hive UDF that takes a long as an argument and returns its hexadecimal representation. %scala import org.apache.hadoop.hive.ql.exec.UDF import org.apache.hadoop.io.LongWritable // This UDF takes a long integer and converts it to a hexadecimal...
Prevent duplicated columns when joining two DataFrames
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don’t have duplicated columns. Join on columns If you join on columns, you get duplicated columns. Scala %scala val llist...
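A minimal Python illustration of the approach: passing the join key as a list of column names (rather than an equality expression) keeps a single copy of the join column in the result.
%python
# Hedged sketch: join on a list of column names so "name" appears only once in the output.
df1 = spark.createDataFrame([("bob", 20), ("ann", 31)], ["name", "age"])
df2 = spark.createDataFrame([("bob", "NY"), ("ann", "CA")], ["name", "state"])

joined = df1.join(df2, ["name"])
joined.show()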
Revoke all user privileges
When user permissions are explicitly granted for individual tables and views, the selected user can access those tables and views even if they don’t have permission to access the underlying database. If you want to revoke a user’s access, you can do so with the REVOKE command. However, the REVOKE command is explicit, and is strictly scoped to the ob...
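Because REVOKE is scoped to a single object, one common pattern is to loop over the objects in a database. A rough sketch follows; the database name and user are placeholders, and the exact grammar assumes the legacy table access control model rather than Unity Catalog.
%python
# Hedged sketch: revoke a user's privileges on every table in a database, then on the database itself.
user = "`someone@example.com`"
database = "default"

for table in spark.catalog.listTables(database):
    spark.sql(f"REVOKE ALL PRIVILEGES ON TABLE {database}.{table.name} FROM {user}")

spark.sql(f"REVOKE ALL PRIVILEGES ON DATABASE {database} FROM {user}")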
How to list and delete files faster in Databricks
Scenario Suppose you need to delete a table that is partitioned by year, month, date, region, and service. However, the table is huge, and there will be around 1000 part files per partition. You can list all the files in each partition and then delete them using an Apache Spark job. For example, suppose you have a table that is partitioned by a, b, ...
How to handle corrupted Parquet files with different schema
Problem Let’s say you have a large list of essentially independent Parquet files, with a variety of different schemas. You want to read only those files that match a specific schema and skip the files that don’t match. One solution could be to read the files in sequence, identify the schema, and union the DataFrames together. However, this approach ...
No USAGE permission on database
Problem You are using a cluster running Databricks Runtime 7.3 LTS and above. You have enabled table access control for your workspace (AWS | Azure | GCP) as the admin user, and granted the SELECT privilege to a standard user-group that needs to access the tables. A user tries to access an object in the database and gets a SecurityException error me...
Nulls and empty strings in a partitioned column save as nulls
Problem If you save data containing both empty strings and null values in a column on which the table is partitioned, both values become null after writing and reading the table. To illustrate this, create a simple DataFrame: %scala import org.apache.spark.sql.types._ import org.apache.spark.sql.catalyst.encoders.RowEncoder val data = Seq(Row(1, "")...
Behavior of the randomSplit method
When using randomSplit on a DataFrame, you could potentially observe inconsistent behavior. Here is an example: %python df = spark.read.format('inconsistent_data_source').load() a,b = df.randomSplit([0.5, 0.5]) a.join(broadcast(b), on='id', how='inner').count() Typically this query returns 0. However, depending on the underlying data source or input...
Generate schema from case class
Spark provides an easy way to generate a schema from a Scala case class. For case class A, use the method ScalaReflection.schemaFor[A].dataType.asInstanceOf[StructType]. For example: %scala import org.apache.spark.sql.types.StructType import org.apache.spark.sql.catalyst.ScalaReflection case class A(key: String, time: java.sql.Timestamp, date: java....
How to specify skew hints in dataset and DataFrame-based join commands
When you perform a join command with DataFrame or Dataset objects, if you find that the query is stuck on finishing a small number of tasks due to data skew, you can specify the skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization (AWS | Azure | GCP) is performed on the DataFrame for which you specify the skew hint. In...
How to update nested columns
Spark doesn’t support adding new columns or dropping existing columns in nested structures. In particular, the withColumn and drop methods of the Dataset class don’t allow you to specify a column name different from any top level columns. For example, suppose you have a dataset with the following schema: %scala val schema = (new StructType) .a...
Incompatible schema in some files
Problem The Spark job fails with an exception like the following while reading Parquet files: Error in SQL statement: SparkException: Job aborted due to stage failure: Task 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0 (TID 868031, 10.111.245.219, executor 31): java.lang.UnsupportedOperationException: org.a...
Unable to infer schema for ORC error
Problem You are trying to read ORC files from a directory when you get an error message: org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually. Cause An Unable to infer schema for ORC error occurs when the schema is not defined and Apache Spark cannot infer the schema due to: An empty directory. Us...
User does not have permission SELECT on ANY File
Problem You are trying to create an external Hive table, but keep getting a User does not have permission SELECT on any file error message. java.lang.SecurityException: User does not have permission SELECT on any file. Table access control (AWS | Azure | GCP) is enabled on your cluster and you are not an admin. Cause The Databricks SQL query analyzer e...
Create tables on JSON datasets
In this article we cover how to create a table on JSON datasets using SerDe. Download the JSON SerDe JAR Open the hive-json-serde 1.3.8 download page. Click on json-serde-1.3.8-jar-with-dependencies.jar to download the file json-serde-1.3.8-jar-with-dependencies.jar. Info You can review the Hive-JSON-Serde GitHub repo for more information on the JAR...
Optimize read performance from JDBC data sources
Problem Reading data from an external JDBC database is slow. How can I improve read performance? Solution See the detailed discussion in the Databricks documentation on how to optimize performance when reading data (AWS | Azure | GCP) from an external JDBC database....
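As a hedged sketch of the kind of tuning the documentation describes, a partitioned JDBC read lets Spark fetch in parallel instead of through a single connection. The URL, credentials, table, and bounds below are placeholders.
%python
# Hedged sketch: parallel JDBC read. partitionColumn must be numeric, date, or timestamp,
# and lowerBound/upperBound should roughly cover its range.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/<database>")
      .option("dbtable", "public.orders")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("partitionColumn", "order_id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "16")
      .option("fetchsize", "10000")
      .load())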
Failure to detect encoding in JSON
Problem Spark job fails with an exception containing the message: Invalid UTF-32 character 0x1414141(above 10ffff) at char #1, byte #7) At org.apache.spark.sql.catalyst.json.JacksonParser.parse Cause The JSON data source reader is able to automatically detect encoding of input JSON files using BOM at the beginning of the files. However, BOM is not ...
Inconsistent timestamp results with JDBC applications
Problem When using JDBC applications with Databricks clusters you see inconsistent java.sql.Timestamp results when switching between standard time and daylight saving time. Cause Databricks clusters use UTC by default. java.sql.Timestamp uses the JVM’s local time zone. If a Databricks cluster returns 2021-07-12 21:43:08 as a string, the JVM parses i...
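One hedged mitigation (not necessarily the article's full solution) is to keep Spark SQL and the client JVM in the same time zone, for example UTC.
%python
# Hedged sketch: pin the Spark SQL session time zone to UTC. The JDBC client JVM should also be
# started with a matching zone (for example -Duser.timezone=UTC) so parsed Timestamps agree.
spark.conf.set("spark.sql.session.timeZone", "UTC")
print(spark.conf.get("spark.sql.session.timeZone"))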
Kafka client terminated with OffsetOutOfRangeException
Problem You have an Apache Spark application that is trying to fetch messages from an Apache Kafka source when it is terminated with a kafkashaded.org.apache.kafka.clients.consumer.OffsetOutOfRangeException error message. Cause Your Spark application is trying to fetch expired data offsets from Kafka. We generally see this in these two scenarios: Sc...
Recursive references in Avro schema are not allowed
Problem Apache Spark returns an error when trying to read from an Apache Avro data source if the Avro schema has a recursive reference. org.apache.spark.sql.avro.IncompatibleSchemaException: Found recursive reference in Avro schema, which can not be processed by Spark Cause Spark SQL does not support recursive references in an Avro data source becau...
SQL access control error when using Snowflake as a data source
Problem The Snowflake Connector for Spark is used to read data from, and write data to, Snowflake while working in Databricks. The connector makes Snowflake look like another Spark data source. When you try to query Snowflake, you get a SnowflakeSQLException error message. SnowflakeSQLException: SQL access control error: Insufficient privileges to ...
Null column values display as NaN
Problem You have a table with null values in some columns. When you query the table using a select statement in Databricks, the null values appear as null. When you query the table using the same select statement in Databricks SQL, the null values appear as NaN. %sql select * from default.<table-name> where <column-name> is null Databric...
Retrieve queries owned by a disabled user
When a Databricks SQL user is removed from an organization, the queries owned by the user remain, but they are only visible to those who already have permission to access them. A Databricks SQL admin can transfer ownership to other users, as well as delete alerts, dashboards, and queries owned by the disabled user account. Clone a query A Databricks...
Job timeout when connecting to a SQL endpoint over JDBC
Problem You have a job that is reading and writing to an SQL endpoint over a JDBC connection. The SQL warehouse fails to execute the job and you get a java.net.SocketTimeoutException: Read timed out error message. 2022/02/04 17:36:15 - TI_stg_trade.0 - Caused by: com.simba.spark.jdbc42.internal.apache.thrift.transport.TTransportException: java.net.S...
Slowness when fetching results in Databricks SQL
Problem Databricks SQL uses cloud fetch to increase query performance. This is done by default. Instead of using single threaded queries, cloud fetch retrieves data in parallel from cloud storage buckets (such as AWS S3 and Azure Data Lake Storage). Compared to a standard, single threaded fetch, you can see up to a 10X increase in performance using ...
Apache Spark session is null in DBConnect
Problem You are trying to run your code using Databricks Connect ( AWS | Azure | GCP ) when you get a sparkSession is null error message. java.lang.AssertionError: assertion failed: sparkSession is null while trying to executeCollectResult at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(...
Failed to create process error with Databricks CLI in Windows
Problem While trying to access the Databricks CLI (AWS | Azure | GCP) in Windows, you get a failed to create process error message. Cause This can happen: If multiple instances of the Databricks CLI are installed on the system. If the Python path on your Windows system includes a space. Info There is a known issue in pip which causes pip installed s...
GeoSpark undefined function error with DBConnect
Problem You are trying to use the GeoSpark function st_geomfromwkt with DBConnect (AWS | Azure | GCP) and you get an Apache Spark error message. Error: org.apache.spark.sql.AnalysisException: Undefined function: 'st_geomfromwkt'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; T...
Get Apache Spark config in DBConnect
You can always view the Spark configuration (AWS | Azure | GCP) for your cluster by reviewing the cluster details in the workspace. If you are using DBConnect (AWS | Azure | GCP) you may want to quickly review the current Spark configuration details without switching over to the workspace UI. This example code shows you how to get the current Spark ...
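A minimal sketch of what such example code typically looks like from a DBConnect session (some values may be redacted or unset depending on the cluster):
%python
# Hedged sketch: print the current Spark configuration key/value pairs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key} = {value}")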
ProtoSerializer stack overflow error in DBConnect
Problem You are using DBConnect (AWS | Azure | GCP) to run a PySpark transformation on a DataFrame with more than 100 columns when you get a stack overflow error. py4j.protocol.Py4JJavaError: An error occurred while calling o945.count. : java.lang.StackOverflowError at java.lang.Class.getEnclosingMethodInfo(Class.java:1072) at java.lang.Clas...
Use tcpdump to create pcap files
If you want to analyze the network traffic between nodes on a specific cluster, you can install tcpdump on the cluster and use it to dump the network packet details to pcap files. The pcap files can then be downloaded to a local machine for analysis. Create the tcpdump init script Run this sample script in a notebook on the cluster to create the ini...
Terraform registry does not have a provider error
Problem You are installing the Databricks Terraform provider ( AWS | Azure | GCP) and get a Databricks provider registry error. Error while installing hashicorp/databricks: provider registry registry.terraform.io does not have a provider named registry.terraform.io/hashicorp/databricks Cause This error occurs when the required_providers block is not...
Compare two versions of a Delta table
Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table. One common use case is to compare two versions of a Delta table in order to identify what changed. For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP). Identify all differences You can use a SQL SELEC...
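As a minimal sketch (the table name and version numbers are placeholders), an EXCEPT between two time-travel snapshots surfaces the rows that changed:
%python
# Hedged sketch: rows present in version 2 of a Delta table but not in version 1.
diff = spark.sql("""
  SELECT * FROM my_table VERSION AS OF 2
  EXCEPT
  SELECT * FROM my_table VERSION AS OF 1
""")
display(diff)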
Delta Merge cannot resolve nested field
Problem You are attempting a Delta Merge with automatic schema evolution, but it fails with a Delta Merge: cannot resolve 'field' due to data type mismatch error message. Cause This can happen if you have made changes to the nested column fields. For example, assume we have a column called Address with the fields streetName, houseNumber, and city ne...
How Delta cache behaves on an autoscaling cluster
This article is about how Delta cache (AWS | Azure | GCP) behaves on an auto-scaling cluster, which removes or adds nodes as needed. When a cluster downscales and terminates nodes: A Delta cache behaves in the same way as an RDD cache. Whenever a node goes down, all of the cached data in that particular node is lost. Delta cache data is not moved fr...
How to improve performance of Delta Lake MERGE INTO queries using partition pruning
This article explains how to trigger partition pruning in Delta Lake MERGE INTO (AWS | Azure | GCP) queries from Databricks. Partition pruning is an optimization technique to limit the number of partitions that are inspected by a query. Discussion MERGE INTO is an expensive operation when used with Delta tables. If you don’t partition the underlying...
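A hedged sketch of the idea: adding an explicit filter on the target's partition column to the ON clause lets Delta prune partitions instead of scanning the whole target table. Table and column names below are hypothetical.
%python
# Hedged sketch: MERGE with an explicit predicate on the target's partition column "date".
spark.sql("""
  MERGE INTO target t
  USING source s
  ON s.id = t.id
     AND t.date IN ('2021-01-01', '2021-01-02')
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")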
Best practices for dropping a managed Delta Lake table
Regardless of how you drop a managed table, it can take a significant amount of time, depending on the data size. Delta Lake managed tables in particular contain a lot of metadata in the form of transaction logs, and they can contain duplicate data files. If a Delta table has been in use for a long time, it can accumulate a very large amount of data...
How to populate or update columns in an existing Delta table
Problem You have an existing Delta table, with a few empty columns. You need to populate or update those columns with data from a raw Parquet file. Solution In this example, there is a customers table, which is an existing Delta table. It has an address column with missing values. The updated data exists in Parquet format. Create a DataFrame from th...
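A hedged sketch of the pattern (paths, table, key, and column names are hypothetical): read the raw Parquet file and MERGE it into the Delta table, updating only the columns that need to be populated.
%python
# Hedged sketch: fill the empty address column of an existing Delta table from a Parquet file.
from delta.tables import DeltaTable

updates = spark.read.parquet("/mnt/raw/customer_updates.parquet")
customers = DeltaTable.forName(spark, "customers")

(customers.alias("t")
   .merge(updates.alias("s"), "t.customer_id = s.customer_id")
   .whenMatchedUpdate(set={"address": "s.address"})
   .execute())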
Identify duplicate data on append operations
A common issue when performing append operations on Delta tables is duplicate data. For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table. In this article, we review basic troubleshooting steps that you can use to...
Optimize a Delta sink in a structured streaming application
You are using a Delta table as the sink for your structured streaming application and you want to optimize the Delta table so that queries are faster. If your structured streaming application has a very frequent trigger interval, it may not create sufficient files that are eligible for compaction in each microbatch. The autoOptimize operation compac...
Unable to cast string to varchar
Problem You are trying to cast a string type column to varchar but it isn’t working. Info The varchar data type (AWS | Azure | GCP) is available in Databricks Runtime 8.0 and above. Create a simple Delta table, with one column as type string.%sql CREATE OR REPLACE TABLE delta_table1 (`col1` string) USING DELTA; Use SHOW TABLE on the newly created ta...
Vacuuming with zero retention results in data loss
Problem You add data to a Delta table, but the data disappears without warning. There is no obvious error message. Cause This can happen when spark.databricks.delta.retentionDurationCheck.enabled is set to false and VACUUM is configured to retain 0 hours. %sql VACUUM <name-of-delta-table> RETAIN 0 HOURS OR %sql VACUUM delta.`<delta_table_pa...
Z-Ordering will be ineffective, not collecting stats
Problem You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns. AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns. Info Please review Z-Ordering (multi-dimensional clustering) (AWS | Azure | GCP) for more infor...
Change cluster config for Delta Live Table pipeline
Problem You are using Delta Live Tables and want to change the cluster configuration. You create a pipeline, but only have options to enable or disable Photon and select the number of workers. Cause When you create a Delta Live Table pipeline, most parameters are configured with default values. These values cannot be configured before the pipeline i...
Different tables with same data generate different plans when used in same query
Problem Assume you have two Delta tables test_table_1 and test_table_2. Both tables have the same schema, same data volume, same partitions, and contain the same number of files. You are doing a join transformation with another Delta table, test_table_join, which has a million records. When you run the below join queries using test_table_1 and test_...
Allow spaces and special characters in nested column names with Delta tables
Problem It is common for JSON files to contain nested struct columns. Nested column names in a JSON file can have spaces between the names. When you use Apache Spark to read or write JSON files with spaces in the nested column names, you get an AnalysisException error message. For example, if you try to read a JSON file, evaluate the DataFrame, and ...
Delta writing empty files when source is empty
Problem Delta writes can result in the creation of empty files if the source is empty. This can happen with a regular Delta write or a MERGE INTO (AWS | Azure | GCP) operation. If your streaming application is writing to a target Delta table and your source data is empty on certain micro batches, it can result in writing empty files to your target D...
Delta Live Tables pipelines are not running VACUUM automatically
Problem Delta Live Tables supports auto-vacuum by default. You set up a Delta Live Tables pipeline, but notice VACUUM is not running automatically. Cause A Delta Live Tables pipeline needs a separate maintenance cluster configuration (AWS | Azure | GCP) inside the pipeline settings to ensure VACUUM runs automatically. If the maintenance cluster is n...
VACUUM best practices on Delta Lake
Why use VACUUM on Delta Lake? VACUUM is used to clean up unused and stale data files that are taking up unnecessary storage space. Removing these files can help reduce storage costs. When you run VACUUM on a Delta table it removes the following files from the underlying file system: Any data files that are not maintained by Delta Lake Removes stale...
OPTIMIZE is only supported for Delta tables error on Delta Lake
Problem You run OPTIMIZE on a Delta table and get an error message saying it is only supported on Delta tables. Error: `<database-name>`.`<table-name>`is not a Delta table. OPTIMIZE is only supported for Delta tables. Cause This can happen if the target table's storage location was modified and the table was recreated with a new storage ...
Distinguish active and dead jobs
Problem On clusters where there are too many concurrent jobs, you often see some jobs stuck in the Spark UI without any progress. This complicates identifying which are the active jobs/stages versus the dead jobs/stages. Cause Whenever there are too many concurrent jobs running on a cluster, there is a chance that the Spark internal eventListenerBus...
Spark job fails with Driver is temporarily unavailable
Problem A Databricks notebook returns the following error: Driver is temporarily unavailable This issue can be intermittent or persistent. A related error message is: Lost connection to cluster. The notebook may have been detached. Cause One common cause for this error is that the driver is undergoing a memory bottleneck. When this happens, the driver cras...
How to delete all jobs using the REST API
Run the following commands to delete all jobs in a Databricks workspace. Identify the jobs to delete and list them in a text file:%sh curl -X GET -H "Authorization: Bearer <token>" https://<databricks-instance>/api/2.0/jobs/list | grep -o -P 'job_id.{0,6}' | awk -F':' '{print $2}' >> job_id.txt Run the curl command in a loop to delete the identif...
Job cluster limits on notebook output
Problem You are running a notebook on a job cluster and you get an error message indicating that the output is too large. The output of the notebook is too large. Cause: rpc response (of 20975548 bytes) exceeds limit of 20971520 bytes Cause This error message can occur in a job cluster whenever the notebook output is greater than 20 MB. If you are u...
Job fails, but Apache Spark tasks finish
Problem Your Databricks job reports a failed status, but all Spark jobs and tasks have successfully completed. Cause You have explicitly called spark.stop() or System.exit(0) in your code. If either of these are called, the Spark context is stopped, but the graceful shutdown and handshake with the Databricks job service does not happen. Solution Do ...
Job fails due to job rate limit
Problem A Databricks notebook or Jobs API request returns the following error: Error : {"error_code":"INVALID_STATE","message":"There were already 1000 jobs created in past 3600 seconds, exceeding rate limit: 1000 job creations per 3600 seconds."} Cause This error occurs because the number of jobs per hour exceeds the limit of 1000 established by Da...
Job fails with invalid access token
Problem Long running jobs, such as streaming jobs, fail after 48 hours when using dbutils.secrets.get() (AWS | Azure | GCP). For example: %python streamingInputDF1 = ( spark .readStream .format("delta") .table("default.delta_sorce") ) def writeIntodelta(batchDF, batchId): table_name = dbutil...
Task deserialization time is high
Problem Your tasks are running slower than expected. You review the stage details in the Spark UI on your cluster and see that task deserialization time is high. Cause Cluster-installed libraries (AWS | Azure | GCP) are only installed on the driver when the cluster is started. These libraries are only installed on the executors when the first tasks ...
Pass arguments to a notebook as a list
There is no direct way to pass arguments to a notebook as a dictionary or list. You can work around this limitation by serializing your list as a JSON file and then passing it as one argument. After passing the JSON file to the notebook, you can parse it with json.loads(). Instructions Define the argument list and convert it to a JSON file. Start by...
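A hedged variant of the pattern that passes the serialized list directly as the argument value (the notebook path and widget name are hypothetical):
%python
# Hedged sketch: the caller serializes the list; the child notebook parses it with json.loads().
import json

dates = ["2021-01-01", "2021-01-02", "2021-01-03"]
dbutils.notebook.run("/Shared/child_notebook", 600, {"dates": json.dumps(dates)})

# In the child notebook:
# dates = json.loads(dbutils.widgets.get("dates"))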
Uncommitted files causing data duplication
Problem You had a network issue (or similar) while a write operation was in progress. You are rerunning the job, but partially uncommitted files during the failed run are causing unwanted data duplication. Cause How Databricks commit protocol works: The DBIO commit protocol (AWS | Azure | GCP) is transactional. Files are only committed after a trans...
Multi-task workflows using incorrect parameter values
Problem Using key-value parameters in a multi-task workflow is a common use case. It is normal to have multiple tasks running in parallel, and each task can have different parameter values for the same key. These key-value parameters are read within the code and used by each task. For example, assume you have four tasks: task1, task2, task3, and task...
Job fails with Spark Shuffle FetchFailedException error
Problem If your application contains any aggregation or join stages, the execution will require a Spark Shuffle stage. Depending on the specific configuration used, if you are running multiple streaming queries on an interactive cluster you may get a shuffle FetchFailedException error. ShuffleMapStage has failed the maximum allowable number of times...
Users unable to view job results when using remote Git source
Problem You are running a job using notebooks that are stored in a remote Git repository (AWS | Azure | GCP). Databricks users with Can View permissions (who are not workspace admins or owners of the job) cannot access or view the results of ephemeral jobs submitted via dbutils.notebook.run() from the parent notebook. Cause When job visibility control ...
Single scheduled job tries to run multiple times
Problem You schedule a job (AWS | Azure | GCP) to run once per day, using Quartz Cron Syntax, but the job tries to run multiple times on the same day. Cause When the job was configured, it was scheduled by manually entering the cron syntax and a special character * was accidentally set for the seconds value. This tells the cron scheduler to run the ...
Cannot import TabularPrediction from AutoGluon
Problem You are trying to import TabularPrediction from AutoGluon, but are getting an error message. ImportError: cannot import name 'TabularPrediction' from 'autogluon' (unknown location) This happens when AutoGluon is installed via a notebook or as a cluster-installed library (AWS | Azure | GCP). You can reproduce the error by running the import c...
How to correctly update a Maven library in Databricks
Problem You make a minor update to a library in the repository, but you don’t want to change the version number because it is a small change for testing purposes. When you attach the library to your cluster again, your code changes are not included in the library. Cause One strength of Databricks is the ability to install third-party or custom libra...
Init script fails to download Maven JAR
Problem You have an init script that is attempting to install a library via Maven, but it fails when trying to download a JAR. https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.4.1/rapids-4-spark_2.12-0.4.1.jar%0D Resolving repo1.maven.org (repo1.maven.org)... 151.101.248.209 Connecting to repo1.maven.org (repo1.maven.org)|151.101.248....
Install package using previous CRAN snapshot
Problem You are trying to install a library package via CRAN, and are getting a Library installation failed for library due to infra fault error message. Library installation failed for library due to infra fault for Some(cran { package: "<name-of-package>" } ). Error messages: java.lang.RuntimeException: Installation failed with message: Erro...
Install PyGraphViz
PyGraphViz Python libraries are used to plot causal inference networks. If you try to install PyGraphViz as a standard library, it fails due to dependency errors. PyGraphViz has the following dependencies: python3-dev graphviz libgraphviz-dev pkg-config Install via notebook Install the dependencies with apt-get.%sh sudo apt-get install -y python3-de...
Install Turbodbc via init script
Turbodbc is a Python module that uses the ODBC interface to access relational databases. It has dependencies on libboost-all-dev, unixodbc-dev, and python-dev packages, which need to be installed in order. You can install these manually, or you can use an init script to automate the install. Create the init script Run this sample script in a noteboo...
Cannot uninstall library from UI
Problem Usually, libraries can be uninstalled in the Clusters UI. If the checkbox to select the library is disabled, then it’s not possible to uninstall the library from the UI. Cause If you create a library using REST API version 1.2 and if auto-attach is enabled, the library is installed on all clusters. In this scenario, the Clusters UI checkbox ...
Error when installing Cartopy on a cluster
Problem You are trying to install Cartopy on a cluster and you receive a ManagedLibraryInstallFailed error message. java.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, cartopy==0.17.0, --disable-pip-version-check) exited with code 1. ERROR: Command errored out ...
Error when installing pyodbc on a cluster
Problem One of the following errors occurs when you use pip to install the pyodbc library. java.lang.RuntimeException: Installation failed with message: Collecting pyodbc "Library installation is failing due to missing dependencies. sasl and thrift_sasl are optional dependencies for SASL or Kerberos support" Cause Although sasl and thrift_sasl are o...
Libraries fail with dependency exception
Problem You have a Python function that is defined in a custom egg or wheel file and also has dependencies that are satisfied by another custom package installed on the cluster. When you call this function, it returns an error that says the requirement cannot be satisfied. org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnv...
Reading .xlsx files with xlrd fails
Problem You have xlrd installed on your cluster and are attempting to read files in the Excel .xlsx format when you get an error. XLRDError: Excel xlsx file; not supported Cause xlrd 2.0.0 and above can only read .xls files. Support for .xlsx files was removed from xlrd due to a potential security vulnerability. Solution Use openpyxl to open .xl...
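A minimal sketch of the openpyxl route via pandas (the file path is a placeholder; openpyxl must be installed on the cluster or as a notebook-scoped library):
%python
# Hedged sketch: read an .xlsx file with the openpyxl engine instead of xlrd.
import pandas as pd

pdf = pd.read_excel("/dbfs/FileStore/data/report.xlsx", engine="openpyxl")
df = spark.createDataFrame(pdf)
display(df)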
Remove Log4j 1.x JMSAppender and SocketServer classes from classpath
Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. Databricks also does not use the affected classes from Log4j 1.x with known vul...
Replace a default library jar
Databricks includes a number of default Java and Scala libraries. You can replace any of these libraries with another version by using a cluster-scoped init script to remove the default library jar and then install the version you require. Warning Removing default libraries and installing new versions may cause instability or completely break your D...
Python command fails with AssertionError: wrong color format
Problem You run a Python notebook and it fails with an AssertionError: wrong color format message. An example stack trace: File "/local_disk0/tmp/1599775649524-0/PythonShell.py", line 39, in <module> from IPython.nbconvert.filters.ansi import ansi2html File "<frozen importlib._bootstrap>", line 983, in _find_and_load File "<...
PyPMML fails with Could not find py4j jar error
Problem PyPMML is a Python PMML scoring library. After installing PyPMML in a Databricks cluster, it fails with a Py4JError: Could not find py4j jar error. %python from pypmml import Model modelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml') Error : Py4JError: Could not find py4j jar at Cause This error occurs due to a dependency on the defa...
TensorFlow fails to import
Problem You have TensorFlow installed on your cluster. When you try to import TensorFlow, it fails with an Invalid Syntax or import error. Cause The version of protobuf installed on your cluster is not compatible with your version of TensorFlow. Solution Use a cluster-scoped init script to install TensorFlow with matching versions of NumPy and proto...
Verify the version of Log4j on your cluster
Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. If you are using Log4j within your cluster (for example, if you are processing ...
Apache Spark jobs fail with Environment directory not found error
Problem After you install a Python library (via the cluster UI or by using pip), your Apache Spark jobs fail with an Environment directory not found error message. org.apache.spark.SparkException: Environment directory not found at /local_disk0/.ephemeral_nfs/cluster_libraries/python Cause Libraries are installed on a Network File System (NFS) on th...
Use Databricks Repos with Docker container services
Introduction Depending on your use case, you may want to use both Docker Container Services (DCS) and Databricks Repos (AWS | Azure | GCP) at the same time. DCS does not work with Databricks Repos by default, however you can use a custom init script to use both. If you have not installed an init script to configure DCS with Databricks Repos you may ...
Copy installed libraries from one cluster to another
If you have a highly customized Databricks cluster, you may want to duplicate it and use it for other projects. When you clone a cluster, only the Apache Spark configuration and other cluster configuration information is copied. Installed libraries are not copied by default. To copy the installed libraries, you can run a Python script after cloning ...
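A hedged sketch of such a script using the Libraries API (the workspace URL, token, and cluster IDs are placeholders; failed or already-installed libraries may need extra filtering):
%python
# Hedged sketch: read the source cluster's installed libraries and install them on the target cluster.
import requests

host = "https://<databricks-instance>"
headers = {"Authorization": "Bearer <token>"}

status = requests.get(
    f"{host}/api/2.0/libraries/cluster-status",
    headers=headers,
    params={"cluster_id": "<source-cluster-id>"},
).json()

libraries = [entry["library"] for entry in status.get("library_statuses", [])]

requests.post(
    f"{host}/api/2.0/libraries/install",
    headers=headers,
    json={"cluster_id": "<target-cluster-id>", "libraries": libraries},
)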
Conda fails to download packages from Anaconda
Problem You are attempting to download packages from the Anaconda repository and get a PackagesNotFoundError error message. This error can occur when using %conda, or %sh conda in notebooks, and when using Conda in an init script. Cause Anaconda Inc. updated the terms of service for repo.anaconda.com and anaconda.org/anaconda. Based on the Anaconda ...
Download artifacts from MLflow
By default, the MLflow client saves artifacts to an artifact store URI during an experiment. The artifact store URI is similar to /dbfs/databricks/mlflow-tracking/<experiment-id>/<run-id>/artifacts/. This artifact store is an MLflow-managed location, so you cannot download artifacts directly. You must use client.download_artifacts in the ...
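A minimal sketch of the client call (the run ID and artifact path are placeholders):
%python
# Hedged sketch: download a run's artifacts from the MLflow-managed store to a local directory.
import os
from mlflow.tracking import MlflowClient

client = MlflowClient()
local_dir = "/tmp/mlflow_artifacts"
os.makedirs(local_dir, exist_ok=True)

local_path = client.download_artifacts("<run-id>", "model", dst_path=local_dir)
print(f"Artifacts downloaded to: {local_path}")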
How to extract feature information for tree-based Apache SparkML pipeline models
When you are fitting a tree-based model, such as a decision tree, random forest, or gradient boosted tree, it is helpful to be able to review the feature importance levels along with the feature names. Typically models in SparkML are fit as the last stage of the pipeline. To extract the relevant feature information from the pipeline with the tree mo...
Fitting an Apache SparkML model throws error
Problem Databricks throws an error when fitting a SparkML model or Pipeline: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 4 times, most recent failure: Lost task 0.3 in stage 162.0 (TID 168, 10.205.250.130, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfu...
H2O.ai Sparkling Water cluster not reachable
Problem You are trying to initialize H2O.ai’s Sparkling Water on Databricks Runtime 7.0 and above when you get an H2OClusterNotReachableException error message. %scala import ai.h2o.sparkling._ val h2oContext = H2OContext.getOrCreate() ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster X.X.X.X:54321 - sparkling-water-ro...
How to perform group K-fold cross validation with Apache Spark
Cross validation randomly splits the training data into a specified number of folds. To prevent data leakage where the same data shows up in multiple folds you can use groups. scikit-learn supports group K-fold cross validation to ensure that the folds are distinct and non-overlapping. On Spark you can use the spark-sklearn library, which distribute...
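For reference, a minimal single-node sketch of group K-fold with scikit-learn (the data here is synthetic); the distributed variant described above layers the same splitting logic on top of Spark.
%python
# Hedged sketch: GroupKFold keeps all rows of a group in either the train or the test fold, never both.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])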
MLflow project fails to access an Apache Hive table
Problem You have an MLflow project that fails to access a Hive table and returns a Table or view not found error. pyspark.sql.utils.AnalysisException: "Table or view not found: `default`.`tab1`; line 1 pos 21;\n'Aggregate [unresolvedalias(count(1), None)]\n+- 'UnresolvedRelation `default`.`tab1`\n" xxxxx ERROR mlflow.cli: === Run (ID 'xxxxx') failed...
How to speed up cross-validation
Hyperparameter tuning of Apache SparkML models takes a very long time, depending on the size of the parameter grid. You can improve the performance of the cross-validation step in SparkML to speed things up: Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multi...
Hyperopt fails with maxNumConcurrentTasks error
Problem You are tuning machine learning parameters using Hyperopt when your job fails with a py4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist error. You are using a Databricks Runtime for Machine Learning (Databricks Runtime ML) cluster. Cause Databricks Runtime ML has a compatible version of Hyperopt pre-installed (AWS | Azure | ...
Incorrect results when using documents as inputs
Problem You have an ML model that takes documents as inputs, specifically, an array of strings. You use a feature extractor like TfidfVectorizer to convert the documents to an array of strings and ingest the array into the model. The model is trained, and predictions happen in the notebook, but model serving doesn’t return the expected results for JS...
Experiment warning when custom artifact storage location is used
Problem When you create an MLflow experiment with a custom artifact location, you get the following warning: Cause MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. MLflow cannot guarantee the enforcement of access controls on ar...
Experiment warning when legacy artifact storage location is used
Problem A new icon appears on the MLflow Experiments page with the following open access warning: Cause MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. In MLflow 1.11 and above, new experiments store artifacts in an MLflow-mana...
KNN model using pyfunc returns ModuleNotFoundError or FileNotFoundError
Problem You have created a Sklearn model using KNeighborsClassifier and are using pyfunc to run a prediction. For example: %python import mlflow.pyfunc pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type='string') predicted_df = merge.withColumn("prediction", pyfunc_udf(*merge.columns[1:])) predicted_df.collect() The predict...
OSError when accessing MLflow experiment artifacts
Problem You get an OSError: No such file or directory error message when trying to download or log artifacts using one of the following: MlflowClient.download_artifacts() mlflow.[flavor].log_model() mlflow.[flavor].load_model() mlflow.log_artifacts() OSError: No such file or directory: '/dbfs/databricks/mlflow-tracking/<experiment-id>/<run-...
PERMISSION_DENIED error when accessing MLflow experiment artifact
Problem You get a PERMISSION_DENIED error when trying to access an MLflow artifact using the MLflow client. RestException: PERMISSION_DENIED: User <user> does not have permission to 'View' experiment with id <experiment-id> or RestException: PERMISSION_DENIED: User <user> does not have permission to 'Edit' experiment with id <ex...
Runs are not nested when SparkTrials is enabled in Hyperopt
Problem SparkTrials is an extension of Hyperopt, which allows runs to be distributed to Spark workers. When you start an MLflow run with nested=True in the worker function, the results are supposed to be nested under the parent run. Sometimes the results are not correctly nested under the parent run, even though you ran SparkTrials with nested=True ...
Autoscaling is slow with an external metastore
Problem You have an external metastore configured on your cluster and autoscaling is enabled, but the cluster is not autoscaling effectively. Cause You are copying the metastore jars to every executor, when they are only needed in the driver. It takes time to initialize and run the jars every time a new executor spins up. As a result, adding more ex...
Data too long for column error
Problem You are trying to insert a struct into a table, but you get a java.sql.SQLException: Data too long for column error. Caused by: java.sql.SQLException: Data too long for column 'TYPE_NAME' at row 1 Query is: INSERT INTO COLUMNS_V2 (CD_ID,COMMENT,`COLUMN_NAME`,TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?) , parameters [103182,<null>,'address...
Drop database without deletion
By default, the DROP DATABASE (AWS | Azure | GCP) command drops the database and deletes the directory associated with the database from the file system. Sometimes you may want to drop the database, but keep the underlying database directory intact. Example code You can use this example code to drop the database without dropping the underlying stora...
Error in CREATE TABLE with external Hive metastore
Problem You are connecting to an external MySQL metastore and attempting to create a table when you get an error. AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:An exception was thrown while adding/validating class(es) : (conn=21) Column length too big for column 'PARAM_VALUE' (max = 16383); use BLOB or TE...
Japanese character support in external metastore
Problem You are trying to use Japanese characters in your tables, but keep getting errors. Create a table with the OPTIONS keyword OPTIONS provides extra metadata to the table. You try creating a table with OPTIONS and specify the charset as utf8mb4. %sql CREATE TABLE default.JPN_COLUMN_NAMES('作成年月' string ,'計上年月' string ,'所属コード' string ,'生保代理店コード_8...
Parquet timestamp requires Hive metastore 1.2 or above
Problem You are trying to create a Parquet table using TIMESTAMP, but you get an error message. Error in SQL statement: QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Parquet does not support timestamp. See HIVE-6384 Example code %sql CREATE EXTERN...
JSON reader parses values as null
Problem You are attempting to read a JSON file. You know the file has data in it, but the Apache Spark JSON reader is returning a null value. Example code You can use this example code to reproduce the problem. Create a test JSON file in DBFS.%python dbutils.fs.rm("dbfs:/tmp/json/parse_test.txt") dbutils.fs.put("dbfs:/tmp/json/parse_test.txt", """ {...
display() does not show microseconds correctly
Problem You want to display a timestamp value with microsecond precision, but when you use display() it does not show the value past milliseconds. For example, this Apache Spark SQL display() command: %sql display(spark.sql("select cast('2021-08-10T09:08:56.740436' as timestamp) as test")) Returns a truncated value: 2021-08-10T09:08:56.740+0000 Caus...
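One hedged workaround is to cast the timestamp to a string, which preserves the full microsecond precision in the rendered output:
%python
# Hedged sketch: casting to string keeps the microseconds that display() otherwise truncates.
df = spark.sql(
    "SELECT CAST(CAST('2021-08-10T09:08:56.740436' AS TIMESTAMP) AS STRING) AS test"
)
display(df)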
Error: Received command c on object id p0
Problem You have imported Python libraries, but when you try to execute Python code in a notebook you get a repeating message as output. INFO:py4j.java_gateway:Received command c on object id p0 INFO:py4j.java_gateway:Received command c on object id p0 INFO:py4j.java_gateway:Received command c on object id p0 INFO:py4j.java_gateway:Received command ...
Failure when accessing or mounting storage
Problem You are trying to access an existing mount point, or create a new mount point, and it fails with an error message. Invalid Mount Exception:The backend could not get tokens for path /mnt. Cause The root mount path (/mnt) is also mounted to a storage location. You can verify that something is mounted to the root path by listing all mount point...
Item was too large to export
Problem You are trying to export notebooks using the workspace UI and are getting an error message. This item was too large to export. Try exporting smaller or fewer items. Cause The notebook files are larger than 10 MB in size. Solution The simplest solution is to limit the size of the notebook or folder that you are trying to download to 10 MB or ...
Update job permissions for multiple users
When you are running jobs, you might want to update user permissions for multiple users. You can do this by using the Databricks job permissions API (AWS | Azure | GCP) and a bit of Python code. Instructions Copy the example code into a notebook. Enter the <job-id> (or multiple job ids) into the array arr[]. Enter your payload{}. In this examp...
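A hedged sketch of the pattern (the workspace URL, token, job IDs, user, and permission level are placeholders; PATCH adds to existing permissions rather than replacing them):
%python
# Hedged sketch: grant CAN_MANAGE_RUN on several jobs to one user via the Permissions API.
import requests

host = "https://<databricks-instance>"
headers = {"Authorization": "Bearer <token>"}
job_ids = ["<job-id-1>", "<job-id-2>"]

payload = {
    "access_control_list": [
        {"user_name": "user@example.com", "permission_level": "CAN_MANAGE_RUN"}
    ]
}

for job_id in job_ids:
    resp = requests.patch(f"{host}/api/2.0/permissions/jobs/{job_id}", headers=headers, json=payload)
    print(job_id, resp.status_code)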
Generate browser HAR files
When troubleshooting UI issues, it is sometimes necessary to obtain additional information about the network requests that are generated in your browser. If this is needed, our support team will ask you to generate a HAR file. This article describes how to generate a HAR file with each of the major web browsers. Warning HAR files contain sensitive d...
Recover deleted notebooks from the Trash
This applies to the workspace UI and is independent of the Databricks Runtime version.
Get workspace configuration details
This article explains how to display the complete configuration details for your Databricks workspace. This can be useful if you want to review the configuration settings and services that are enabled in your workspace. For example, you can use the workspace configuration details to quickly see if Unity Catalog or Identity Federation is enabled on y...
Iterate through all jobs in the workspace using Jobs API 2.1
In the Databricks Jobs API 2.0 (AWS | Azure | GCP), the list command returns an unbounded number of job descriptions. In the Jobs API 2.1 (AWS | Azure | GCP), this behavior has changed. The list command now returns a maximum of 25 jobs at a time, from newest to oldest. In this article we show you how to manually iterate through all of the jobs in your workspace...
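A hedged sketch of the manual iteration, assuming the 2.1 list endpoint accepts limit and offset parameters and returns a has_more flag; the workspace URL and token are placeholders:
%python
import requests

host = "https://<databricks-instance>"
headers = {"Authorization": "Bearer <personal-access-token>"}

all_jobs = []
offset = 0
while True:
    page = requests.get(
        f"{host}/api/2.1/jobs/list",
        headers=headers,
        params={"limit": 25, "offset": offset},
    ).json()
    all_jobs.extend(page.get("jobs", []))
    if not page.get("has_more"):
        break
    offset += 25

print(f"Collected {len(all_jobs)} job descriptions")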
Too many execution contexts are open right now
Problem You see the following error message when you try to attach a notebook to a cluster, or when a job fails. Run result unavailable: job failed with error message Too many execution contexts are open right now.(Limit set currently to 150) Cause Databricks creates an execution context when you attach a notebook to a cluster. The execution cont...
SSL exception when connecting to GCP secret manager
Info This article applies to clusters using Databricks Runtime 7.3 LTS and 9.1 LTS. Problem Secrets stored in the GCP secret manager service can be retrieved using the google-cloud-secret-manager client library. Your code may fail with an SSLHandshakeException error message on Databricks Runtime 9.1 LTS and below. Sample code: import com.google.clo...
Append output is not supported without a watermark
Problem You are performing an aggregation using append mode and an exception error message is returned. Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark Cause You cannot use append mode on an aggregated DataFrame without a watermark. This is by design. Solution You must apply a...
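A minimal sketch of the fix: declaring a watermark before the windowed aggregation makes append mode legal, because Spark can then determine when a window is final. The rate source, window sizes, and paths are illustrative.
%python
from pyspark.sql.functions import window

events = spark.readStream.format("rate").load()  # stand-in streaming source

agg = (events
       .withWatermark("timestamp", "10 minutes")   # required for append mode
       .groupBy(window("timestamp", "5 minutes"))
       .count())

query = (agg.writeStream
         .outputMode("append")
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/append-demo")
         .start("/tmp/output/append-demo"))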
Apache Spark DStream is not supported
Problem You are attempting to use a Spark Discretized Stream (DStream) in a Databricks streaming job, but the job is failing. Cause DStreams and the DStream API are not supported by Databricks. Solution Instead of using Spark DStream, you should migrate to Structured Streaming. Review the Databricks Structured Streaming in production (AWS | Azure | ...
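A minimal Structured Streaming sketch of the migration, replacing a DStream-based Kafka consumer; the broker, topic, and output paths are placeholders:
%python
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "<broker>:9092")
          .option("subscribe", "<topic>")
          .load())

query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/kafka-demo")
         .start("/tmp/output/kafka-demo"))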
Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
When you stream data into a file sink, you should always change both checkpoint and output directories together. Otherwise, you can get failures or unexpected outputs. Apache Spark creates a folder inside the output directory named _spark_metadata. This folder contains write-ahead logs for every batch run. This is how Spark gets exactly-once guarant...
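A minimal sketch of the practice described above: keep the checkpoint location and the output directory declared side by side so they are always changed together. The paths and the rate source are illustrative.
%python
df = spark.readStream.format("rate").load()       # stand-in streaming source

checkpoint_dir = "/tmp/checkpoints/file-sink-v2"  # change this ...
output_dir = "/tmp/output/file-sink-v2"           # ... and this together

query = (df.writeStream
         .format("parquet")
         .option("checkpointLocation", checkpoint_dir)
         .start(output_dir))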
Get the path of files consumed by Auto Loader
When you process streaming files with Auto Loader (AWS | Azure | GCP), events are logged based on the files created in the underlying storage. This article shows you how to add the file path for every filename to a new column in the output DataFrame. One use case for this is auditing. When files are ingested to a partitioned folder structure there i...
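A hedged sketch of adding the source path as a column with input_file_name(); the cloudFiles format and paths are placeholders:
%python
from pyspark.sql.functions import input_file_name

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/mnt/<source-path>/")
      .withColumn("source_file", input_file_name()))  # full path of each input file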
How to restart a structured streaming query from last written offset
Scenario You have a stream, running a windowed aggregation query, that reads from Apache Kafka and writes files in Append mode. You want to upgrade the application and restart the query with the offset equal to the last written offset. You want to discard all state information that hasn’t been written to the sink, start processing from the earliest ...
Kafka error: No resolvable bootstrap urls
Problem You are trying to read or write data to a Kafka stream when you get an error message. kafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer Caused by: kafkashaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers If you are running a notebook, the error me...
readStream() is not whitelisted error when running a query
Problem You have table access control (AWS | Azure | GCP) enabled on your cluster. You are trying to run a structured streaming query and get an error message. py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.s...
Checkpoint files not being deleted when using display()
Problem You have a streaming job using display() to display DataFrames. %scala val streamingDF = spark.readStream.schema(schema).parquet(<input_path>) display(streamingDF) Checkpoint files are being created, but are not being deleted. You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder. Ch...
Checkpoint files not being deleted when using foreachBatch()
Problem You have a streaming job using foreachBatch() to process DataFrames. %scala streamingDF.writeStream.outputMode("append").foreachBatch { (batchDF: DataFrame, batchId: Long) => batchDF.write.format("parquet").mode("overwrite").save(output_directory) }.start() Checkpoint files are being created, but are not being deleted. You can verify th...
Conflicting directory structures error
Problem You have an Apache Spark job that is failing with a Java assertion error java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Example stack trace Caused by: org.apache.spark.sql.streaming.StreamingQueryException: There was an error when trying to infer the partition schema of the current batch of files. Plea...
RocksDB fails to acquire a lock
Problem You are trying to use RocksDB as a state store for your structured streaming application, when you get an error message saying that the instance could not be acquired. Caused by: java.lang.IllegalStateException: RocksDB instance could not be acquired by [ThreadId: 742, task: 140.3 in stage 3152, TID 553193] as it was not released by [ThreadI...
Streaming job gets stuck writing to checkpoint
Problem You are monitoring a streaming job, and notice that it appears to get stuck when processing data. When you review the logs, you discover the job gets stuck when writing data to a checkpoint. INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/st...
Explicit path to data or a defined schema required for Auto Loader
Info This article applies to Databricks Runtime 9.1 LTS and above. Problem You are using Auto Loader to ingest data for your ELT pipeline when you get an IllegalArgumentException: Please provide the source directory path with option `path` error message. You get this error when you start an Auto Loader job, if either the path to the data or the data...
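A minimal sketch showing both requirements satisfied: an explicit schema and an explicit source directory passed to load(). The format, schema, and path are illustrative.
%python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .schema("id INT, name STRING")        # explicit schema ...
      .load("/mnt/<source-path>/"))         # ... and an explicit source path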
Optimize streaming transactions with .trigger
When running a structured streaming application that uses cloud storage buckets (S3, ADLS Gen2, etc.) it is easy to incur excessive transactions as you access the storage bucket. Failing to specify a .trigger option in your streaming code is one common reason for a high number of storage transactions. When a .trigger option is not specified, the sto...
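A hedged sketch: an explicit processing-time trigger makes the stream poll the bucket on a fixed cadence instead of continuously, which caps the number of storage transactions. The interval is an example value to tune; the rate source stands in for your streaming DataFrame.
%python
df = spark.readStream.format("rate").load()    # stand-in streaming source

query = (df.writeStream
         .format("delta")
         .trigger(processingTime="5 minutes")  # poll on a fixed cadence
         .option("checkpointLocation", "/tmp/checkpoints/trigger-demo")
         .start("/tmp/output/trigger-demo"))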
Structured streaming jobs slow down on every 10th batch
Problem You are running a series of structured streaming jobs and writing to a file sink. Every 10th run appears to run slower than the previous jobs. Cause The file sink creates a _spark_metadata folder in the target path. This metadata folder stores information about each batch, including which files are part of the batch. This is required to prov...
Get last modification time for all files in Auto Loader and batch jobs
You are running a streaming job with Auto Loader (AWS | Azure | GCP) and want to get the last modification time for each file from the storage account. Instructions The Get the path of files consumed by Auto Loader article describes how to get the filenames and paths for all files consumed by the Auto Loader. In this article, we build on that founda...
AttributeError: ‘function’ object has no attribute
Problem You are selecting columns from a DataFrame and you get an error message. ERROR: AttributeError: 'function' object has no attribute '_get_object_id' in job Cause The DataFrame API contains a small number of protected keywords. If a column in your DataFrame uses a protected keyword as the column name, you will get an error message. For example...
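A minimal sketch of the workaround: use bracket notation or col() instead of attribute access when a column name collides with a DataFrame method such as count.
%python
from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 10)], ["id", "count"])

# df.select(df.count)           # fails: df.count is the method, not the column
df.select(df["count"]).show()   # bracket notation resolves the column
df.select(col("count")).show()  # col() works as well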
Convert Python datetime object to string
There are multiple ways to display date and time values with Python, however not all of them are easy to read. For example, when you collect a timestamp column from a DataFrame and save it as a Python variable, the value is stored as a datetime object. If you are not familiar with the datetime object format, it is not as easy to read as the common Y...
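A minimal sketch: strftime() renders the datetime object in the familiar year-month-day layout; the sample value stands in for a timestamp collected from a DataFrame.
%python
from datetime import datetime

ts = datetime(2021, 8, 10, 9, 8, 56)      # stand-in for a collected value
print(ts.strftime("%Y-%m-%d %H:%M:%S"))   # 2021-08-10 09:08:56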
Display file and directory timestamp details
In this article we show you how to display detailed timestamps, including the date and time when a file was created or modified. Use ls command The simplest way to display file timestamps is to use the ls -lt <path> command in a bash shell. For example, this sample command displays basic timestamps for files and directories in the /dbfs/ folde...
Reading large DBFS-mounted files using Python APIs
This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs. Problem If you mount a folder onto dbfs:// and read a file larger than 2GB in a Python API like pandas, you will see the following error: /databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextRead...
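A hedged sketch of one workaround: copy the file from the DBFS mount to local driver storage with dbutils, then read it with pandas from the local path. The file names are placeholders.
%python
import pandas as pd

# Copy from the DBFS mount to the driver's local disk, bypassing the mount limit.
dbutils.fs.cp("dbfs:/mnt/<path-to>/large-file.csv", "file:/tmp/large-file.csv")

pdf = pd.read_csv("/tmp/large-file.csv")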
How to import a custom CA certificate
When working with Python, you may want to import a custom CA certificate to avoid connection errors to your endpoints. ConnectionError: HTTPSConnectionPool(host='my_server_endpoint', port=443): Max retries exceeded with url: /endpoint (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb73dc3b3d0>: Failed t...
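A hedged sketch: appending the custom CA certificate to the bundle used by certifi (which requests relies on by default) lets HTTPS calls to the endpoint validate. The certificate path is a placeholder.
%python
import certifi

ca_bundle = certifi.where()                      # path to the active CA bundle
with open("/dbfs/<path-to>/custom-ca.crt") as custom_cert:
    with open(ca_bundle, "a") as bundle:
        bundle.write("\n" + custom_cert.read())  # append the custom certificate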
Job remains idle before starting
Problem You have an Apache Spark job that is triggered correctly, but remains idle for a long time before starting. You have a Spark job that ran well for a while, but goes idle for a long time before resuming. Symptoms include: Cluster downscales to the minimum number of worker nodes during idle time. Driver logs don’t show any Spark jobs during idl...
List all workspace objects
You can use the Databricks Workspace API (AWS | Azure | GCP) to recursively list all workspace objects under a given path. Common use cases for this include: Indexing all notebook names and types for all users in your workspace. Use the output, in conjunction with other API calls, to delete unused workspaces or to manage notebooks. Dynamically get t...
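A hedged sketch of the recursive walk over the workspace list endpoint; the workspace URL, token, and starting path are placeholders:
%python
import requests

host = "https://<databricks-instance>"
headers = {"Authorization": "Bearer <personal-access-token>"}

def list_objects(path):
    resp = requests.get(f"{host}/api/2.0/workspace/list",
                        headers=headers, params={"path": path})
    resp.raise_for_status()
    for obj in resp.json().get("objects", []):
        if obj["object_type"] == "DIRECTORY":
            yield from list_objects(obj["path"])   # recurse into folders
        else:
            yield obj

for obj in list_objects("/Users"):
    print(obj["object_type"], obj["path"])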
Python commands fail on high concurrency clusters
Problem You are attempting to run Python commands on a high concurrency cluster. All Python commands fail with a WARN error message. WARN PythonDriverWrapper: Failed to start repl ReplId-61bef-9fc33-1f8f6-2 ExitCodeException exitCode=1: chown: invalid user: ‘spark-9fcdf4d2-045d-4f3b-9293-0f’ Cause Both spark.databricks.pyspark.enableProcessIsolation...
Cluster cancels Python command execution after installing Bokeh
Problem The cluster returns Cancelled in a Python notebook. Inspect the driver log (std.err) in the Cluster Configuration page for a stack trace and error message similar to the following: log4j:WARN No appenders could be found for logger (com.databricks.conf.trusted.ProjectConf$). log4j:WARN Please initialize the log4j system properly. log4j:WARN S...
Cluster cancels Python command execution due to library conflict
Problem The cluster returns Cancelled in a Python notebook. Notebooks in all other languages execute successfully on the same cluster. Cause When you install a conflicting version of a library, such as ipython, ipywidgets, numpy, scipy, or pandas, to the PYTHONPATH, the Python REPL can break, causing all commands to return Cancelled after 30 sec...
Python command execution fails with AttributeError
This article can help you resolve scenarios in which Python command execution fails with an AttributeError. Problem: 'tuple' object has no attribute 'type' When you run a notebook, Python command execution fails with the following error and stack trace: AttributeError: 'tuple' object has no attribute 'type' Traceback (most recent call last): File "/...
Job fails with Java IndexOutOfBoundsException error
Problem Your job fails with a Java IndexOutOfBoundsException error message: java.lang.IndexOutOfBoundsException: index: 0, length: <number> (expected: range(0, 0)) When you review the stack trace you see something similar to this: Py4JJavaError: An error occurred while calling o617.count. : org.apache.spark.SparkException: Job aborted due to s...
Change version of R (r-base)
These instructions describe how to install a different version of R (r-base) on a cluster. You can check the default r-base version installed with each Databricks Runtime version in the System environment section of the corresponding Databricks Runtime release note (AWS | Azure | GCP). List available r-base-core versions To list the versions of r-base-co...
Fix the version of R packages
When you use the install.packages() function to install CRAN packages, you cannot specify the version of the package; the expectation is that you install the latest version of the package and that it is compatible with the latest versions of its dependencies. If you have an outdated dependency installed, it is updated as well. Som...
How to parallelize R code with gapply
Parallelizing R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, existing R code that runs locally is converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed com...
How to parallelize R code with spark.lapply
Parallelizing R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, existing R code that runs locally is converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed com...
Install rJava and RJDBC libraries
This article explains how to install the rJava and RJDBC libraries. Problem When you install rJava and RJDBC libraries with the following command in a notebook cell: %r install.packages(c("rJava", "RJDBC")) You observe the following error: ERROR: configuration failed for package 'rJava' Cause The rJava and RJDBC packages check for Java dependencies and ...
Rendering an R markdown file containing sparklyr code fails
Problem After you install and configure RStudio in the Databricks environment, when you launch RStudio and click the Knit button to knit a Markdown file that contains code to initialize a sparklyr context, rendering fails with the following error: failed to start sparklyr backend:object 'DATABRICKS_GUID' not found Calls: <Anonymous>… tryCatch ...
Resolving package or namespace loading error
This article explains how to resolve a package or namespace loading error. Problem When you install and load some libraries in a notebook cell, like: %r library(BreakoutDetection) You may get a package or namespace error: Loading required package: BreakoutDetection: Error : package or namespace load failed for ‘BreakoutDetection’ in loadNamespace(i,...
RStudio server backend connection error
Problem You get a backend connection error when using RStudio server. Error in Sys.setenv(EXISTING_SPARKR_BACKEND_PORT = system(paste0("wget -qO - 'http://localhost:6061/?type=\"com.databricks.backend.common.rpc.DriverMessages$StartRStudioSparkRBackend\"' --post-data='{\"@class\":\"com.databricks.backend.common.rpc.DriverMessages$StartRStudioSparkRB...
Verify R packages installed via init script
When you configure R packages to install via an init script, a package install can fail if its dependencies are not installed. You can use R commands in a notebook to check that all of the packages installed correctly. Info This article requires you to provide a list of packages to check against. List installed packages Make a ...
Apache Spark UI is not in sync with job
Problem The status of your Spark jobs is not correctly shown in the Spark UI (AWS | Azure | GCP). Some of the jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank. When you review the driver logs, you see an AsyncEventQueue warning. Logs ===== 20/12/23 21:20:26 ...
Apache Spark job fails with Parquet column cannot be converted error
Problem You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted error message. The cluster is running Databricks Runtime 7.3 LTS or above. org.apache.spark.SparkException: Task failed while writing rows. Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://buc...
Best practice for cache(), count(), and take()
cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or RDD in the memory of your cluster’s workers. Since cache() is a transformation, the caching operation takes place only when a Spark action (for example, count(),...
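A minimal sketch of the pattern: cache the DataFrame, then call an eager action such as count() so the cache is materialized before later actions reuse it.
%python
df = spark.range(1000000)

df.cache()    # transformation: nothing is cached yet
df.count()    # action: materializes the cache on the workers
df.take(10)   # subsequent actions read from the cached data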
Cannot import timestamp_millis or unix_millis
Problem You are trying to import timestamp_millis or unix_millis into a Scala notebook, but get an error message. %scala import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} error: value timestamp_millis is not a member of object org.apache.spark.sql.functions import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} Cau...
Cannot modify the value of an Apache Spark config
Problem You are trying to SET the value of a Spark config in a notebook and get a Cannot modify the value of a Spark config error. For example: %sql SET spark.serializer=org.apache.spark.serializer.KryoSerializer Error in SQL statement: AnalysisException: Cannot modify the value of a Spark config: spark.serializer; Cause The SET command does not wor...
Convert nested JSON to a flattened DataFrame
This article shows you how to flatten nested JSON, using only $"column.*" and explode methods. Sample JSON file Pass the sample JSON string to the reader. %scala val json =""" { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": ...
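A hedged Python equivalent of the approach, using the same sample document: select the scalar fields, explode the nested array, then expand the struct with "column.*".
%python
from pyspark.sql.functions import col, explode

json = '{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]}}'
df = spark.read.json(spark.sparkContext.parallelize([json]))

flat = (df.select("id", "type", "name", "ppu",
                  explode("batters.batter").alias("batter"))      # one row per array element
          .select("id", "type", "name", "ppu", col("batter.*")))  # expand the struct
flat.show()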
Create a DataFrame from a JSON string or Python dictionary
In this article we are going to review how you can create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary. Create a Spark DataFrame from a JSON string Add the JSON content from the variable to a list.%scala import scala.collection.mutable.ListBuffer val json_content1 = "{'json_col1': 'hello', 'json_col2': 32...
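A minimal Python sketch of both cases: a JSON string parsed through the JSON reader, and a Python dictionary passed straight to createDataFrame. The column names and values are illustrative.
%python
import json

json_string = '{"json_col1": "hello", "json_col2": 32}'

# From a JSON string: parallelize it and hand it to the JSON reader.
df_from_string = spark.read.json(spark.sparkContext.parallelize([json_string]))
df_from_string.show()

# From a Python dictionary: pass it directly to createDataFrame.
py_dict = json.loads(json_string)
df_from_dict = spark.createDataFrame([py_dict])
df_from_dict.show()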
Decimal$DecimalIsFractional assertion error
Problem You are running a job on Databricks Runtime 7.x or above when you get a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional error message. Example stack trace: java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional while compiling: <notebook> during phase: globalPhase=terminal, enteringPhase=j...
from_json returns null in Apache Spark 3.0
Problem The from_json function is used to parse a JSON string and return a struct of values. For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return. %python from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(c...
Manage the size of Delta tables
Delta tables are different from traditional tables. Delta tables include ACID transactions and time travel features, which means they maintain transaction logs and stale data files. These additional features require storage space. In this article we discuss recommendations that can help you manage the size of your Delta tables. Enable file system ve...
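One common lever (a hedged example, not necessarily the article's full list of recommendations) is VACUUM, which removes data files that are no longer referenced by the transaction log and are older than the retention window; the table name and retention period are illustrative.
%python
# Remove unreferenced data files older than 7 days (168 hours).
spark.sql("VACUUM default.my_delta_table RETAIN 168 HOURS")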
Select files using a pattern match
When selecting files, a common requirement is to only read specific files from a folder. For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression. This article uses examp...
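A minimal sketch: a glob pattern in the load path selects only the matching files, here JSON files from the first nine days of any month of 2021 (the folder layout is illustrative):
%python
# * matches any folder name, [1-9] matches a single character in the range.
df = spark.read.format("json").load("/mnt/logs/2021/*/day=0[1-9]/*.json")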
Job fails with ExecutorLostFailure due to “Out of memory” error
Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...
Job fails with ExecutorLostFailure because executor is busy
Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...
Understanding speculative execution
Speculative execution Speculative execution can be used to automatically re-attempt a task that is not making progress compared to other tasks in the same stage. If one or more tasks in a stage are running slower than the others, they are re-launched. The attempt that completes first is marked as successful; the other attempt is killed. Implementatio...
Use custom classes and objects in a schema
Problem You are trying to create a dataset using a schema that contains Scala enumeration fields (classes and objects). When you run your code in a notebook cell, you get a ClassNotFoundException error. Sample code %scala object TestEnum extends Enumeration { type TestEnum = Value val E1, E2, E3 = Value } import spark.implicits._ import TestEnum._ c...
Date functions only accept int values in Apache Spark 3.0
Problem You are attempting to use the date_add() or date_sub() functions in Spark 3.0, but they are returning an Error in SQL statement: AnalysisException error message. In Spark 2.4 and below, both functions work as normal. %sql select date_add(cast('1964-05-23' as date), '12.34') Cause You are attempting to use a fractional or string value as the ...
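A minimal sketch of the Spark 3 requirement: pass an integer number of days rather than a fractional or string value.
%python
# Works in Spark 3.0: the second argument is an integer.
spark.sql("SELECT date_add(CAST('1964-05-23' AS DATE), 12) AS d").show()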
Duplicate columns in the metadata error
Problem Your Apache Spark job is processing a Delta table when the job fails with an error message. org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata update: col1, col2... Cause There are duplicate column names in the Delta table. Column names that differ only by case are considered duplicate. Delta Lake is case prese...
Generate unique increasing numeric values
This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three different methods to use. You should select the method that works best with your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD) The zipWithIndex() function is only available within RDDs. You cannot...
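A minimal sketch of the first method: zipWithIndex() on the underlying RDD, then rebuild the DataFrame with the index as an extra column.
%python
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

indexed = (df.rdd.zipWithIndex()                     # (Row, index) pairs
             .map(lambda pair: (pair[1], *pair[0]))  # index first, then row values
             .toDF(["index"] + df.columns))
indexed.show()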
Error in SQL statement: AnalysisException: Table or view not found
Problem When you try to query a table or view, you get this error: AnalysisException:Table or view not found when trying to query a global temp view Cause You typically create global temp views so they can be accessed from different sessions and kept alive until the application ends. You can create a global temp view with the following statement: %s...
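A minimal sketch: global temp views are registered in the global_temp database, so queries must reference them as global_temp.<view-name>.
%python
df = spark.range(3)
df.createOrReplaceGlobalTempView("my_global_view")

# Qualifying the view with global_temp avoids the "table or view not found" error.
spark.sql("SELECT * FROM global_temp.my_global_view").show()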
Error when downloading full results after join
Problem You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error. Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/ Reproduce error Create two tables.%python from pyspark.sql.functions impo...
Error when running MSCK REPAIR TABLE in parallel
Problem You are trying to run MSCK REPAIR TABLE <table-name> commands for the same table in parallel and are getting java.net.SocketTimeoutException: Read timed out or out of memory error messages. Cause When you try to add a large number of new partitions to a table with MSCK REPAIR in parallel, the Hive metastore becomes a limiting factor, a...
Find the size of a table
This article explains how to find the size of a table. The command used depends on whether you are trying to find the size of a Delta table or a non-Delta table. Size of a Delta table To find the size of a Delta table, you can use an Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...
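A hedged Python alternative to the Scala snippet: DESCRIBE DETAIL on a Delta table reports sizeInBytes for the current snapshot; the table name is a placeholder.
%python
detail = spark.sql("DESCRIBE DETAIL default.my_delta_table")
display(detail.select("sizeInBytes"))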
Inner join drops records in result
Problem You perform an inner join, but the resulting joined table is missing data. For example, assume you have two tables, orders and models. %python df_orders = spark.createDataFrame([('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima',''), ('Nissan','Altima', None)], ["Company", "Model", "Info"]...
Data is incorrect when read from Snowflake
Problem You have a job that is using Apache Spark to read from a Snowflake table, but the time data that appears in the DataFrame is incorrect. If you run the same query directly on Snowflake, the correct time data is returned. Cause The time zone value was not set correctly. A mismatch between the time zone value of the Databricks cluster and Snowf...
JDBC write fails with a PrimaryKeyViolation error
Problem You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error. Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables. Cause When Apache Spark performs a JDBC write, one par...
Query does not skip header row on external table
Problem You are attempting to query an external Hive table, but it keeps failing to skip the header row, even though TBLPROPERTIES ('skip.header.line.count'='1') is set in the HiveContext. You can reproduce the issue by creating a table with this sample code. %sql CREATE EXTERNAL TABLE school_test_score ( `school` varchar(254), `student_id` varc...
SHOW DATABASES command returns unexpected column name
Problem You are using the SHOW DATABASES command and it returns an unexpected column name. Cause The column name returned by the SHOW DATABASES command changed in Databricks Runtime 7.0. Databricks Runtime 6.4 Extended Support and below: SHOW DATABASES returns namespace as the column name. Databricks Runtime 7.0 and above: SHOW DATABASES returns dat...
Cannot view table SerDe properties
Problem You are trying to view the SerDe properties on an Apache Hive table, but SHOW CREATE TABLE just returns the Apache Spark DDL. It does not show the SerDe properties. For example, given this sample code: %sql SHOW CREATE TABLE <table-identifier> You get a result that does not show the SerDe properties: Cause You are using Databricks Runt...
Parsing post meridiem time (PM) with to_timestamp() returns null
Problem You are trying to parse a 12-hour (AM/PM) time value with to_timestamp(), but instead of returning a 24-hour time value it returns null. For example, this sample code: %sql SELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd HH:mm:ss a'); Returns null when run: Cause to_timestamp() requires the hour format to be in lowercase. If the ho...
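A minimal sketch of the corrected pattern: lowercase hh is the 12-hour clock field, which pairs with the AM/PM marker a.
%python
spark.sql(
    "SELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd hh:mm:ss a') AS ts"
).show(truncate=False)   # returns 2016-12-31 22:12:00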
to_json() results in Cannot use null as map key error
Problem You are using to_json() to convert data to JSON and you get a Cannot use null as map key error: RuntimeException: Cannot use null as map key. Cause The to_json() function does not support null values as input map keys. This example code causes the Cannot use null as map key error when run, because of the null value used as a map key in...
Set nullability when using SaveAsTable with Delta tables
When creating a Delta table with saveAsTable, the nullability of columns defaults to true (columns can contain null values). This is expected behavior. In some cases, you may want to create a Delta table with the nullability of columns set to false (columns cannot contain null values). Instructions Use the CREATE TABLE command to create the table an...
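A hedged sketch of the instruction: declare the columns NOT NULL in a CREATE TABLE statement, then write into that table rather than letting saveAsTable define it. The table, columns, and sample data are illustrative.
%python
spark.sql("""
  CREATE TABLE IF NOT EXISTS default.events (
    id   BIGINT NOT NULL,
    name STRING NOT NULL
  ) USING DELTA
""")

df = spark.createDataFrame([(1, "a")], "id BIGINT, name STRING")
df.write.format("delta").mode("append").saveAsTable("default.events")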
Ensure consistency in statistics functions between Spark 3.0 and Spark 3.1 and above
Problem The statistics functions covar_samp, kurtosis, skewness, std, stddev, stddev_samp, variance, and var_samp return NaN when a divide by zero occurs during expression evaluation in Databricks Runtime 7.3 LTS. The same functions return null in Databricks Runtime 9.1 LTS and above, as well as in Databricks SQL endpoints, when a divide by zero occur...
Using datetime values in Spark 3.0 and above
Problem You are migrating jobs from unsupported clusters running Databricks Runtime 6.6 and below with Apache Spark 2.4.5 and below to clusters running a current version of the Databricks Runtime. If your jobs and/or notebooks process date conversions, they may fail with a SparkUpgradeException error message after running them on upgraded clusters. ...
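One hedged compatibility option when upgraded jobs fail on date conversions: Spark 3 provides a legacy time parser policy that restores the Spark 2.4 parsing behavior for the current session.
%python
# Restore Spark 2.4 date/time parsing behavior for the current session.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")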
ANSI compliant DECIMAL precision and scale
Problem You are trying to cast a value of one or greater as a DECIMAL using equal values for both precision and scale. A null value is returned instead of the expected value. This sample code: %sql SELECT CAST (5.345 AS DECIMAL(20,20)) Returns: Cause The DECIMAL type (AWS | Azure | GCP) is declared as DECIMAL(precision, scale), where precision and s...
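A minimal sketch of the underlying arithmetic: DECIMAL(20,20) reserves all 20 digits for the fraction, so any value of one or greater cannot be represented; leaving a digit before the decimal point returns the expected value.
%python
spark.sql("SELECT CAST(5.345 AS DECIMAL(20,20)) AS no_room").show()    # NULL
spark.sql("SELECT CAST(5.345 AS DECIMAL(21,20)) AS with_room").show()  # 5.345...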