SSO server redirects to original URL, not to vanity Databricks URL
Problem When you log into Databricks using a vanity URL (such as mycompany.cloud.databricks.com ), you are redirected to a single sign-on (SSO) server for authentication. When that server redirects you back to the Databricks website, the URL changes from the vanity URL to the original deployment URL (such as dbc-XXXX.cloud.databricks.com ). This can...
SSO SAML authentication error with PingFederate
Problem When using PingFederate to authenticate over a SSO connection with Databricks, the redirection fails with the following error: 19/12/21 01:27:01 ERROR SamlAuthenticator[root=ServiceMain-6c710d1c1fca0002 parent=ServiceMain-6c710d1c1fca0002 op=HttpServer-6c710d1c1fdf2812]: SAML login failed unexpectedly java.lang.IllegalArgumentException: com....
Find your workspace ID
Everything you do in Databricks occurs within a workspace. When you use the web UI you are interacting with clusters and notebooks in the workspace. When you run automated jobs or connect to your workspace outside of the web UI you may need to know your workspace ID. This article covers two different ways to easily find your workspace ID. Instructio...
Failed to add user error due to email or username already existing with a different case
Problem You are trying to add a new user to your Databricks workspace via the Admin Console, but it fails with an error saying that the email or username already exists with a different case. Failed to add user: A user with email <email-address> and username <username> in difference cases already exists in the account. You may also get t...
AWS services fail with No region provided error
Problem Your code snippets that use AWS services fail with a java.lang.IllegalArgumentException: No region provided error in Databricks Runtime 7.0 and above. The same code worked in Databricks Runtime 6.6 and below. You can verify the issue by running the example code snippet in a notebook. In Databricks Runtime 7.0 and above, it will return the ex...
Troubleshooting Amazon Redshift connection problems
Problem You created a VPC peering connection and configured an Amazon Redshift cluster in the peer network. When you attempt to access the Redshift cluster, you get the following error: Error message: OperationalError: could not connect to server: Connection timed out Cause This problem can occur if: VPC peering is misconfigured. The corresponding p...
Vulnerability scan shows vulnerabilities in Databricks EC2 instances
Problem The Corporate Information Security (CIS) Vulnerability Management team identifies vulnerabilities in AWS instances that are traced to EC2 instances created by Databricks (worker AMI). Cause The Databricks security team addresses all critical vulnerabilities and updates the core and worker AMIs on a regular basis. However, if there are long-r...
Configure custom DNS settings using dnsmasq
dnsmasq is a tool for installing and configuring DNS routing rules for cluster nodes. You can use it to set up routing between your Databricks environment and your on-premises network. Warning If you use your own DNS server and it goes down, you will experience an outage and will not be able to create clusters. Use the following cluster-scoped init s...
Unable to load AWS credentials
Problem When you try to access AWS resources like S3, SQS or Redshift, the operation fails with the error: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [BasicAWSCredentialsProvider: Access key or secret key is null, com.amazonaws.auth.InstanceProfileCredentialsProvider@a590007a: The requested metad...
Access denied when writing logs to an S3 bucket
Problem When you try to write log files to an S3 bucket, you get the error: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 2F8D8A07CD8817EA), S3 Extended Request ID: Cause The DBFS mount is in an S3 bucket that assumes roles and uses sse-kms encryption. Th...
S3 part number must be between 1 and 10000 inclusive
Problem When you copy a large file from the local file system to DBFS on S3, the following exception can occur: Amazon.S3.AmazonS3Exception: Part number must be an integer between 1 and 10000, inclusive Cause This is an S3 limit on segment count. Part files can only be numbered from 1 to 10000, inclusive. Solution To prevent this exception from occu...
How to analyze user interface performance issues
Problem The Databricks user interface seems to be running slowly. Cause User interface performance issues typically occur due to network latency or a database query taking more time than expected. In order to troubleshoot this type of problem, you need to collect network logs and analyze them to see which network traffic is affected. In most cases, ...
Unable to mount Azure Data Lake Storage Gen1 account
Problem When you try to mount an Azure Data Lake Storage (ADLS) Gen1 account on Databricks, it fails with the error: com.microsoft.azure.datalake.store.ADLException: Error creating directory / Error fetching access token Operation null failed with exception java.io.IOException : Server returned HTTP response code: 401 for URL: https://login.windows....
Failed credential validation checks error with Terraform
Problem You are using Terraform to deploy a workspace in AWS and you get a Failed credential validation checks error message. │ Error: MALFORMED_REQUEST: Failed credential validation checks: please use a valid cross account IAM role with permissions setup correctly │ │ with databricks_mws_credentials.this, │ on cross-account-role.tf line 29, in ...
Deployment name prefix not defined in Terraform
Problem You are using Terraform to deploy a workspace in AWS and you get a Deployment name cannot be used until a deployment name prefix is defined. error message. │ Error: MALFORMED_REQUEST: Failed parameter validation checks: Deployment name cannot be used until a deployment name prefix is defined. Please contact your Databricks representative │ │...
Configure Simba ODBC driver with a proxy in Windows
In this article you learn how to configure the Databricks ODBC Driver when your local Windows machine is behind a proxy server. Download the Simba driver for Windows Download and install the latest version of the Databricks ODBC Driver for Windows. Add proxy settings to the Windows registry Open the Windows registry and add the proxy settings to the...
Troubleshooting JDBC and ODBC connections
This article provides information to help you troubleshoot the connection between ...
Enable OpenJSSE and TLS 1.3
Queries and transformations are encrypted before being sent to your clusters. By default, the data exchanged between worker nodes in a cluster is not encrypted. If you require that data be encrypted at all times, you can encrypt traffic between cluster worker nodes using AES 128 over a TLS 1.2 connection. In some cases, you may want to use TLS 1.3 i...
How to calculate the number of cores in a cluster
You can view the number of cores in a Databricks cluster in the Workspace UI using the Metrics tab on the cluster details page. Note Azure Databricks cluster nodes must have a metrics service installed. If the driver and executors are of the same node type, you can also determine the number of cores available in a cluster programmatically, using Sca...
Install a private PyPI repo
Certain use cases may require you to install libraries from private PyPI repositories. If you are installing from a public repository, you should review the library documentation. This article shows you how to configure an example init script that authenticates and downloads a PyPI library from a private repository. Create init script Create (or ver...
IP access list update returns INVALID_STATE
Problem You are trying to update an IP access list and you get an INVALID_STATE error message. {"error_code":"INVALID_STATE","message":"Your current IP 3.3.3.3 will not be allowed to access the workspace under current configuration"} Cause The IP access list update that you are trying to commit does not include your current public IP address. If you...
Launch fails with Client.InternalError
Problem You deploy a new E2 workspace, but you get cluster launch failures with the message Client.InternalError. Cause You have encryption of the EBS volumes at the AWS account level or you are using a custom KMS key for EBS encryption. Either one of these scenarios can result in a Client.InternalError when you try to create a cluster in an E2 work...
Cannot apply updated cluster policy
Problem You are attempting to update an existing cluster policy, however the update does not apply to the cluster associated with the policy. If you attempt to edit a cluster that is managed by a policy, the changes are not applied or saved. Cause This is a known issue that is being addressed. Solution You can use a workaround until a permanent fix ...
Cluster Apache Spark configuration not applied
Problem Your cluster’s Spark configuration values are not applied. Cause This happens when the Spark config values are declared in the cluster configuration as well as in an init script. When Spark config values are located in more than one place, the configuration in the init script takes precedence and the cluster ignores the configuration setting...
Cluster failed to launch
This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs. Cluster timeout Error messages: Driver failed to start in time INTERNAL_ERROR: The Spark driver failed to start within 300 seconds Cluster failed to be healthy within 200 seconds Cau...
Custom Docker image requires root
Problem You are trying to launch a Databricks cluster with a custom Docker container, but cluster creation fails with an error. { "reason": { "code": "CONTAINER_LAUNCH_FAILURE", "type": "SERVICE_FAULT", "parameters": { "instance_id": "i-xxxxxxx", "databricks_error_message": "Failed to launch spark container on instance i-xxxx. Exception: Could not a...
Job fails due to cluster manager core instance request limit
Problem A Databricks Notebook or Job API returns the following error: Unexpected failure while creating the cluster for the job. Cause REQUEST_LIMIT_EXCEEDED: Your request was rejected due to API rate limit. Please retry your request later, or choose a larger node type instead. Cause The error indicates the Cluster Manager Service core instance requ...
Admin user cannot restart cluster to run job
Problem When a user who has permission to start a cluster, such as a Databricks Admin user, submits a job that is owned by a different user, the job fails with the following message: Message: Run executed on existing cluster ID <cluster id> failed because of insufficient permissions. The error received from the cluster manager was: 'You are no...
Cluster fails to start with dummy does not exist error
Problem You try to start a cluster, but it fails to start. You get an Apache Spark error message. Internal error message: Spark error: Driver down You review the cluster driver and worker logs and see an error message containing java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist. 21/07/14 21:44:06 ERROR DriverDaemon$: X...
Cluster slowdown due to Ganglia metrics filling root partition
Note This article applies to Databricks Runtime 7.3 LTS and below. Problem Clusters start slowing down and may show a combination of the following symptoms: Unhealthy cluster events are reported: Request timed out. Driver is temporarily unavailable. Metastore is down. DBFS is down. You do not see any high GC events or memory utilization associated w...
Failed to create cluster with invalid tag value
Problem You are trying to create a cluster, but it is failing with an invalid tag value error message. System.Exception: Content={"error_code":"INVALID_PARAMETER_VALUE","message":"\nInvalid tag value (<<<<TAG-VALUE>>>>) - the length cannot exceed 256\nUnicode characters in UTF-8.\n "} Cause Limitations on tag Key and Value ar...
Failed to expand the EBS volume
Problem Databricks jobs fail, due to a lack of space on the disk, even though storage auto-scaling is enabled. When you review the cluster event log, you see a message stating that the instance failed to expand disk due to an authorization error. Instance i-xxxxxxxxx failed to expand disk because: You are not authorized to perform this operation. En...
EBS leaked volumes
Problem After a cluster is terminated on AWS, some EBS volumes are not deleted automatically. These stray, unattached EBS volumes are often referred to as “leaked” volumes. Cause Databricks always sets DeletionOnTermination=true for the EBS volumes it creates when it launches clusters. Therefore, whenever a cluster instance is terminated, AWS should...
Log delivery fails with AssumeRole
Problem You are using AssumeRole to send cluster logs to an S3 bucket in another account and you get an access denied error. Cause AssumeRole does not allow you to send cluster logs to an S3 bucket in another account. This is because the log daemon runs on the host machine. It does not run inside the container. Only items that run inside the container...
Multi-part upload failure
Problem You observe a job failure with the exception: com.amazonaws.SdkClientException: Unable to complete multi-part upload. Individual part upload failed : Unable to execute HTTP request: Timeout waiting for connection from pool org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool ... com.amazonaws.http.Ama...
Persist Apache Spark CSV metrics to a DBFS location
Spark has a configurable metrics system that supports a number of sinks, including CSV files. In this article, we are going to show you how to configure a Databricks cluster to use a CSV sink and persist those metrics to a DBFS location. Create an init script All of the configuration is done in an init script. The init script does the following thre...
Replay Apache Spark events in a cluster
The Spark UI is commonly used as a debugging tool for Spark jobs. If the Spark UI is inaccessible, you can load the event logs in another cluster and use the Event Log Replay notebook to replay the Spark events. Warning Cluster log delivery is not enabled by default. You must enable cluster log delivery before starting your cluster, otherwise there ...
S3 connection fails with "No role specified and no roles available"
Problem You are using Databricks Utilities (dbutils) to access an S3 bucket, but it fails with a No role specified and no roles available error. You have confirmed that the instance profile associated with the cluster has the permissions needed to access the S3 bucket. Unable to load AWS credentials from any provider in the chain: [com.databricks.bac...
Set Apache Hadoop core-site.xml properties
You have a scenario that requires Apache Hadoop properties to be set. You would normally do this in the core-site.xml file. In this article, we explain how you can set core-site.xml in a cluster. Create the core-site.xml file in DBFS You need to create a core-site.xml file and save it to DBFS on your cluster. An easy way to create this file is via a...
Set executor log level
Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...
Set instance_profile_arn as optional with a cluster policy
In this article, we review the steps to create a cluster policy for the AWS attribute instance_profile_arn and define it as optional. This allows you to start a cluster with a specific AWS instance profile. You can also start a cluster without an instance profile. Note You must be an admin user in order to manage cluster policies. Create a new clust...
Apache Spark job doesn’t start
Problem No Spark jobs start, and the driver logs contain the following error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources Cause This error can occur when the executor memory and number of executor cores are set explicitly on the Spark Config tab. Here is a samp...
Auto termination is disabled when starting a job cluster
Problem You are trying to start a job cluster, but the job creation fails with an error message. Error creating job Cluster autotermination is currently disabled. Cause Job clusters auto terminate once the job is completed. As a result, they do not support explicit autotermination policies. If you include autotermination_minutes in your cluster poli...
Unexpected cluster termination
Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation. Databricks ini...
How to configure single-core executors to run JNI libraries
When you create a cluster, Databricks launches one Apache Spark executor instance per worker node, and the executor uses all of the cores on the node. In certain situations, such as if you want to run non-thread-safe JNI libraries, you might need an executor that has only one core or task slot, and does not attempt to run concurrent tasks. In this c...
How to overwrite log4j configurations on Databricks clusters
Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...
Apache Spark executor memory allocation
By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM) memory heap. This is controlled by the spark.executor.memory property. However, some unexpected behaviors were observed on instances with a large amount of memory allocated. As JVMs scale up in memory size, issues with the garbage collecto...
Apache Spark UI shows less than total node memory
Problem The Executors tab in the Spark UI shows less memory than is actually available on the node: AWS An m4.xlarge instance (16 GB ram, 4 core) for the driver node, shows 4.5 GB memory on the Executors tab. An m4.large instance (8 GB ram, 2 core) for the driver node, shows 710 MB memory on the Executors tab: Azure An F8s instance (16 GB, 4 core) f...
Configure a cluster to use a custom NTP server
By default Databricks clusters use public NTP servers. This is sufficient for most use cases, however you can configure a cluster to use a custom NTP server. This does not have to be a public NTP server. It can be a private NTP server under your control. A common use case is to minimize the amount of Internet traffic from your cluster. Update the NT...
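For reference, here is a minimal sketch of that setup as a notebook cell that writes a cluster-scoped init script with dbutils.fs.put. The DBFS path, the server name ntp.example.internal, and the ntp service name are assumptions you would replace with your own values.
%python
# Sketch only: create a cluster-scoped init script that points nodes at a custom NTP server.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/set-ntp.sh",   # placeholder location for the init script
    """#!/bin/bash
# Replace the default NTP configuration with a custom (possibly private) server.
echo "server ntp.example.internal iburst" > /etc/ntp.conf
# Restart the time service so the new configuration takes effect.
service ntp restart || true
""",
    True)
Attach the script as a cluster-scoped init script in the cluster configuration and restart the cluster for it to take effect.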
Enable GCM cipher suites
Databricks clusters using Databricks Runtime 9.1 LTS and below do not have GCM (Galois/Counter Mode) cipher suites enabled by default. You must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher suites. Info This article applies to clusters using Databricks Runtime 7.3 LTS and 9.1 LTS. Databricks Runti...
Enable retries in init script
Init scripts are commonly used to configure Databricks clusters. There are some scenarios where you may want to implement retries in an init script. Example init script This sample init script shows you how to implement a retry for a basic copy operation. You can use this sample code as a base for implementing retries in your own init script. %scala...
Cannot read audit logs due to duplicate columns
Problem You are trying to read audit logs and get an AnalysisException: Found duplicate column(s) error. spark.read.format("json").load("dbfs://mnt/logs/<path-to-logs>/date=2021-12-07") // AnalysisException: Found duplicate column(s) in the data schema: `<some_column>` Cause From November 2021 to December 2021, a limited number of Data...
Cannot set a custom PYTHONPATH
Problem You are trying to set a custom PYTHONPATH environment variable in a cluster-scoped init script, but the values are overridden at driver startup. Cause Setting a custom PYTHONPATH in an init script does not work and is not supported. Additionally, you cannot set a custom PYTHONPATH when using Databricks Container Services. Solution You...
Use iptables to access the EC2 metadata server
Problem You are trying to access the metadata server for your EC2 instance but cannot connect. Cause This is the intended default behavior; it is functioning as designed. Solution Use an init script to apply a custom iptables configuration to your Databricks cluster which enables access to the metadata server. Define a location to store the init sc...
Run a custom Databricks runtime on your cluster
The majority of Databricks customers use production Databricks runtime releases (AWS | Azure | GCP) for their clusters. However, there may be certain times when you are asked to run a custom Databricks runtime after raising a support ticket. Warning Custom Databricks runtime images are created for specific, short-term fixes and edge cases. If a cust...
Cluster init script fails with mirror sync in progress error
Problem You are using a custom init script running at cluster start to install a custom library. It works most of the time, but you encounter intermittent failures when apt-get update runs in the init script. The failures return a Mirror sync in progress error message. Failed to fetch https://repos.<site>.com/zulu/deb/dists/stable/main/binary-a...
Use audit logs to identify who deleted a cluster
By default, all-purpose cluster configurations are deleted 30 days after the cluster was last terminated. It is possible to keep a cluster configuration for longer than 30 days if an administrator pins the cluster. In either situation, it is possible for an administrator to manually delete a cluster configuration at any time. If you try to run a job...
Pin cluster configurations using the API
Normally, cluster configurations are automatically deleted 30 days after the cluster was last terminated. If you want to keep specific cluster configurations, you can pin them. Up to 100 clusters can be pinned. Pinned clusters are not automatically deleted, however they can be manually deleted. Info You must be a Databricks administrator to pin a cl...
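For illustration, a short Python sketch of the pin call against the Clusters API 2.0 endpoint /api/2.0/clusters/pin; the workspace URL, token, and cluster ID below are placeholders.
%python
import requests

host = "https://<databricks-instance>"      # placeholder workspace URL
token = "<personal-access-token>"           # placeholder admin token

resp = requests.post(
    f"{host}/api/2.0/clusters/pin",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "<cluster-id>"},    # placeholder cluster ID
)
resp.raise_for_status()   # a 200 response means the cluster is now pinned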
Unpin cluster configurations using the API
Normally, cluster configurations are automatically deleted 30 days after the cluster was last terminated. If you want to keep specific cluster configurations, you can pin them. Up to 100 clusters can be pinned. If you no longer need a pinned cluster, you can unpin it. If you have pinned 100 clusters, you must unpin a cluster before you can pin anot...
R commands fail on custom Docker cluster
Problem You are trying to run R notebooks on a custom Docker cluster (AWS | Azure), but they immediately fail. When you try to execute an R notebook, it returns an error saying the notebook was cancelled. When you review the Cluster driver and worker logs (AWS | Azure) you see a there is no package called 'Rserve' error. Tue Aug 30 16:24:34 UTC 2022...
Append to a DataFrame
To append to a DataFrame, use the union method. %scala val firstDF = spark.range(3).toDF("myCol") val newRow = Seq(20) val appended = firstDF.union(newRow.toDF()) display(appended) %python firstDF = spark.range(3).toDF("myCol") newRow = spark.createDataFrame([[20]]) appended = firstDF.union(newRow) display(appended)...
How to improve performance with bucketing
Bucketing is an optimization technique in Apache Spark SQL. Data is allocated among a specified number of buckets, according to values derived from one or more bucketing columns. Bucketing improves performance by shuffling and sorting data prior to downstream operations such as table joins. The tradeoff is the initial overhead due to shuffling and s...
How to handle blob data contained in an XML file
If you log events in XML format, then every XML event is recorded as a base64 string. In order to run analytics on this data using Apache Spark, you need to use the spark_xml library and the BASE64DECODER API to transform the data for analysis. Problem You need to analyze base64-encoded strings from an XML-formatted log file using Spark. For example...
Simplify chained transformations
Sometimes you may need to perform multiple transformations on your DataFrame: %scala import org.apache.spark.sql.functions._ import org.apache.spark.sql.DataFrame val testDf = (1 to 10).toDF("col") def func0(x: Int => Int, y: Int)(in: DataFrame): DataFrame = { in.filter('col > x(y)) } def func1(x: Int)(in: DataFrame): DataFrame = { in.sele...
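As a rough Python counterpart (assuming PySpark 3.0 or above, where DataFrame.transform is available), the same idea can be expressed by chaining transform calls instead of nesting function applications; the helpers and column names below are illustrative.
%python
from pyspark.sql import functions as F

df = spark.range(1, 11).toDF("col")

def keep_greater_than(threshold):
    # Returns a transformation: DataFrame -> DataFrame
    return lambda d: d.filter(F.col("col") > threshold)

def add_double(d):
    return d.withColumn("doubled", F.col("col") * 2)

result = (df
          .transform(keep_greater_than(3))   # filter step
          .transform(add_double))            # column-adding step
display(result)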
How to dump tables in CSV, JSON, XML, text, or HTML format
You want to send results of your computations in Databricks outside Databricks. You can use BI tools to connect to your cluster via JDBC and export results from the BI tools, or save your tables in DBFS or blob storage and copy the data via REST API. This article introduces JSpark, a simple console tool for executing SQL queries using JDBC on Spark ...
Get and set Apache Spark configuration properties in a notebook
In most cases, you set the Spark config (AWS | Azure ) at the cluster level. However, there may be instances when you need to check (or set) the values of specific Spark configuration properties in a notebook. This article shows you how to display the current value of a Spark configuration property in a notebook. It also shows you how to set a new v...
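A minimal sketch of checking and setting a property from a notebook; the property used here is just an example.
%python
# Read the current value of a Spark configuration property.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Set a new value for the current Spark session.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Verify the change took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))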
Hive UDFs
This article shows how to create a Hive UDF, register it in Spark, and use it in a Spark SQL query. Here is a Hive UDF that takes a long as an argument and returns its hexadecimal representation. %scala import org.apache.hadoop.hive.ql.exec.UDF import org.apache.hadoop.io.LongWritable // This UDF takes a long integer and converts it to a hexadecimal...
Prevent duplicated columns when joining two DataFrames
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This article and notebook demonstrate how to perform a join so that you don’t have duplicated columns. Join on columns If you join on columns, you get duplicated columns. Scala %scala val llist...
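A brief Python sketch of the column-name join form, which keeps a single copy of the join key; the DataFrames and the id column are illustrative.
%python
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "left_val"])
df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "right_val"])

# Joining on a list of column names keeps one "id" column in the result,
# instead of the duplicate columns produced by an expression join such as
# df1["id"] == df2["id"].
joined = df1.join(df2, ["id"], "inner")
joined.show()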
Revoke all user privileges
When user permissions are explicitly granted for individual tables and views, the selected user can access those tables and views even if they don’t have permission to access the underlying database. If you want to revoke a user’s access, you can do so with the REVOKE command. However, the REVOKE command is explicit, and is strictly scoped to the ob...
How to list and delete files faster in Databricks
Scenario Suppose you need to delete a table that is partitioned by year, month, date, region, and service. However, the table is huge, and there will be around 1000 part files per partition. You can list all the files in each partition and then delete them using an Apache Spark job. For example, suppose you have a table that is partitioned by a, b, ...
How to handle corrupted Parquet files with different schema
Problem Let’s say you have a large list of essentially independent Parquet files, with a variety of different schemas. You want to read only those files that match a specific schema and skip the files that don’t match. One solution could be to read the files in sequence, identify the schema, and union the DataFrames together. However, this approach ...
No USAGE permission on database
Problem You are using a cluster running Databricks Runtime 7.3 LTS and above. You have enabled table access control for your workspace (AWS | Azure | GCP) as the admin user, and granted the SELECT privilege to a standard user-group that needs to access the tables. A user tries to access an object in the database and gets a SecurityException error me...
Nulls and empty strings in a partitioned column save as nulls
Problem If you save data containing both empty strings and null values in a column on which the table is partitioned, both values become null after writing and reading the table. To illustrate this, create a simple DataFrame: %scala import org.apache.spark.sql.types._ import org.apache.spark.sql.catalyst.encoders.RowEncoder val data = Seq(Row(1, "")...
Behavior of the randomSplit method
When using randomSplit on a DataFrame, you could potentially observe inconsistent behavior. Here is an example: %python df = spark.read.format('inconsistent_data_source').load() a,b = df.randomSplit([0.5, 0.5]) a.join(broadcast(b), on='id', how='inner').count() Typically this query returns 0. However, depending on the underlying data source or input...
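One common workaround, sketched below, is to materialize the DataFrame (for example with cache) before splitting, so that both halves are computed from the same snapshot of the data; the source path is a placeholder, and whether caching is appropriate depends on your data size.
%python
df = spark.read.format("parquet").load("/path/to/data")   # placeholder source

# Materialize the input so each split is evaluated against the same rows.
df = df.cache()
df.count()   # forces the cache to populate

a, b = df.randomSplit([0.5, 0.5], seed=42)
# With a materialized input, the overlap check described above should
# consistently return 0.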
Job fails when using Spark-Avro to write decimal values to AWS Redshift
Problem In Databricks Runtime versions 5.x and above, when writing decimals to Amazon Redshift using Spark-Avro as the default temp file format, either the write operation fails with the exception: Error (code 1207) while loading data into Redshift: "Invalid digit, Value '"', Pos 0, Type: Decimal" or the write operation writes nulls in place of the ...
Generate schema from case class
Spark provides an easy way to generate a schema from a Scala case class. For case class A, use the method ScalaReflection.schemaFor[A].dataType.asInstanceOf[StructType]. For example: %scala import org.apache.spark.sql.types.StructType import org.apache.spark.sql.catalyst.ScalaReflection case class A(key: String, time: java.sql.Timestamp, date: java....
How to specify skew hints in dataset and DataFrame-based join commands
When you perform a join command with DataFrame or Dataset objects, if you find that the query is stuck on finishing a small number of tasks due to data skew, you can specify the skew hint with the hint("skew") method: df.hint("skew"). The skew join optimization (AWS | Azure | GCP) is performed on the DataFrame for which you specify the skew hint. In...
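For illustration, assuming the skew is concentrated in a particular join column, the hint can also name that column; the tables and the key column below are placeholders.
%python
facts = spark.table("facts")   # placeholder: the larger, skewed relation
dims = spark.table("dims")     # placeholder: the other side of the join

# Attach the skew hint to the skewed relation, naming the skewed column,
# then join as usual.
result = facts.hint("skew", "key").join(dims, "key")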
How to update nested columns
Spark doesn’t support adding new columns or dropping existing columns in nested structures. In particular, the withColumn and drop methods of the Dataset class don’t allow you to specify a column name different from any top level columns. For example, suppose you have a dataset with the following schema: %scala val schema = (new StructType) .a...
Incompatible schema in some files
Problem The Spark job fails with an exception like the following while reading Parquet files: Error in SQL statement: SparkException: Job aborted due to stage failure: Task 20 in stage 11227.0 failed 4 times, most recent failure: Lost task 20.3 in stage 11227.0 (TID 868031, 10.111.245.219, executor 31): java.lang.UnsupportedOperationException: org.a...
Access denied when writing to an S3 bucket using RDD
Problem Writing to an S3 bucket using RDDs fails. The driver node can write, but the worker (executor) node returns an access denied error. Writing with the DataFrame API, however, works fine. For example, let’s say you run the following code: %scala import java.io.File import java.io.Serializable import org.apache.spark.{SparkConf, SparkContext} imp...
Invalid timestamp when loading data into Amazon Redshift
Problem When you use a spark-redshift write operation to save timestamp data to Amazon Redshift, the following error can occur if that timestamp data includes timezone information. Error (code 1206) while loading data into Redshift: "Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SSOF]" Cause The Redshift table is using the Timestamp data typ...
Unable to infer schema for ORC error
Problem You are trying to read ORC files from a directory when you get an error message: org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually. Cause An Unable to infer the schema for ORC error occurs when the schema is not defined and Apache Spark cannot infer the schema due to: An empty directory. Us...
Object ownership is getting changed on dropping and recreating tables
Problem Ownership of SQL objects changes after dropping and recreating them. This can result in job failures due to permission issues. Cause In Databricks Runtime 7.3 LTS, when jobs are run with table ACLs turned off, any action that drops and recreates tables or views preserves the table ACLs that were set the last time the job was run with table AC...
User does not have permission SELECT on ANY File
Problem You are trying to create an external Hive table, but keep getting a User does not have permission SELECT on any file error message. java.lang.SecurityException: User does not have permission SELECT on any file. Table access control (AWS | Azure | GCP) is enabled on your cluster and you are not an admin. Cause The Databricks SQL query analyzer e...
Create tables on JSON datasets
In this article we cover how to create a table on JSON datasets using SerDe. Download the JSON SerDe JAR Open the hive-json-serde 1.3.8 download page. Click on json-serde-1.3.8-jar-with-dependencies.jar to download the file json-serde-1.3.8-jar-with-dependencies.jar. Info You can review the Hive-JSON-Serde GitHub repo for more information on the JAR...
Delete table when underlying S3 bucket is deleted
Problem You are trying to drop or alter a table when you get an error. Error in SQL statement: IOException: Bucket_name … does not exist You can reproduce the error with a DROP TABLE or ALTER TABLE command. %sql DROP TABLE <database-name.table-name>; %sql ALTER TABLE <database-name.table-name> SET LOCATION "<file-system-location>";...
Failure when mounting or accessing Azure Blob storage
Problem When you try to access an already created mount point or create a new mount point, it fails with the error: WASB: Fails with java.lang.NullPointerException Cause This error can occur when the root mount path (such as /mnt/) is also mounted to blob storage. Run the following command to check if the root path is also mounted: %python dbutils.f...
Unable to read files and list directories in a WASB filesystem
Problem When you try reading a file on WASB with Spark, you get the following exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 19, 10.139.64.5, executor 0): shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.a...
Optimize read performance from JDBC data sources
Problem Reading data from an external JDBC database is slow. How can I improve read performance? Solution See the detailed discussion in the Databricks documentation on how to optimize performance when reading data (AWS | Azure | GCP) from an external JDBC database....
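A hedged sketch of the usual knobs (parallel partitioned reads plus a larger fetch size); the URL, table, credentials, and bound values are placeholders you would tune for your database.
%python
jdbc_url = "jdbc:postgresql://<host>:5432/<database>"   # placeholder connection string

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.orders")          # placeholder table
      .option("user", "<user>")
      .option("password", "<password>")
      # Split the read into parallel partitions over a numeric key.
      .option("partitionColumn", "order_id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      # Pull more rows per round trip than the driver default.
      .option("fetchsize", "10000")
      .load())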
Troubleshooting JDBC/ODBC access to Azure Data Lake Storage Gen2
Problem Info In general, you should use Databricks Runtime 5.2 and above, which include a built-in Azure Blob File System (ABFS) driver, when you want to access Azure Data Lake Storage Gen2 (ADLS Gen2). This article applies to users who are accessing ADLS Gen2 storage using JDBC/ODBC instead. When you run a SQL query from a JDBC or ODBC client to ac...
CosmosDB-Spark connector library conflict
This article explains how to resolve an issue running applications that use the CosmosDB-Spark connector in the Databricks environment. Problem Normally if you add a Maven dependency to your Spark cluster, your app should be able to use the required connector libraries. But currently, if you simply specify the CosmosDB-Spark connector’s Maven co-ord...
Failure to detect encoding in JSON
Problem A Spark job fails with an exception containing the message: Invalid UTF-32 character 0x1414141(above 10ffff) at char #1, byte #7) At org.apache.spark.sql.catalyst.json.JacksonParser.parse Cause The JSON data source reader is able to automatically detect the encoding of input JSON files using the BOM at the beginning of the files. However, a BOM is not ...
Inconsistent timestamp results with JDBC applications
Problem When using JDBC applications with Databricks clusters you see inconsistent java.sql.Timestamp results when switching between standard time and daylight saving time. Cause Databricks clusters use UTC by default. java.sql.Timestamp uses the JVM’s local time zone. If a Databricks cluster returns 2021-07-12 21:43:08 as a string, the JVM parses i...
Kafka client terminated with OffsetOutOfRangeException
Problem You have an Apache Spark application that is trying to fetch messages from an Apache Kafka source when it is terminated with a kafkashaded.org.apache.kafka.clients.consumer.OffsetOutOfRangeException error message. Cause Your Spark application is trying to fetch expired data offsets from Kafka. We generally see this in these two scenarios: Sc...
Apache Spark JDBC datasource query option doesn’t work for Oracle database
Problem When you use the query option with the Apache Spark JDBC datasource to connect to an Oracle Database, it fails with this error: java.sql.SQLSyntaxErrorException: ORA-00911: invalid character For example, if you run the following to make a JDBC connection: %scala val df = spark.read .format("jdbc") .option("url", "<url>") .option(...
Accessing Redshift fails with NullPointerException
Problem Sometimes when you read a Redshift table: %scala val original_df = spark.read. format("com.databricks.spark.redshift"). option("url", url). option("user", user). option("password", password). option("query", query). option("forward_spark_s3_credentials", true). option("tempdir", "path"). load()...
Redshift JDBC driver conflict issue
Problem If you attach multiple Redshift JDBC drivers to a cluster, and use the Redshift connector, the notebook REPL might hang or crash with a SQLDriverWrapper error message. 19/11/14 01:01:44 ERROR SQLDriverWrapper: Fatal non-user error thrown in ReplId-9d455-9b970-b2042 java.lang.NoSuchFieldError: PG_SUBPROTOCOL_NAMES at com.amazon.redshi...
ABFS client hangs if incorrect client ID or wrong path used
Problem You are using Azure Data Lake Storage (ADLS) Gen2. When you try to access an Azure Blob File System (ABFS) path from a Databricks cluster, the command hangs. Enable the debug log and you can see the following stack trace in the driver logs: Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: https://login.microso...
Recursive references in Avro schema are not allowed
Problem Apache Spark returns an error when trying to read from an Apache Avro data source if the Avro schema has a recursive reference. org.apache.spark.sql.avro.IncompatibleSchemaException: Found recursive reference in Avro schema, which can not be processed by Spark Cause Spark SQL does not support recursive references in an Avro data source becau...
SQL access control error when using Snowflake as a data source
Problem The Snowflake Connector for Spark is used to read data from, and write data to, Snowflake while working in Databricks. The connector makes Snowflake look like another Spark data source. When you try to query Snowflake, you get a SnowflakeSQLException error message. SnowflakeSQLException: SQL access control error: Insufficient privileges to ...
Cannot access objects written by Databricks from outside Databricks
Problem When you attempt to access an object in an S3 location written by Databricks using the AWS CLI, the following error occurs: ubuntu@0213-174944-clean111-10-93-15-150:~$ aws s3 cp s3://<bucket>/<location>/0/delta/sandbox/deileringDemo__m2/_delta_log/00000000000000000000.json . fatal error: An error occurred (403) when calling the H...
Cannot read Databricks objects stored in the DBFS root directory
Problem An Access Denied error is returned when you attempt to read Databricks objects stored in the DBFS root directory in blob storage from outside a Databricks cluster. Cause This is normal behavior for the DBFS root directory. Databricks stores objects like libraries and other temporary system files in the DBFS root directory. Databricks is the only...
How to calculate the Databricks file system (DBFS) S3 API call cost
The cost of a DBFS S3 bucket is primarily driven by the number of API calls, and secondarily by the cost of storage. You can use the AWS CloudTrail logs to create a table, count the number of API calls, and thereby calculate the exact cost of the API requests. Obtain the following information. You may need to contact your AWS Administrator to get it...
How to specify the DBFS path
When working with Databricks you will sometimes have to access the Databricks File System (DBFS). Accessing files on DBFS is done with standard filesystem commands, however the syntax varies depending on the language or tool used. For example, take the following DBFS path: dbfs:/mnt/test_folder/test_folder1/ Apache Spark Under Spark, you should spec...
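To make the syntax differences concrete, here is a small sketch using the example path; the file name example.txt is a placeholder.
%python
# Spark and dbutils address DBFS with the dbfs:/ scheme (or a bare /mnt/... path).
df = spark.read.text("dbfs:/mnt/test_folder/test_folder1/")
display(dbutils.fs.ls("dbfs:/mnt/test_folder/test_folder1/"))

# Local file APIs (open, os, shutil, ...) reach the same data through the /dbfs mount.
with open("/dbfs/mnt/test_folder/test_folder1/example.txt") as f:   # placeholder file
    print(f.readline())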
Operation not supported during append
Problem You are attempting to append data to a file saved on an external storage mount point and are getting an error message: OSError: [Errno 95] Operation not supported. The error occurs when trying to append to a file from both Python and R. Cause Direct appends and random writes are not supported in FUSE v2, which is available in Databricks Runt...
Parallelize filesystem operations
When you need to speed up copy and move operations, parallelizing them is ...
S3 connection reset error
Problem Your Apache Spark job fails when attempting an S3 operation. The error message Caused by: java.net.SocketException: Connection reset appears in the stack trace. Example stack trace from an S3 read operation: Caused by: javax.net.ssl.SSLException: Connection reset; Request ID: XXXXX, Extended Request ID: XXXXX, Cloud Provider: AWS, Instance I...
Upload large files using DBFS API 2.0 and PowerShell
Using the Databricks REST API to interact with your clusters programmatically can be a great way to streamline workflows with scripts. The API can be called with various tools, including PowerShell. In this article, we are going to take a look at an example DBFS put command using curl and then show you how to execute that same command using PowerShe...
Null column values display as NaN
Problem You have a table with null values in some columns. When you query the table using a select statement in Databricks, the null values appear as null. When you query the table using the same select statement in Databricks SQL, the null values appear as NaN. %sql select * from default.<table-name> where <column-name> is null Databric...
Retrieve queries owned by a disabled user
When a Databricks SQL user is removed from an organization, the queries owned by the user remain, but they are only visible to those who already have permission to access them. A Databricks SQL admin can transfer ownership to other users, as well as delete alerts, dashboards, and queries owned by the disabled user account. Clone a query A Databricks...
Job timeout when connecting to a SQL endpoint over JDBC
Problem You have a job that is reading and writing to an SQL endpoint over a JDBC connection. The SQL warehouse fails to execute the job and you get a java.net.SocketTimeoutException: Read timed out error message. 2022/02/04 17:36:15 - TI_stg_trade.0 - Caused by: com.simba.spark.jdbc42.internal.apache.thrift.transport.TTransportException: java.net.S...
Slowness when fetching results in Databricks SQL
Problem Databricks SQL uses cloud fetch to increase query performance. This is done by default. Instead of using single threaded queries, cloud fetch retrieves data in parallel from cloud storage buckets (such as AWS S3 and Azure Data Lake Storage). Compared to a standard, single threaded fetch, you can see up to a 10X increase in performance using ...
Apache Spark session is null in DBConnect
Problem You are trying to run your code using Databricks Connect ( AWS | Azure | GCP ) when you get a sparkSession is null error message. java.lang.AssertionError: assertion failed: sparkSession is null while trying to executeCollectResult at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(...
Databricks Connect reports version error with Databricks Runtime 6.4
Problem You are using the Databricks Connect client with Databricks Runtime 6.4 and receive an error message which states that the client does not support the cluster. Caused by: java.lang.IllegalArgumentException: The cluster is running server version `dbr-6.4` but this client only supports Set(dbr-5.5). You can find a list of client releases at ht...
Failed to create process error with Databricks CLI in Windows
Problem While trying to access the Databricks CLI (AWS | Azure | GCP) in Windows, you get a failed to create process error message. Cause This can happen: If multiple instances of the Databricks CLI are installed on the system. If the Python path on your Windows system includes a space. Info There is a known issue in pip which causes pip installed s...
GeoSpark undefined function error with DBConnect
Problem You are trying to use the GeoSpark function st_geofromwkt with DBConnect (AWS | Azure | GCP) and you get an Apache Spark error message. Error: org.apache.spark.sql.AnalysisException: Undefined function: 'st_geomfromwkt'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; T...
Get Apache Spark config in DBConnect
You can always view the Spark configuration (AWS | Azure | GCP) for your cluster by reviewing the cluster details in the workspace. If you are using DBConnect (AWS | Azure | GCP) you may want to quickly review the current Spark configuration details without switching over to the workspace UI. This example code shows you how to get the current Spark ...
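A short example of the kind of code the article describes; run through your DBConnect session, it lists every configuration key/value pair the remote cluster reports.
%python
# Print all Spark configuration properties visible to the current session,
# sorted by key so the output is easy to scan.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key} = {value}")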
How to sort S3 files by modification time in Databricks notebooks
Problem When you use the dbutils utility to list the files in an S3 location, the files are returned in an arbitrary order. However, dbutils doesn’t provide any method to sort the files based on their modification time. dbutils doesn’t list a modification time either. Solution Use the Hadoop filesystem API to sort the S3 files, as shown here: %scala import org....
Invalid Access Token error when running jobs with Airflow
Problem When you run scheduled Airflow Databricks jobs, you get this error: Invalid Access Token : 403 Forbidden Error Cause To run or schedule Databricks jobs through Airflow, you need to configure the Databricks connection using the Airflow web UI. Any of the following incorrect settings can cause the error: Set the host field to the Databricks wo...
ProtoSerializer stack overflow error in DBConnect
Problem You are using DBConnect (AWS | Azure | GCP) to run a PySpark transformation on a DataFrame with more than 100 columns when you get a stack overflow error. py4j.protocol.Py4JJavaError: An error occurred while calling o945.count. : java.lang.StackOverflowError at java.lang.Class.getEnclosingMethodInfo(Class.java:1072) at java.lang.Clas...
Use tcpdump to create pcap files
If you want to analyze the network traffic between nodes on a specific cluster, you can install tcpdump on the cluster and use it to dump the network packet details to pcap files. The pcap files can then be downloaded to a local machine for analysis. Create the tcpdump init script Run this sample script in a notebook on the cluster to create the ini...
Terraform registry does not have a provider error
Problem You are installing the Databricks Terraform provider ( AWS | Azure | GCP) and get a Databricks provider registry error. Error while installing hashicorp/databricks: provider registry registry.terraform.io does not have a provider named registry.terraform.io/hashicorp/databricks Cause This error occurs when the required_providers block is not...
A file referenced in the transaction log cannot be found
Problem Your job fails with an error message: A file referenced in the transaction log cannot be found. Example stack trace: Error in SQL statement: SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 106, XXX.XXX.XXX.XXX, executor 0): com.databricks.sql.io.FileRe...
Cannot delete data using JDBC in Eclipse
Problem You cannot delete data from a Delta table using JDBC from your local Eclipse environment. The same delete operation works when run in a notebook. You can connect to Databricks from your local environment. A select query works. Integer count = namedlocalJdbcTemplate.queryForObject("SELECT COUNT(*) FROM <table-name> ", new MapSqlParamete...
Compare two versions of a Delta table
Delta Lake supports time travel, which allows you to query an older snapshot of a Delta table. One common use case is to compare two versions of a Delta table in order to identify what changed. For more details on time travel, please review the Delta Lake time travel documentation (AWS | Azure | GCP). Identify all differences You can use a SQL SELEC...
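A minimal Python sketch of the comparison using time travel reads and exceptAll; the table location and version numbers are placeholders.
%python
path = "/delta/events"   # placeholder table location

v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v1 = spark.read.format("delta").option("versionAsOf", 1).load(path)

# Rows added between version 0 and version 1.
added = v1.exceptAll(v0)
# Rows removed between version 0 and version 1.
removed = v0.exceptAll(v1)

display(added)
display(removed)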
Converting from Parquet to Delta Lake fails
Problem You are attempting to convert a Parquet file to a Delta Lake file. The directory containing the Parquet file contains one or more subdirectories. The conversion fails with the error message: Expecting 0 partition column(s): [], but found 1 partition column(s): [<column_name>] from parsing the file name: <path_to_the_file_location>...
Delta Merge cannot resolve nested field
Problem You are attempting a Delta Merge with automatic schema evolution, but it fails with a Delta Merge: cannot resolve 'field' due to data type mismatch error message. Cause This can happen if you have made changes to the nested column fields. For example, assume we have a column called Address with the fields streetName, houseNumber, and city ne...
Delete your streaming query checkpoint and restart
Problem Your job fails with a Delta table <value> doesn't exist. Please delete your streaming query checkpoint and restart. error message. Cause Two different streaming sources are configured to use the same checkpoint directory. This is not supported. For example, assume streaming query A streams data from Delta table A, and uses the director...
How Delta cache behaves on an autoscaling cluster
This article is about how Delta cache (AWS | Azure | GCP) behaves on an auto-scaling cluster, which removes or adds nodes as needed. When a cluster downscales and terminates nodes: A Delta cache behaves in the same way as an RDD cache. Whenever a node goes down, all of the cached data in that particular node is lost. Delta cache data is not moved fr...
How to improve performance of Delta Lake MERGE INTO queries using partition pruning
This article explains how to trigger partition pruning in Delta Lake MERGE INTO (AWS | Azure | GCP) queries from Databricks. Partition pruning is an optimization technique to limit the number of partitions that are inspected by a query. Discussion MERGE INTO is an expensive operation when used with Delta tables. If you don’t partition the underlying...
Best practices for dropping a managed Delta Lake table
Regardless of how you drop a managed table, it can take a significant amount of time, depending on the data size. Delta Lake managed tables in particular contain a lot of metadata in the form of transaction logs, and they can contain duplicate data files. If a Delta table has been in use for a long time, it can accumulate a very large amount of data...
HIVE_CURSOR_ERROR when reading a table in Athena
Problem You create an external table in Athena and integrate it with Delta Lake using the instructions in the Presto and Athena to Delta Lake integration documentation. The external table in Athena is defined in the Apache Hive metastore. You run a select query on external table from the Athena Query Editor and it returns a HIVE_CURSOR_ERROR. HIVE_C...
Access denied when writing Delta Lake tables to S3
Problem Writing DataFrame contents in Delta Lake format to an S3 location can cause an error: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C827672D85516BA9; S3 Extended Request ID: Cause A write operation involving the Delta Lake format requires permissions...
Delta Lake write job fails with java.lang.UnsupportedOperationException
Problem Delta Lake write jobs sometimes fail with the following exception: java.lang.UnsupportedOperationException: com.databricks.backend.daemon.data.client.DBFSV1.putIfAbsent(path: Path, content: InputStream). DBFS v1 doesn't support transactional writes from multiple clusters. Please upgrade to DBFS v2. Or you can disable multi-cluster writes by ...
How to populate or update columns in an existing Delta table
Problem You have an existing Delta table, with a few empty columns. You need to populate or update those columns with data from a raw Parquet file. Solution In this example, there is a customers table, which is an existing Delta table. It has an address column with missing values. The updated data exists in Parquet format. Create a DataFrame from th...
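A hedged sketch of that flow using the Delta Lake Python API; the key column customer_id and the Parquet path are illustrative, not taken from the article.
%python
from delta.tables import DeltaTable

# Raw data containing the values to backfill (placeholder path and columns).
updates = spark.read.parquet("/raw/customer_addresses.parquet")

customers = DeltaTable.forName(spark, "customers")

# Match on the key column and copy the address into the existing rows.
(customers.alias("t")
 .merge(updates.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdate(set={"address": "s.address"})
 .execute())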
Identify duplicate data on append operations
A common issue when performing append operations on Delta tables is duplicate data. For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table. In this article, we review basic troubleshooting steps that you can use to...
Object lock error when writing Delta Lake tables to S3
Problem You are trying to perform a Delta write operation to an S3 bucket and get an error message. com.amazonaws.services.s3.model.AmazonS3Exception: Content-MD5 HTTP header is required for Put Part requests with Object Lock parameters Cause Delta Lake does not support S3 buckets with object lock enabled. Solution You should use an S3 bucket that do...
Optimize a Delta sink in a structured streaming application
You are using a Delta table as the sink for your structured streaming application and you want to optimize the Delta table so that queries are faster. If your structured streaming application has a very frequent trigger interval, it may not create sufficient files that are eligible for compaction in each microbatch. The autoOptimize operation compac...
Delta Lake UPDATE query fails with IllegalState exception
Problem When you execute a Delta Lake UPDATE, DELETE, or MERGE query that uses Python UDFs in any of its transformations, it fails with the following exception: AWS java.lang.UnsupportedOperationException: Error in SQL statement: IllegalStateException: File (s3a://xxx/table1) to be rewritten not found among candidate files: s3a://xxx/table1/part-000...
Unable to cast string to varchar
Problem You are trying to cast a string type column to varchar but it isn’t working. Info The varchar data type (AWS | Azure | GCP) is available in Databricks Runtime 8.0 and above. Create a simple Delta table, with one column as type string.%sql CREATE OR REPLACE TABLE delta_table1 (`col1` string) USING DELTA; Use SHOW TABLE on the newly created ta...
Vacuuming with zero retention results in data loss
Problem You add data to a Delta table, but the data disappears without warning. There is no obvious error message. Cause This can happen when spark.databricks.delta.retentionDurationCheck.enabled is set to false and VACUUM is configured to retain 0 hours. %sql VACUUM <name-of-delta-table> RETAIN 0 HOURS OR %sql VACUUM delta.`<delta_table_pa...
Z-Ordering will be ineffective, not collecting stats
Problem You are trying to optimize a Delta table by Z-Ordering and receive an error about not collecting stats for the columns. AnalysisException: Z-Ordering on [col1, col2] will be ineffective, because we currently do not collect stats for these columns. Info Please review Z-Ordering (multi-dimensional clustering) (AWS | Azure | GCP) for more infor...
Change cluster config for Delta Live Table pipeline
Problem You are using Delta Live Tables and want to change the cluster configuration. You create a pipeline, but only have options to enable or disable Photon and select the number of workers. Cause When you create a Delta Live Table pipeline, most parameters are configured with default values. These values cannot be configured before the pipeline i...
Different tables with same data generate different plans when used in same query
Problem Assume you have two Delta tables test_table_1 and test_table_2. Both tables have the same schema, same data volume, same partitions, and contain the same number of files. You are doing a join transformation with another Delta table, test_table_join, which has a million records. When you run the below join queries using test_table_1 and test_...
Allow spaces and special characters in nested column names with Delta tables
Problem It is common for JSON files to contain nested struct columns. Nested column names in a JSON file can have spaces between the names. When you use Apache Spark to read or write JSON files with spaces in the nested column names, you get an AnalysisException error message. For example, if you try to read a JSON file, evaluate the DataFrame, and ...
Delta writing empty files when source is empty
Problem Delta writes can result in the creation of empty files if the source is empty. This can happen with a regular Delta write or a MERGE INTO (AWS | Azure | GCP) operation. If your streaming application is writing to a target Delta table and your source data is empty on certain micro batches, it can result in writing empty files to your target D...
Delta Live Tables pipelines are not running VACUUM automatically
Problem Delta Live Tables supports auto-vacuum by default. You set up a Delta Live Tables pipeline, but notice VACUUM is not running automatically. Cause A Delta Live Tables pipeline needs a separate maintenance cluster configuration (AWS | Azure | GCP) inside the pipeline settings to ensure VACUUM runs automatically. If the maintenance cluster is n...
VACUUM best practices on Delta Lake
Why use VACUUM on Delta Lake? VACUUM is used to clean up unused and stale data files that are taking up unnecessary storage space. Removing these files can help reduce storage costs. When you run VACUUM on a Delta table it removes the following files from the underlying file system: any data files that are not maintained by Delta Lake, as well as stale...
OPTIMIZE is only supported for Delta tables error on Delta Lake
Problem You run OPTIMIZE on a Delta table and get an error message saying it is only supported on Delta tables. Error: `<database-name>`.`<table-name>` is not a Delta table. OPTIMIZE is only supported for Delta tables. Cause This can happen if the target table's storage location was modified and the table was recreated with a new storage ...
Distinguish active and dead jobs
Problem On clusters where there are too many concurrent jobs, you often see some jobs stuck in the Spark UI without any progress. This complicates identifying which are the active jobs/stages versus the dead jobs/stages. Cause Whenever there are too many concurrent jobs running on a cluster, there is a chance that the Spark internal eventListenerBus...
Spark job fails with Driver is temporarily unavailable
Problem A Databricks notebook returns the following error: Driver is temporarily unavailable This issue can be intermittent or persistent. A related error message is: Lost connection to cluster. The notebook may have been detached. Cause One common cause for this error is that the driver is undergoing a memory bottleneck. When this happens, the driver cras...
How to delete all jobs using the REST API
Run the following commands to delete all jobs in a Databricks workspace. Identify the jobs to delete and list them in a text file:%sh curl -X GET -u "Bearer: <token>" https://<databricks-instance>/api/2.0/jobs/list | grep -o -P 'job_id.{0,6}' | awk -F':' '{print $2}' >> job_id.txt Run the curl command in a loop to delete the identif...
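The deletion loop itself is truncated above; here is a hedged sketch of the same idea using Python and the Jobs API 2.0 delete endpoint instead of the article's curl loop. The workspace URL and token are placeholders, and job_id.txt is the file produced by the list step.
%python
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"  # placeholder
TOKEN = "<token>"  # placeholder personal access token

# Read the job IDs collected by the list step; strip stray commas/whitespace.
with open("job_id.txt") as f:
    job_ids = [line.strip().rstrip(",") for line in f if line.strip()]

# Delete each job via POST /api/2.0/jobs/delete.
for job_id in job_ids:
    resp = requests.post(
        f"{DATABRICKS_INSTANCE}/api/2.0/jobs/delete",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": int(job_id)},
    )
    resp.raise_for_status()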
Identify less used jobs
The workspace has a limit on the number of jobs that can be shown in the UI. The current job limit is 1000. If you exceed the job limit, you receive a QUOTA_EXCEEDED error message. 'error_code':'QUOTA_EXCEEDED','message':'The quota for the number of jobs has been reached. The current quota is 1000. This quota is only applied to jobs created through ...
Job cluster limits on notebook output
Problem You are running a notebook on a job cluster and you get an error message indicating that the output is too large. The output of the notebook is too large. Cause: rpc response (of 20975548 bytes) exceeds limit of 20971520 bytes Cause This error message can occur in a job cluster whenever the notebook output is greater than 20 MB. If you are u...
Job fails, but Apache Spark tasks finish
Problem Your Databricks job reports a failed status, but all Spark jobs and tasks have successfully completed. Cause You have explicitly called spark.stop() or System.exit(0) in your code. If either of these are called, the Spark context is stopped, but the graceful shutdown and handshake with the Databricks job service does not happen. Solution Do ...
Job fails due to job rate limit
Problem A Databricks notebook or Jobs API request returns the following error: Error : {"error_code":"INVALID_STATE","message":"There were already 1000 jobs created in past 3600 seconds, exceeding rate limit: 1000 job creations per 3600 seconds."} Cause This error occurs because the number of jobs per hour exceeds the limit of 1000 established by Da...
Create table in overwrite mode fails when interrupted
Problem When you attempt to rerun an Apache Spark write operation by cancelling the currently running job, the following error occurs: Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.` testtable`'). The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_ testtable) already exists.; Caus...
Apache Spark Jobs hang due to non-deterministic custom UDF
Problem Sometimes Apache Spark jobs hang indefinitely due to the non-deterministic behavior of a Spark User-Defined Function (UDF). Here is an example of such a function: %scala val convertorUDF = (commentCol: String) => { // UDF definition } val translateColumn = udf(convertorUDF) If you call this UDF using the withColumn() A...
Apache Spark job fails with Failed to parse byte string
Problem Spark-submit jobs fail with a Failed to parse byte string: -1 error message. java.util.concurrent.ExecutionException: java.lang.NumberFormatException: Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. Failed to parse byte string: -1 at java.util.concurre...
Apache Spark UI shows wrong number of jobs
Problem You are reviewing the number of active Apache Spark jobs on a cluster in the Spark UI, but the number is too high to be accurate. If you restart the cluster, the number of jobs shown in the Spark UI is correct at first, but over time it grows abnormally high. Cause The Spark UI is not always accurate for large, or long-running, clusters due ...
Apache Spark job fails with a Connection pool shut down error
Problem A Spark job fails with the error message java.lang.IllegalStateException: Connection pool shut down when attempting to write data into a Delta table on S3. Cause Spark jobs writing to S3 are limited to a maximum number of simultaneous connections. The java.lang.IllegalStateException: Connection pool shut down occurs when this connection pool...
Job fails with atypical errors message
Problem Your job run fails with a throttled due to observing atypical errors error message. Cluster became unreachable during run Cause: xxx-xxxxxx-xxxxxxx is throttled due to observing atypical errors Cause The jobs on this cluster have returned too many large results to the Apache Spark driver node. As a result, the chauffeur service runs out of m...
Apache Spark job fails with maxResultSize exception
Problem A Spark job fails with a maxResultSize exception: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB) Cause This error occurs because the configured size limit was exceeded. The size limit applies to the total serialized ...
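One common mitigation, sketched below under the assumption that the oversized result comes from collecting a large DataFrame to the driver, is to write the result to storage instead of calling collect(); raising spark.driver.maxResultSize in the cluster Spark config is the other option. The table and path names are illustrative.
%python
# Illustrative large DataFrame (table name is a placeholder).
big_df = spark.table("my_large_table").filter("event_date >= '2022-01-01'")

# big_df.collect() would serialize every row back to the driver and can
# exceed spark.driver.maxResultSize. Write the rows out instead, and only
# bring small aggregates back to the driver.
big_df.write.mode("overwrite").format("delta").save("/tmp/illustrative/output")
row_count = big_df.count()  # a single scalar is safe to return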
Databricks job fails because library is not installed
Problem A Databricks job fails because the job requires a library that is not yet installed, causing Import errors. Cause The error occurs because the job starts running before required libraries install. If you run a job on a cluster in either of the following situations, the cluster can experience a delay in installing libraries: When you start an...
Job failure due to Azure Data Lake Storage (ADLS) CREATE limits
Problem When you run a job that involves creating files in Azure Data Lake Storage (ADLS), either Gen1 or Gen2, the following exception occurs: Caused by: java.io.IOException: CREATE failed with error 0x83090c25 (Files and folders are being created at too high a rate). [745c5836-264e-470c-9c90-c605f1c100f5] failed with error 0x83090c25 (Files and fo...
Job fails with invalid access token
Problem Long running jobs, such as streaming jobs, fail after 48 hours when using dbutils.secrets.get() (AWS | Azure | GCP). For example: %python streamingInputDF1 = ( spark .readStream .format("delta") .table("default.delta_sorce") ) def writeIntodelta(batchDF, batchId): table_name = dbutil...
How to ensure idempotency for jobs
When you submit jobs through the Databricks Jobs REST API, idempotency is not guaranteed. If the client request is timed out and the client resubmits the same request, you may end up with duplicate jobs running. To ensure job idempotency when you submit jobs through the Jobs API, you can use an idempotency token to define a unique value for a specif...
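As a hedged sketch of the idempotency token approach (the endpoint and payload follow the public Jobs API 2.1 runs/submit contract; the workspace URL, token, cluster ID, and notebook path are placeholders), generate a token once and reuse the same value on any retry:
%python
import uuid
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"  # placeholder
TOKEN = "<token>"  # placeholder personal access token

# Generate the token once and reuse it for any retry of this request.
idempotency_token = str(uuid.uuid4())

payload = {
    "run_name": "one-time-run",
    "idempotency_token": idempotency_token,
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Repos/illustrative/notebook"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

# If this request times out and is resubmitted with the same token,
# the service returns the existing run instead of creating a duplicate.
resp = requests.post(
    f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])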
Monitor running jobs with a Job Run dashboard
The Job Run dashboard is a notebook that displays information about all of the jobs currently running in your workspace. To configure the dashboard, you must have permission to attach a notebook to an all-purpose cluster in the workspace you want to monitor. If an all-purpose cluster does not exist, you must have permission to create one. Once the d...
Streaming job has degraded performance
Problem You have a streaming job whose performance degrades over time. You start a new streaming job with the same configuration and same source, and it performs better than the existing job. Cause Issues with old checkpoints can result in performance degradation in long-running streaming jobs. This can happen if the job was intermittently ha...
Task deserialization time is high
Problem Your tasks are running slower than expected. You review the stage details in the Spark UI on your cluster and see that task deserialization time is high. Cause Cluster-installed libraries (AWS | Azure | GCP) are only installed on the driver when the cluster is started. These libraries are only installed on the executors when the first tasks ...
Pass arguments to a notebook as a list
There is no direct way to pass arguments to a notebook as a dictionary or list. You can work around this limitation by serializing your list as a JSON file and then passing it as one argument. After passing the JSON file to the notebook, you can parse it with json.loads(). Instructions Define the argument list and convert it to a JSON file. Start by...
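A minimal sketch of the workaround (the notebook path, timeout, and the argument name "args" are illustrative): serialize the list with json.dumps() in the calling notebook and parse it with json.loads() in the called notebook.
%python
import json

# Calling notebook: pass the list as a single JSON-encoded argument.
my_list = ["2021-01-01", "2021-01-02", "2021-01-03"]
dbutils.notebook.run(
    "/Repos/illustrative/child_notebook", 600, {"args": json.dumps(my_list)}
)

# Called notebook: read the argument back and parse it into a Python list.
parsed_list = json.loads(dbutils.widgets.get("args"))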
Uncommitted files causing data duplication
Problem You had a network issue (or similar) while a write operation was in progress. You are rerunning the job, but partially uncommitted files during the failed run are causing unwanted data duplication. Cause How Databricks commit protocol works: The DBIO commit protocol (AWS | Azure | GCP) is transactional. Files are only committed after a trans...
Multi-task workflows using incorrect parameter values
Problem Using key-value parameters in a multi-task workflow is a common use case. It is normal to have multiple tasks running in parallel, and each task can have different parameter values for the same key. These key-value parameters are read within the code and used by each task. For example, assume you have four tasks: task1, task2, task3, and task...
Job fails with Spark Shuffle FetchFailedException error
Problem If your application contains any aggregation or join stages, the execution will require a Spark Shuffle stage. Depending on the specific configuration used, if you are running multiple streaming queries on an interactive cluster you may get a shuffle FetchFailedException error. ShuffleMapStage has failed the maximum allowable number of times...
Users unable to view job results when using remote Git source
Problem You are running a job using notebooks that are stored in a remote Git repository (AWS | Azure | GCP). Databricks users with Can View permissions (who are not workspace admins or owners of the job) cannot access or view the results of ephemeral jobs submitted via dbutils.notebook.run() from the parent notebook. Cause When job visibility control ...
Single scheduled job tries to run multiple times
Problem You schedule a job (AWS | Azure | GCP) to run once per day, using Quartz Cron Syntax, but the job tries to run multiple times on the same day. Cause When the job was configured, it was scheduled by manually entering the cron syntax and a special character * was accidentally set for the seconds value. This tells the cron scheduler to run the ...
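To illustrate (the schedule strings below are examples, not from the article): in Quartz cron syntax the fields are seconds, minutes, hours, day-of-month, month, and day-of-week, so a * in the seconds field fires once per second during the matching minute.
%python
# Fires every second from 06:00:00 through 06:00:59 -- the job appears to
# run many times on the same day.
bad_schedule = "* 0 6 * * ?"

# Fires exactly once per day, at 06:00:00.
good_schedule = "0 0 6 * * ?"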
Increase the number of tasks per stage
When using the spark-xml package, you can increase the number of tasks per stage by changing the configuration setting spark.hadoop.mapred.max.split.size to a lower value in the cluster’s Spark config (AWS | Azure ). This configuration setting controls the input block size. When data is read from DBFS, it is divided into input blocks, which are then...
Maximum execution context or notebook attachment limit reached
Problem Notebook or job execution stops and returns either of the following errors: Run result unavailable: job failed with error message Context ExecutionContextId(1731742567765160237) is disconnected. Can’t attach this notebook because the cluster has reached the attached notebook limit. Detach a notebook and retry. Cause When you attach a noteboo...
Serialized task is too large
If you see the following error message, you may be able to fix this error by changing the Spark config (AWS | Azure ) when you start the cluster. Serialized task XXX:XXX was XXX bytes, which exceeds max allowed: spark.rpc.message.maxSize (XXX bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values. To change ...
Cannot import module in egg library
Problem You try to install an egg library to your cluster and it fails with a message that a module in the library cannot be imported. Even a simple import fails. import sys egg_path='/dbfs/<path-to-egg-file>/<egg-file>.egg' sys.path.append(egg_path) import shap_master Cause This error message occurs due to the way the library is pac...
Cannot import TabularPrediction from AutoGluon
Problem You are trying to import TabularPrediction from AutoGluon, but are getting an error message. ImportError: cannot import name 'TabularPrediction' from 'autogluon' (unknown location) This happens when AutoGluon is installed via a notebook or as a cluster-installed library (AWS | Azure | GCP). You can reproduce the error by running the import c...
Latest PyStan fails to install on Databricks Runtime 6.4
Problem You are trying to install the PyStan PyPi package on a Databricks Runtime 6.4 Extended Support cluster and get a ManagedLibraryInstallFailed error message. java.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, pystan, --disable-pip-version-check) exited wit...
Library unavailability causing job failures
Problem You are launching jobs that import external libraries and get an Import Error. When a job causes a node to restart, the job fails with the following error message: ImportError: No module named XXX Cause The Cluster Manager is part of the Databricks service that manages customer Apache Spark clusters. It sends commands to install Python and R...
How to correctly update a Maven library in Databricks
Problem You make a minor update to a library in the repository, but you don’t want to change the version number because it is a small change for testing purposes. When you attach the library to your cluster again, your code changes are not included in the library. Cause One strength of Databricks is the ability to install third-party or custom libra...
Init script fails to download Maven JAR
Problem You have an init script that is attempting to install a library via Maven, but it fails when trying to download a JAR. https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.4.1/rapids-4-spark_2.12-0.4.1.jar%0D Resolving repo1.maven.org (repo1.maven.org)... 151.101.248.209 Connecting to repo1.maven.org (repo1.maven.org)|151.101.248....
Install package using previous CRAN snapshot
Problem You are trying to install a library package via CRAN, and are getting a Library installation failed for library due to infra fault error message. Library installation failed for library due to infra fault for Some(cran { package: "<name-of-package>" } ). Error messages: java.lang.RuntimeException: Installation failed with message: Erro...
Install PyGraphViz
PyGraphViz Python libraries are used to plot causal inference networks. If you try to install PyGraphViz as a standard library, it fails due to dependency errors. PyGraphViz has the following dependencies: python3-dev graphviz libgraphviz-dev pkg-config Install via notebook Install the dependencies with apt-get.%sh sudo apt-get install -y python3-de...
Install Turbodbc via init script
Turbodbc is a Python module that uses the ODBC interface to access relational databases. It has dependencies on libboost-all-dev, unixodbc-dev, and python-dev packages, which need to be installed in order. You can install these manually, or you can use an init script to automate the install. Create the init script Run this sample script in a noteboo...
Cannot uninstall library from UI
Problem Usually, libraries can be uninstalled in the Clusters UI. If the checkbox to select the library is disabled, then it’s not possible to uninstall the library from the UI. Cause If you create a library using REST API version 1.2 and if auto-attach is enabled, the library is installed on all clusters. In this scenario, the Clusters UI checkbox ...
Error when installing Cartopy on a cluster
Problem You are trying to install Cartopy on a cluster and you receive a ManagedLibraryInstallFailed error message. java.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, cartopy==0.17.0, --disable-pip-version-check) exited with code 1. ERROR: Command errored out ...
Error when installing pyodbc on a cluster
Problem One of the following errors occurs when you use pip to install the pyodbc library. java.lang.RuntimeException: Installation failed with message: Collecting pyodbc "Library installation is failing due to missing dependencies. sasl and thrift_sasl are optional dependencies for SASL or Kerberos support" Cause Although sasl and thrift_sasl are o...
Libraries fail with dependency exception
Problem You have a Python function that is defined in a custom egg or wheel file and also has dependencies that are satisfied by another custom package installed on the cluster. When you call this function, it returns an error that says the requirement cannot be satisfied. org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnv...
Libraries failing due to transient Maven issue
Problem Job fails because libraries cannot be installed. Library resolution failed. Cause: java.lang.RuntimeException: Cannot download some libraries due to transient Maven issue. Please try again later Cause After a Databricks upgrade, your cluster attempts to download any required libraries from Maven. After downloading, the libraries are stored a...
New job fails when adding a library from DBFS or S3
Problem You create a new job and attempt to add a library from DBFS or S3 storage. The workspace UI returns an error. Error: Uncaught TypeError: Cannot read property 'concat' of undefined Reload the page and try again. If the error persists, contact support. Clicking Reload page does not resolve the error message. Cause This is a known issue. A fix ...
Reading .xlsx files with xlrd fails
Problem You have xlrd installed on your cluster and are attempting to read files in the Excel .xlsx format when you get an error. XLRDError: Excel xlsx file; not supported Cause xlrd 2.0.0 and above can only read .xls files. Support for .xlsx files was removed from xlrd due to a potential security vulnerability. Solution Use openpyxl to open .xl...
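A minimal sketch of the openpyxl-based fix (the file path is illustrative; openpyxl must be installed on the cluster):
%python
import pandas as pd

# Read an .xlsx file with the openpyxl engine instead of xlrd.
df = pd.read_excel("/dbfs/tmp/illustrative/report.xlsx", engine="openpyxl")
display(df)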
Remove Log4j 1.x JMSAppender and SocketServer classes from classpath
Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. Databricks also does not use the affected classes from Log4j 1.x with known vul...
Replace a default library jar
Databricks includes a number of default Java and Scala libraries. You can replace any of these libraries with another version by using a cluster-scoped init script to remove the default library jar and then install the version you require. Warning Removing default libraries and installing new versions may cause instability or completely break your D...
Python command fails with AssertionError: wrong color format
Problem You run a Python notebook and it fails with an AssertionError: wrong color format message. An example stack trace: File "/local_disk0/tmp/1599775649524-0/PythonShell.py", line 39, in <module> from IPython.nbconvert.filters.ansi import ansi2html File "<frozen importlib._bootstrap>", line 983, in _find_and_load File "<...
PyPMML fails with Could not find py4j jar error
Problem PyPMML is a Python PMML scoring library. After installing PyPMML in a Databricks cluster, it fails with a Py4JError: Could not find py4j jar error. %python from pypmml import Model modelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml') Error : Py4JError: Could not find py4j jar at Cause This error occurs due to a dependency on the defa...
TensorFlow fails to import
Problem You have TensorFlow installed on your cluster. When you try to import TensorFlow, it fails with an Invalid Syntax or import error. Cause The version of protobuf installed on your cluster is not compatible with your version of TensorFlow. Solution Use a cluster-scoped init script to install TensorFlow with matching versions of NumPy and proto...
Verify the version of Log4j on your cluster
Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. If you are using Log4j within your cluster (for example, if you are processing ...
Apache Spark jobs fail with Environment directory not found error
Problem After you install a Python library (via the cluster UI or by using pip), your Apache Spark jobs fail with an Environment directory not found error message. org.apache.spark.SparkException: Environment directory not found at /local_disk0/.ephemeral_nfs/cluster_libraries/python Cause Libraries are installed on a Network File System (NFS) on th...
Use Databricks Repos with Docker container services
Introduction Depending on your use case, you may want to use both Docker Container Services (DCS) and Databricks Repos (AWS | Azure | GCP) at the same time. DCS does not work with Databricks Repos by default; however, you can use a custom init script to enable both. If you have not installed an init script to configure DCS with Databricks Repos you may ...
Copy installed libraries from one cluster to another
If you have a highly customized Databricks cluster, you may want to duplicate it and use it for other projects. When you clone a cluster, only the Apache Spark configuration and other cluster configuration information is copied. Installed libraries are not copied by default. To copy the installed libraries, you can run a Python script after cloning ...
Conda fails to download packages from Anaconda
Problem You are attempting to download packages from the Anaconda repository and get a PackagesNotFoundError error message. This error can occur when using %conda, or %sh conda in notebooks, and when using Conda in an init script. Cause Anaconda Inc. updated the terms of service for repo.anaconda.com and anaconda.org/anaconda. Based on the Anaconda ...
Download artifacts from MLflow
By default, the MLflow client saves artifacts to an artifact store URI during an experiment. The artifact store URI is similar to /dbfs/databricks/mlflow-tracking/<experiment-id>/<run-id>/artifacts/. This artifact store is an MLflow-managed location, so you cannot download artifacts directly. You must use client.download_artifacts in the ...
How to extract feature information for tree-based Apache SparkML pipeline models
When you are fitting a tree-based model, such as a decision tree, random forest, or gradient boosted tree, it is helpful to be able to review the feature importance levels along with the feature names. Typically models in SparkML are fit as the last stage of the pipeline. To extract the relevant feature information from the pipeline with the tree mo...
Fitting an Apache SparkML model throws error
Problem Databricks throws an error when fitting a SparkML model or Pipeline: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 4 times, most recent failure: Lost task 0.3 in stage 162.0 (TID 168, 10.205.250.130, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfu...
H2O.ai Sparkling Water cluster not reachable
Problem You are trying to initialize H2O.ai’s Sparkling Water on Databricks Runtime 7.0 and above when you get an H2OClusterNotReachableException error message. %scala import ai.h2o.sparkling._ val h2oContext = H2OContext.getOrCreate() ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster X.X.X.X:54321 - sparkling-water-ro...
How to perform group K-fold cross validation with Apache Spark
Cross validation randomly splits the training data into a specified number of folds. To prevent data leakage, where the same data shows up in multiple folds, you can use groups. scikit-learn supports group K-fold cross validation to ensure that the folds are distinct and non-overlapping. On Spark you can use the spark-sklearn library, which distribute...
Error when importing OneHotEncoderEstimator
Problem You have migrated a notebook from Databricks Runtime 6.4 for Machine Learning or below to Databricks Runtime 7.3 for Machine Learning or above. You are attempting to import OneHotEncoderEstimator and you get an import error. ImportError: cannot import name 'OneHotEncoderEstimator' from 'pyspark.ml.feature' (/databricks/spark/python/pyspark/m...
MLflow project fails to access an Apache Hive table
Problem You have an MLflow project that fails to access a Hive table and returns a Table or view not found error. pyspark.sql.utils.AnalysisException: "Table or view not found: `default`.`tab1`; line 1 pos 21;\n'Aggregate [unresolvedalias(count(1), None)]\n+- 'UnresolvedRelation `default`.`tab1`\n" xxxxx ERROR mlflow.cli: === Run (ID 'xxxxx') failed...
How to speed up cross-validation
Hyperparameter tuning of Apache SparkML models takes a very long time, depending on the size of the parameter grid. You can improve the performance of the cross-validation step in SparkML to speed things up: Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multi...
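As a hedged sketch of the caching advice (the estimator, columns, and parameter grid are illustrative, train_df is assumed to already contain features and label columns, and the parallelism setting is available in Spark 2.3 and above):
%python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Cache the training data so each fold/parameter combination does not
# re-read and re-transform it from the source.
train_df = train_df.cache()

lr = LinearRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="label"),
    numFolds=3,
    parallelism=4,  # evaluate up to 4 models concurrently
)
cv_model = cv.fit(train_df)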
Hyperopt fails with maxNumConcurrentTasks error
Problem You are tuning machine learning parameters using Hyperopt when your job fails with a py4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist error. You are using a Databricks Runtime for Machine Learning (Databricks Runtime ML) cluster. Cause Databricks Runtime ML has a compatible version of Hyperopt pre-installed (AWS | Azure | ...
Incorrect results when using documents as inputs
Problem You have an ML model that takes documents as inputs, specifically, an array of strings. You use a feature extractor like TfidfVectorizer to convert the documents to an array of strings and ingest the array into the model. The model is trained, and predictions happen in the notebook, but model serving doesn’t return the expected results for JS...
Errors when accessing MLflow artifacts without using the MLflow client
MLflow experiment permissions (AWS | Azure) are now enforced on artifacts in MLflow Tracking, enabling you to easily control access to your datasets, models, and other files. Invalid mount exception Problem When trying to access an MLflow run artifact using Databricks File System (DBFS) commands, such as dbutils.fs, you get the following error: com....
Experiment warning when custom artifact storage location is used
Problem When you create an MLflow experiment with a custom artifact location, you get the following warning: Cause MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. MLflow cannot guarantee the enforcement of access controls on ar...
Experiment warning when legacy artifact storage location is used
Problem A new icon appears on the MLflow Experiments page with the following open access warning: Cause MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. In MLflow 1.11 and above, new experiments store artifacts in an MLflow-mana...
KNN model using pyfunc returns ModuleNotFoundError or FileNotFoundError
Problem You have created a Sklearn model using KNeighborsClassifier and are using pyfunc to run a prediction. For example: %python import mlflow.pyfunc pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type='string') predicted_df = merge.withColumn("prediction", pyfunc_udf(*merge.columns[1:])) predicted_df.collect() The predict...
OSError when accessing MLflow experiment artifacts
Problem You get an OSError: No such file or directory error message when trying to download or log artifacts using one of the following: MlflowClient.download_artifacts() mlflow.[flavor].log_model() mlflow.[flavor].load_model() mlflow.log_artifacts() OSError: No such file or directory: '/dbfs/databricks/mlflow-tracking/<experiment-id>/<run-...
PERMISSION_DENIED error when accessing MLflow experiment artifact
Problem You get a PERMISSION_DENIED error when trying to access an MLflow artifact using the MLflow client. RestException: PERMISSION_DENIED: User <user> does not have permission to 'View' experiment with id <experiment-id> or RestException: PERMISSION_DENIED: User <user> does not have permission to 'Edit' experiment with id <ex...
Python commands fail on Machine Learning clusters
Problem You are using a Databricks Runtime for Machine Learning cluster and Python notebooks are failing. You find an invalid syntax error in the logs. SyntaxError: invalid syntax File "/local_disk0/tmp/1593092990800-0/PythonShell.py", line 363 def __init__(self, *args, condaMagicHandler=None, **kwargs): Cause Key values in the /etc/environmen...
Runs are not nested when SparkTrials is enabled in Hyperopt
Problem SparkTrials is an extension of Hyperopt, which allows runs to be distributed to Spark workers. When you start an MLflow run with nested=True in the worker function, the results are supposed to be nested under the parent run. Sometimes the results are not correctly nested under the parent run, even though you ran SparkTrials with nested=True ...
MLflow 'invalid access token' error
Problem You have long-running MLflow tasks in your notebook or job and the tasks are not completed. Instead, they return a (403) Invalid access token error message. Error stack trace: MlflowException: API request to endpoint /api/2.0/mlflow/runs/create failed with error code 403 != 200. Response body: '<html> <head> <meta data-fr-htt...
Autoscaling is slow with an external metastore
Problem You have an external metastore configured on your cluster and autoscaling is enabled, but the cluster is not autoscaling effectively. Cause You are copying the metastore jars to every executor, when they are only needed in the driver. It takes time to initialize and run the jars every time a new executor spins up. As a result, adding more ex...
Data too long for column error
Problem You are trying to insert a struct into a table, but you get a java.sql.SQLException: Data too long for column error. Caused by: java.sql.SQLException: Data too long for column 'TYPE_NAME' at row 1 Query is: INSERT INTO COLUMNS_V2 (CD_ID,COMMENT,`COLUMN_NAME`,TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?) , parameters [103182,<null>,'address...
Drop database without deletion
By default, the DROP DATABASE (AWS | Azure | GCP) command drops the database and deletes the directory associated with the database from the file system. Sometimes you may want to drop the database, but keep the underlying database directory intact. Example code You can use this example code to drop the database without dropping the underlying stora...
How to create table DDLs to import into an external metastore
Databricks supports using external metastores instead of the default Hive metastore. You can export all table metadata from Hive to the external metastore. Use the Apache Spark Catalog API to list the tables in the databases contained in the metastore. Use the SHOW CREATE TABLE statement to generate the DDLs and store them in a file. Use the file to...
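A minimal sketch of that export loop (the output path is illustrative):
%python
# Walk every database and table in the metastore, generate its DDL with
# SHOW CREATE TABLE, and append the statements to a file on DBFS.
ddl_output = "/dbfs/tmp/illustrative/table_ddls.sql"

with open(ddl_output, "w") as out:
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            if tbl.isTemporary:
                continue  # skip temp views; they are not in the metastore
            ddl = spark.sql(f"SHOW CREATE TABLE {db.name}.{tbl.name}").first()[0]
            out.write(ddl + ";\n\n")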
Drop tables with corrupted metadata from the metastore
Problem Sometimes you cannot drop a table from the Databricks UI. Using %sql or spark.sql to drop the table doesn’t work either. Cause The metadata (table schema) stored in the metastore is corrupted. When you run the DROP TABLE command, Spark checks whether the table exists before dropping it. Since the metadata is corrupted for the table, Spark c...
Error in CREATE TABLE with external Hive metastore
Problem You are connecting to an external MySQL metastore and attempting to create a table when you get an error. AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:An exception was thrown while adding/validating class(es) : (conn=21) Column length too big for column 'PARAM_VALUE' (max = 16383); use BLOB or TE...
AnalysisException when dropping table on Azure-backed metastore
Problem When you try to drop a table in an external Hive version 2.0 or 2.1 metastore that is deployed on Azure SQL Database, Databricks throws the following exception: com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(...
How to troubleshoot several Apache Hive metastore problems
Problem 1: External metastore tables not available When you inspect the driver logs, you see a stack trace that includes the error Required table missing: WARN Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MDatabase and subclasses resulted in no possible candidates Required table missing: "DBS" in Catalog "" Schema "". DataNu...
Listing table names
Problem To fetch all the table names from the metastore you can use either spark.catalog.listTables() or %sql show tables. If you compare how long each takes, you can see that spark.catalog.listTables() usually takes longer than %sql show tables. Cause spark.catalog.listTables() tries to fetch every table’s metadata first and then show the reque...
How to set up an embedded Apache Hive metastore
You can set up a Databricks cluster to use an embedded metastore. You can use an embedded metastore when you only need to retain table metadata during the life of the cluster. If the cluster is restarted, the metadata is lost. If you need to persist the table metadata or other data after a cluster restart, then you should use the default metastore o...
Japanese character support in external metastore
Problem You are trying to use Japanese characters in your tables, but keep getting errors. Create a table with the OPTIONS keyword. OPTIONS provides extra metadata to the table. You try creating a table with OPTIONS and specify the charset as utf8mb4. %sql CREATE TABLE default.JPN_COLUMN_NAMES('作成年月' string ,'計上年月' string ,'所属コード' string ,'生保代理店コード_8...
Parquet timestamp requires Hive metastore 1.2 or above
Problem You are trying to create a Parquet table using TIMESTAMP, but you get an error message. Error in SQL statement: QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Parquet does not support timestamp. See HIVE-6384 Example code %sql CREATE EXTERN...
How to explore Apache Spark metrics with Spark listeners
Apache Spark provides several useful internal listeners that track metrics about tasks and jobs. During the development cycle, for example, these metrics can help you to understand when and why a task takes a long time to finish. Of course, you can leverage the Spark UI or History UI to see information for each task and stage, but there are some dow...
How to use Apache Spark metrics
This article gives an example of how to monitor Apache Spark components using the Spark configurable metrics system. Specifically, it shows how to set a new source and enable a sink. For detailed information about the Spark components available for metrics collection, including sinks supported out of the box, follow the documentation link above. Inf...
Access S3 with temporary session credentials
You can use IAM session tokens with Hadoop config support to access S3 storage in Databricks Runtime 8.3 and above. Info You cannot mount the S3 path as a DBFS mount when using session credentials. You must use the S3A URI. Extract the session credentials from your cluster Extract the session credentials from your cluster. You will need the Instance...
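As a hedged sketch (property names follow the open-source Hadoop S3A connector; the credential values and bucket are placeholders), set the S3A temporary-credentials provider and keys on the Hadoop configuration, then read with an s3a:// URI:
%python
# Configure the S3A connector to use the session credentials extracted
# from the cluster's instance profile.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set(
    "fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
)
hadoop_conf.set("fs.s3a.access.key", "<access-key>")
hadoop_conf.set("fs.s3a.secret.key", "<secret-key>")
hadoop_conf.set("fs.s3a.session.token", "<session-token>")

# Read using the S3A URI (mounting is not supported with session credentials).
df = spark.read.json("s3a://<bucket-name>/<path>/")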
Cannot use IAM roles with table ACL
Problem You want to use IAM roles when table ACLs are enabled, but you get an error saying credentials cannot be located. NoCredentialsError: Unable to locate credentials Cause When a table ACL is enabled, access to the EC2 instance metadata service is blocked. This is a security measure that prevents users from obtaining IAM access credentials. Sol...
Enable s3cmd for notebooks
s3cmd is a client library that allows you to perform all AWS S3 operations from any machine. s3cmd is not installed on Databricks clusters by default. You must install it via a cluster-scoped init script before it can be used. Info The sample init script stores the path to a secret in an environment variable. You should store secrets in this fashion...
How to check if a spark property is modifiable in a notebook
Problem You can tune applications by setting various configurations. Some configurations must be set at the cluster level, whereas some are set inside notebooks or applications. Solution To check if a particular Spark configuration can be set in a notebook, run the following command in a notebook cell: %scala spark.conf.isModifiable("spark.databrick...
JSON reader parses values as null
Problem You are attempting to read a JSON file. You know the file has data in it, but the Apache Spark JSON reader is returning a null value. Example code You can use this example code to reproduce the problem. Create a test JSON file in DBFS.%python dbutils.fs.rm("dbfs:/tmp/json/parse_test.txt") dbutils.fs.put("dbfs:/tmp/json/parse_test.txt", """ {...
Common errors in notebooks
There are some common issues that occur when using notebooks. This section outlines some of the frequently asked questions and best practices that you should follow. Spark job fails with java.lang.NoClassDefFoundError Sometimes you may come across an error like: %scala java.lang.NoClassDefFoundError: Could not initialize class line.....$read$ This c...
display() does not show microseconds correctly
Problem You want to display a timestamp value with microsecond precision, but when you use display() it does not show the value past milliseconds. For example, this Apache Spark SQL display() command: %python display(spark.sql("select cast('2021-08-10T09:08:56.740436' as timestamp) as test")) Returns a truncated value: 2021-08-10T09:08:56.740+0000 Caus...
Error: Received command c on object id p0
Problem You have imported Python libraries, but when you try to execute Python code in a notebook you get a repeating message as output. INFO:py4j.java_gateway:Received command c on object id p0 INFO:py4j.java_gateway:Received command c on object id p0 INFO:py4j.java_gateway:Received command c on object id p0 INFO:py4j.java_gateway:Received command ...
Failure when accessing or mounting storage
Problem You are trying to access an existing mount point, or create a new mount point, and it fails with an error message. Invalid Mount Exception:The backend could not get tokens for path /mnt. Cause The root mount path (/mnt) is also mounted to a storage location. You can verify that something is mounted to the root path by listing all mount point...
Item was too large to export
Problem You are trying to export notebooks using the workspace UI and are getting an error message. This item was too large to export. Try exporting smaller or fewer items. Cause The notebook files are larger than 10 MB in size. Solution The simplest solution is to limit the size of the notebook or folder that you are trying to download to 10 MB or ...
Access notebooks owned by a deleted user
When you remove a user (AWS | Azure) from Databricks, a special backup folder is created in the workspace. This backup folder contains all of the deleted user’s content. Backup folders appear in the workspace as <deleted username>-backup-#. Info Only an admin user can access a backup folder. To access a backup folder: Log into Databricks as an...
Notebook autosave fails due to file size limits
Problem Notebook autosaving fails with the following error message: Failed to save revision: Notebook size exceeds limit. This is most commonly caused by cells with large results. Remove some cells or split the notebook. Cause The maximum notebook size allowed for autosaving is 8 MB. Solution First, check the size of your notebook file using your br...
How to send email or SMS messages from Databricks notebooks
You may need to send a notification to a set of recipients from a Databricks notebook. For example, you may want to send email based on matching business rules or based on a command’s success or failure. This article describes two approaches to sending email or SMS messages from a notebook. Both examples use Python notebooks: Send email or SMS messa...
Cannot run notebook commands after canceling streaming cell
Problem After you cancel a running streaming cell in a notebook attached to a Databricks Runtime 5.0 cluster, you cannot run any subsequent commands in the notebook. The commands are left in the “waiting to run” state, and you must clear the notebook’s state or detach and reattach the cluster before you can successfully run commands on the notebook....
Troubleshooting unresponsive Python notebooks or canceled commands
This article provides an overview of troubleshooting steps you can take if a notebook is unresponsive or cancels commands. Check metastore connectivity Problem Simple commands in newly-attached notebooks fail, but succeed in notebooks that were attached to the same cluster earlier. Troubleshooting steps Check metastore connectivity. The inability to...
Update job permissions for multiple users
When you are running jobs, you might want to update user permissions for multiple users. You can do this by using the Databricks job permissions API (AWS | Azure | GCP) and a bit of Python code. Instructions Copy the example code into a notebook. Enter the <job-id> (or multiple job ids) into the array arr[]. Enter your payload{}. In this examp...
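A hedged sketch of the idea (the endpoint follows the public Permissions API; the workspace URL, token, job IDs, user emails, and permission level are placeholders you would replace with your own payload):
%python
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"  # placeholder
TOKEN = "<token>"  # placeholder personal access token

job_ids = [1234, 5678]  # the jobs to update
users = ["user1@example.com", "user2@example.com"]  # the users to grant

payload = {
    "access_control_list": [
        {"user_name": u, "permission_level": "CAN_MANAGE_RUN"} for u in users
    ]
}

# PATCH adds or updates the listed grants without replacing existing ones.
for job_id in job_ids:
    resp = requests.patch(
        f"{DATABRICKS_INSTANCE}/api/2.0/permissions/jobs/{job_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()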
Service principal cannot create access token
Problem You are trying to create a token on behalf of a service principal, using /2.0/token-management/on-behalf-of/tokens in the REST API, but are getting a PERMISSION_DENIED error. { "error_code": "PERMISSION_DENIED", "message": "User xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx does not have permission to use tokens." } Cause This happens when the servic...
Generate browser HAR files
When troubleshooting UI issues, it is sometimes necessary to obtain additional information about the network requests that are generated in your browser. If this is needed, our support team will ask you to generate a HAR file. This article describes how to generate a HAR file with each of the major web browsers. Warning HAR files contain sensitive d...
Recover deleted notebooks from the Trash
Get workspace configuration details
This article explains how to display the complete configuration details for your Databricks workspace. This can be useful if you want to review the configuration settings and services that are enabled in your workspace. For example, you can use the workspace configuration details to quickly see if Unity Catalog or Identity Federation is enabled on y...
Iterate through all jobs in the workspace using Jobs API 2.1
In the Databricks Jobs API 2.0 (AWS | Azure | GCP), the list request returns an unbounded number of job descriptions. In the Jobs API 2.1 (AWS | Azure | GCP), this behavior has changed. The list command now returns a maximum of 25 jobs, from newest to oldest, at a time. In this article we show you how to manually iterate through all of the jobs in your workspace...
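A hedged sketch of the iteration (assuming the offset/limit parameters and has_more flag of the Jobs API 2.1 list response; the workspace URL and token are placeholders):
%python
import requests

DATABRICKS_INSTANCE = "https://<databricks-instance>"  # placeholder
TOKEN = "<token>"  # placeholder personal access token

all_jobs, offset = [], 0
while True:
    resp = requests.get(
        f"{DATABRICKS_INSTANCE}/api/2.1/jobs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"limit": 25, "offset": offset},  # 25 is the per-page maximum
    )
    resp.raise_for_status()
    page = resp.json()
    all_jobs.extend(page.get("jobs", []))
    if not page.get("has_more"):
        break  # no more pages
    offset += 25

print(f"Found {len(all_jobs)} jobs")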
Too many execution contexts are open right now
Problem You see the following error message when you try to attach a notebook to a cluster, or when a job fails. Run result unavailable: job failed with error message Too many execution contexts are open right now.(Limit set currently to 150) Cause Databricks creates an execution context when you attach a notebook to a cluster. The execution cont...
Forbidden error while accessing S3 data
Problem While trying to access S3 data using DBFS mount or directly in Spark APIs, the command fails with an exception similar to the following: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; Request ID: XXXXXXXXXXXXX, Extended Request ID: XXXXXXXXXXXXXXXXXXX, Cloud Provider: AWS, Instance ID: XXXXXXXXXX (Service: Amazon S3; Status Co...
Table creation fails with security exception
Problem You attempt to create a table using a cluster that has Table ACLs enabled, but the following error occurs: Error in SQL statement: SecurityException: User does not have permission SELECT on any file. Cause This error occurs on a Table ACL-enabled cluster if you are not an administrator and you do not have sufficient privileges to create a ta...
Set an unlimited lifetime for service principal access token
Info This article applies to clusters using Databricks Runtime 11.2 and above on AWS. You are using Databricks service principals to delegate permissions to automated tools and systems. In order to secure the service principals, Databricks recommends creating an access token for the service principal. Please review the create a Databricks access tok...
Connection retries take a long time to fail
Problem You are trying to access a table on a remote HDFS location or an object store that you do not have permission to access. The SELECT command should fail, and it does, but it does not fail quickly. It can take up to ten minutes, sometimes more, to return a ConnectTimeoutException error message. The error message that is eventually returned is: " ...
Search audit logs for connections from prohibited IP addresses
IP access lists can be used to restrict access to Databricks based on known network locations. Once enabled, an IP access list requires users to log in from an allowed address. If a user attempts to log in from any IP address not on the access list, the login is denied. Review the IP access list documentation for more details. Best practices involve pe...
Append output is not supported without a watermark
Problem You are performing an aggregation using append mode and an exception error message is returned. Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark Cause You cannot use append mode on an aggregated DataFrame without a watermark. This is by design. Solution You must apply a...
Apache Spark DStream is not supported
Problem You are attempting to use a Spark Discretized Stream (DStream) in a Databricks streaming job, but the job is failing. Cause DStreams and the DStream API are not supported by Databricks. Solution Instead of using Spark DStream, you should migrate to Structured Streaming. Review the Databricks Structured Streaming in production (AWS | Azure | ...
Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
When you stream data into a file sink, you should always change both checkpoint and output directories together. Otherwise, you can get failures or unexpected outputs. Apache Spark creates a folder inside the output directory named _spark_metadata. This folder contains write-ahead logs for every batch run. This is how Spark gets exactly-once guarant...
Get the path of files consumed by Auto Loader
When you process streaming files with Auto Loader (AWS | Azure | GCP), events are logged based on the files created in the underlying storage. This article shows you how to add the file path for every filename to a new column in the output DataFrame. One use case for this is auditing. When files are ingested to a partitioned folder structure there i...
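A minimal sketch of adding the file path column to an Auto Loader stream (the source path, file format, and schema location are placeholders):
%python
from pyspark.sql.functions import input_file_name

# Auto Loader stream with an extra column recording each record's source file.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/illustrative/schema")
    .load("s3://<bucket-name>/<input-path>/")
    .withColumn("source_file", input_file_name())
)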
How to set up Apache Kafka on Databricks
This article explains how to set up Apache Kafka on AWS EC2 machines and connect them with Databricks. The following are the high-level steps required to create a Kafka cluster and connect to it from Databricks notebooks. Step 1: Create a new VPC in AWS When creating the new VPC, set the new VPC CIDR range different from the Databricks VPC CIDR range...
Handling partition column values while using an SQS queue as a streaming source
Problem If data in S3 is stored by partition, the partition column values are used to name folders in the source directory structure. However, if you use an SQS queue as a streaming source, the S3-SQS source cannot detect the partition column values. For example, if you save the following DataFrame to S3 in JSON format: %scala val df = spark.range(1...
How to restart a structured streaming query from last written offset
Scenario You have a stream, running a windowed aggregation query, that reads from Apache Kafka and writes files in Append mode. You want to upgrade the application and restart the query with the offset equal to the last written offset. You want to discard all state information that hasn’t been written to the sink, start processing from the earliest ...
How to switch a SNS streaming job to a new SQS queue
Problem You have a Structured Streaming job running via the S3-SQS connector. Suppose you want to recreate the source SQS, backed by SNS data, and you want to proceed with a new queue to be processed in the same job and in the same output directory. Solution Use the following procedure: Create new SQS queues and subscribe to s3-events (from SNS). At...
Kafka error: No resolvable bootstrap urls
Problem You are trying to read or write data to a Kafka stream when you get an error message. kafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer Caused by: kafkashaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers If you are running a notebook, the error me...
readStream() is not whitelisted error when running a query
Problem You have table access control (AWS | Azure | GCP) enabled on your cluster. You are trying to run a structured streaming query and get an error message. py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.s...
Checkpoint files not being deleted when using display()
Problem You have a streaming job using display() to display DataFrames. %scala val streamingDF = spark.readStream.schema(schema).parquet(<input_path>) display(streamingDF) Checkpoint files are being created, but are not being deleted. You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder. Ch...
Checkpoint files not being deleted when using foreachBatch()
Problem You have a streaming job using foreachBatch() to process DataFrames. %scala streamingDF.writeStream.outputMode("append").foreachBatch { (batchDF: DataFrame, batchId: Long) => batchDF.write.format("parquet").mode("overwrite").save(output_directory) }.start() Checkpoint files are being created, but are not being deleted. You can verify th...
Conflicting directory structures error
Problem You have an Apache Spark job that is failing with a Java assertion error java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Example stack trace Caused by: org.apache.spark.sql.streaming.StreamingQueryException: There was an error when trying to infer the partition schema of the current batch of files. Plea...
RocksDB fails to acquire a lock
Problem You are trying to use RocksDB as a state store for your structured streaming application, when you get an error message saying that the instance could not be acquired. Caused by: java.lang.IllegalStateException: RocksDB instance could not be acquired by [ThreadId: 742, task: 140.3 in stage 3152, TID 553193] as it was not released by [ThreadI...
Stream XML files using an auto-loader
Apache Spark does not include a streaming API for XML files. However, you can combine the auto-loader features of the Spark batch API with the OSS library, Spark-XML, to stream XML files. In this article, we present a Scala based solution that parses XML data using an auto-loader. Install Spark-XML library You must install the Spark-XML OSS library ...
Streaming job using Kinesis connector fails
Problem You have a streaming job writing to a Kinesis sink, and it is failing with out of memory error messages. java.lang.OutOfMemoryError: GC Overhead limit exceeded java.lang.OutOfMemoryError: Java heap space. Symptoms include: Ganglia shows a gradual increase in JVM memory usage. Microbatch analysis shows input and processing rates are consisten...
Streaming job gets stuck writing to checkpoint
Problem You are monitoring a streaming job, and notice that it appears to get stuck when processing data. When you review the logs, you discover the job gets stuck when writing data to a checkpoint. INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/st...
Explicit path to data or a defined schema required for Auto loader
Info This article applies to Databricks Runtime 9.1 LTS and above. Problem You are using Auto Loader to ingest data for your ELT pipeline when you get an IllegalArgumentException: Please provide the source directory path with option `path` error message. You get this error when you start an Auto Loader job, if either the path to the data or the data...
Optimize streaming transactions with .trigger
When running a structured streaming application that uses cloud storage buckets (S3, ADLS Gen2, etc.) it is easy to incur excessive transactions as you access the storage bucket. Failing to specify a .trigger option in your streaming code is one common reason for a high number of storage transactions. When a .trigger option is not specified, the sto...
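A minimal sketch of specifying an explicit trigger (streaming_df and the checkpoint/output paths are placeholders):
%python
# Poll the source once per minute instead of as fast as possible, which
# reduces the number of list/read requests against the storage bucket.
(
    streaming_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/illustrative/checkpoint")
    .trigger(processingTime="60 seconds")
    .start("/tmp/illustrative/output")
)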
Structured streaming jobs slow down on every 10th batch
Problem You are running a series of structured streaming jobs and writing to a file sink. Every 10th run appears to run slower than the previous jobs. Cause The file sink creates a _spark_metadata folder in the target path. This metadata folder stores information about each batch, including which files are part of the batch. This is required to prov...
Get last modification time for all files in Auto Loader and batch jobs
You are running a streaming job with Auto Loader (AWS | Azure | GCP) and want to get the last modification time for each file from the storage account. Instructions The Get the path of files consumed by Auto Loader article describes how to get the filenames and paths for all files consumed by the Auto Loader. In this article, we build on that founda...
How to save Plotly files and display from DBFS
You can save a chart generated with Plotly to the driver node as a jpg or png file. Then, you can display it in a notebook by using the displayHTML() method. By default, you save Plotly charts to the /databricks/driver/ directory on the driver node in your cluster. Use the following procedure to display the charts at a later time. Generate a sample ...
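A hedged Python sketch of that workflow; plotly.express, the hypothetical FileStore target path, and the kaleido package (needed by write_image) are assumptions, not part of the article above:
%python
import plotly.express as px

fig = px.scatter(x=[1, 2, 3], y=[4, 1, 9])

# Writing a static png requires the kaleido package to be installed on the cluster.
fig.write_image("/databricks/driver/plotly_sample.png")

# Copy the image to FileStore so the browser can fetch it, then embed it with displayHTML().
dbutils.fs.cp("file:/databricks/driver/plotly_sample.png", "dbfs:/FileStore/plotly_sample.png")
displayHTML("<img src='/files/plotly_sample.png'>")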
AttributeError: ‘function’ object has no attribute
Problem You are selecting columns from a DataFrame and you get an error message. ERROR: AttributeError: 'function' object has no attribute '_get_object_id' in job Cause The DataFrame API contains a small number of protected keywords. If a column in your DataFrame uses a protected keyword as the column name, you will get an error message. For example...
Convert Python datetime object to string
There are multiple ways to display date and time values with Python; however, not all of them are easy to read. For example, when you collect a timestamp column from a DataFrame and save it as a Python variable, the value is stored as a datetime object. If you are not familiar with the datetime object format, it is not as easy to read as the common Y...
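For example, strftime() turns a datetime object into the familiar layout; the sample value below is made up:
%python
from datetime import datetime

# Stand-in for a value collected from a DataFrame timestamp column.
ts = datetime(2021, 8, 24, 14, 30, 15)

# Format the datetime object as a readable string.
print(ts.strftime("%Y-%m-%d %H:%M:%S"))   # 2021-08-24 14:30:15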
Create a cluster with Conda
Conda is a popular open source package management system for the Anaconda repo. Databricks Runtime for Machine Learning (Databricks Runtime ML) uses Conda to manage Python library dependencies. If you want to use Conda, you should use Databricks Runtime ML. Attempting to install Anaconda or Conda for use with Databricks Runtime is not supported. Fol...
Display file and directory timestamp details
In this article we show you how to display detailed timestamps, including the date and time when a file was created or modified. Use ls command The simplest way to display file timestamps is to use the ls -lt <path> command in a bash shell. For example, this sample command displays basic timestamps for files and directories in the /dbfs/ folde...
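As a complement to the ls approach, a small Python sketch (the folder path is hypothetical) that prints a modification timestamp per entry using standard library calls against the /dbfs/ FUSE mount:
%python
import os
from datetime import datetime

path = "/dbfs/FileStore"   # hypothetical folder; any /dbfs/ path works on the driver
for name in os.listdir(path):
    mtime = datetime.fromtimestamp(os.path.getmtime(os.path.join(path, name)))
    print(name, mtime.strftime("%Y-%m-%d %H:%M:%S"))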
Install and compile Cython
This document explains how to run Spark code with compiled Cython code. The steps are as follows: Create an example Cython module on DBFS (AWS | Azure). Add the file to the Spark session. Create a wrapper method to load the module on the executors. Run the mapper on a sample dataset. Generate a larger dataset and compare the performance with nat...
Reading large DBFS-mounted files using Python APIs
This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs. Problem If you mount a folder onto dbfs:// and read a file larger than 2GB in a Python API like pandas, you will see the following error: /databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextRead...
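One common workaround, sketched below with hypothetical file names, is to copy the file from the DBFS mount to the driver's local disk and read it from there, which avoids the local file API size limit:
%python
import pandas as pd

# Hypothetical source and destination paths.
dbutils.fs.cp("dbfs:/mnt/data/large_file.csv", "file:/tmp/large_file.csv")

# Read from local disk instead of the DBFS mount.
df = pd.read_csv("/tmp/large_file.csv")
print(df.shape)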
Use the HDFS API to read files in Python
There may be times when you want to read files directly without using third party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts. AWS Use the following example code for S3 bucket storage. %python URI = sc._gateway.jvm.java.net.URI Path = sc._gateway.jvm.org.apa...
How to import a custom CA certificate
When working with Python, you may want to import a custom CA certificate to avoid connection errors to your endpoints. ConnectionError: HTTPSConnectionPool(host='my_server_endpoint', port=443): Max retries exceeded with url: /endpoint (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb73dc3b3d0>: Failed t...
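A hedged requests-based sketch; the endpoint and certificate bundle path are placeholders, and the bundle must contain your custom CA:
%python
import os
import requests

ca_bundle = "/dbfs/FileStore/certs/my_ca_bundle.pem"   # hypothetical path to your CA bundle

# Option 1: point a single call at the bundle.
resp = requests.get("https://my_server_endpoint/endpoint", verify=ca_bundle)
print(resp.status_code)

# Option 2: export REQUESTS_CA_BUNDLE so every requests call in the session uses it.
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle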
Job remains idle before starting
Problem You have an Apache Spark job that is triggered correctly, but remains idle for a long time before starting. Alternatively, you have a Spark job that ran well for a while, but goes idle for a long time before resuming. Symptoms include: Cluster downscales to the minimum number of worker nodes during idle time. Driver logs don’t show any Spark jobs during idl...
List all workspace objects
You can use the Databricks Workspace API (AWS | Azure | GCP) to recursively list all workspace objects under a given path. Common use cases for this include: Indexing all notebook names and types for all users in your workspace. Use the output, in conjunction with other API calls, to delete unused workspaces or to manage notebooks. Dynamically get t...
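A minimal recursion sketch against the Workspace API; the host, secret scope, and starting path below are placeholders:
%python
import requests

HOST = "https://<databricks-instance>"                   # your workspace URL
TOKEN = dbutils.secrets.get("my-scope", "api-token")     # personal access token stored in a secret

def list_objects(path):
    resp = requests.get(f"{HOST}/api/2.0/workspace/list",
                        headers={"Authorization": f"Bearer {TOKEN}"},
                        params={"path": path})
    for obj in resp.json().get("objects", []):
        if obj["object_type"] == "DIRECTORY":
            list_objects(obj["path"])        # recurse into folders
        else:
            print(obj["object_type"], obj["path"])

list_objects("/Users")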
Load special characters with Spark-XML
Problem You have special characters in your source files and are using the OSS library Spark-XML. The special characters do not render correctly. For example, “CLU®” is rendered as “CLU�”. Cause Spark-XML supports the UTF-8 character set by default. You are using a different character set in your XML files. Solution You must specify the character se...
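For example, a hedged PySpark read that assumes the Spark-XML library is installed, a hypothetical rowTag, and ISO-8859-1 source files:
%python
df = (spark.read
      .format("xml")
      .option("rowTag", "product")         # hypothetical row tag
      .option("charset", "ISO-8859-1")     # match the character set of your XML files
      .load("dbfs:/mnt/raw/products.xml")) # hypothetical path
display(df)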
Python commands fail on high concurrency clusters
Problem You are attempting to run Python commands on a high concurrency cluster. All Python commands fail with a WARN error message. WARN PythonDriverWrapper: Failed to start repl ReplId-61bef-9fc33-1f8f6-2 ExitCodeException exitCode=1: chown: invalid user: ‘spark-9fcdf4d2-045d-4f3b-9293-0f’ Cause Both spark.databricks.pyspark.enableProcessIsolation...
Cluster cancels Python command execution after installing Bokeh
Problem The cluster returns Cancelled in a Python notebook. Inspect the driver log (std.err) in the Cluster Configuration page for a stack trace and error message similar to the following: log4j:WARN No appenders could be found for logger (com.databricks.conf.trusted.ProjectConf$). log4j:WARN Please initialize the log4j system properly. log4j:WARN S...
Cluster cancels Python command execution due to library conflict
Problem The cluster returns Cancelled in a Python notebook. Notebooks in all other languages execute successfully on the same cluster. Cause When you install a conflicting version of a library, such as ipython, ipywidgets, numpy, scipy, or pandas, to the PYTHONPATH, the Python REPL can break, causing all commands to return Cancelled after 30 sec...
Python command execution fails with AttributeError
This article can help you resolve scenarios in which Python command execution fails with an AttributeError. Problem: 'tuple' object has no attribute 'type' When you run a notebook, Python command execution fails with the following error and stack trace: AttributeError: 'tuple' object has no attribute 'type' Traceback (most recent call last): File "/...
Python REPL fails to start in Docker
Problem When you use a Docker container that includes prebuilt Python libraries, Python commands fail and the virtual environment is not created. The following error message is visible in the driver logs. 20/02/29 16:38:35 WARN PythonDriverWrapper: Failed to start repl ReplId-5b591-0ce42-78ef3-7 java.io.IOException: Cannot run program "/local_disk0/...
How to run SQL queries from Python scripts
You may want to access your tables outside of Databricks notebooks. Besides connecting BI tools via JDBC (AWS | Azure), you can also access tables by using Python scripts. You can connect to a Spark cluster via JDBC using PyHive and then run a script. You should have PyHive installed on the machine where you are running the Python script. Info Pytho...
Python 2 sunset status
Python.org officially moved Python 2 into EoL (end-of-life) status on January 1, 2020. What does this mean for you? Databricks Runtime 6.0 and above Databricks Runtime 6.0 and above support only Python 3. You cannot create a cluster with Python 2 using these runtimes. Any clusters created with these runtimes use Python 3 by definition. Databricks Ru...
Job fails with Java IndexOutOfBoundsException error
Problem Your job fails with a Java IndexOutOfBoundsException error message: java.lang.IndexOutOfBoundsException: index: 0, length: <number> (expected: range(0, 0)) When you review the stack trace you see something similar to this: Py4JJavaError: An error occurred while calling o617.count. : org.apache.spark.SparkException: Job aborted due to s...
Change version of R (r-base)
These instructions describe how to install a different version of R (r-base) on a cluster. You can check the default r-base version installed with each Databricks Runtime version in the System environment section of the corresponding Databricks Runtime release note (AWS | Azure | GCP). List available r-base-core versions To list the versions of r-base-co...
Fix the version of R packages
When you use the install.packages() function to install CRAN packages, you cannot specify the version of the package; the expectation is that you install the latest version of the package and that it is compatible with the latest versions of its dependencies. If you have an outdated dependency installed, it will be updated as well. Som...
How to parallelize R code with gapply
Parallelization of R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, there is existing R code that is run locally and that is converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed com...
How to parallelize R code with spark.lapply
Parallelization of R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often, there is existing R code that is run locally and that is converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed com...
How to persist and share code in RStudio
Problem Unlike a Databricks notebook, which has version control built in, code developed in RStudio is lost when the high concurrency cluster hosting RStudio is shut down. Solution To persist and share code in RStudio, do one of the following: From RStudio, save the code to a folder on DBFS that is accessible from both Databricks notebooks and RStudi...
Install rJava and RJDBC libraries
This article explains how to install the rJava and RJDBC libraries. Problem When you install the rJava and RJDBC libraries with the following command in a notebook cell: %r install.packages(c("rJava", "RJDBC")) You observe the following error: ERROR: configuration failed for package 'rJava' Cause The rJava and RJDBC packages check for Java dependencies and ...
Rendering an R markdown file containing sparklyr code fails
Problem After you install and configure RStudio in the Databricks environment, when you launch RStudio and click the Knit button to knit a Markdown file that contains code to initialize a sparklyr context, rendering fails with the following error: failed to start sparklyr backend:object 'DATABRICKS_GUID' not found Calls: <Anonymous>… tryCatch ...
Resolving package or namespace loading error
This article explains how to resolve a package or namespace loading error. Problem When you install and load some libraries in a notebook cell, like: %r library(BreakoutDetection) You may get a package or namespace error: Loading required package: BreakoutDetection: Error : package or namespace load failed for ‘BreakoutDetection’ in loadNamespace(i,...
RStudio server backend connection error
Problem You get a backend connection error when using RStudio server. Error in Sys.setenv(EXISTING_SPARKR_BACKEND_PORT = system(paste0("wget -qO - 'http://localhost:6061/?type=\"com.databricks.backend.common.rpc.DriverMessages$StartRStudioSparkRBackend\"' --post-data='{\"@class\":\"com.databricks.backend.common.rpc.DriverMessages$StartRStudioSparkRB...
Verify R packages installed via init script
When you configure R packages to install via an init script, it is possible for a package install to fail if dependencies are not installed. You can use R commands in a notebook to check that all of the packages installed correctly. Info This article does require you to provide a list of packages to check against. List installed packages Make a ...
Apache Spark UI is not in sync with job
Problem The status of your Spark jobs is not correctly shown in the Spark UI (AWS | Azure | GCP). Some of the jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank. When you review the driver logs, you see an AsyncEventQueue warning. Logs ===== 20/12/23 21:20:26 ...
Apache Spark job fails with Parquet column cannot be converted error
Problem You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted error message. The cluster is running Databricks Runtime 7.3 LTS or above. org.apache.spark.SparkException: Task failed while writing rows. Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://buc...
Best practice for cache(), count(), and take()
cache() is a lazy Apache Spark operation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or RDD in the memory of your cluster’s workers. Since cache() is lazy, the caching operation takes place only when a Spark action (for example, count(),...
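A short sketch of the pattern, using a synthetic DataFrame:
%python
df = spark.range(0, 1000000)

# cache() is lazy; nothing is stored yet.
df.cache()

# count() is an action, so it materializes the DataFrame into worker memory.
df.count()

# Later actions such as take() read from the cache instead of recomputing the DataFrame.
df.take(10)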
Cannot import timestamp_millis or unix_millis
Problem You are trying to import timestamp_millis or unix_millis into a Scala notebook, but get an error message. %scala import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} error: value timestamp_millis is not a member of object org.apache.spark.sql.functions import org.apache.spark.sql.functions.{timestamp_millis, unix_millis} Cau...
Cannot modify the value of an Apache Spark config
Problem You are trying to SET the value of a Spark config in a notebook and get a Cannot modify the value of a Spark config error. For example: %sql SET spark.serializer=org.apache.spark.serializer.KryoSerializer Error in SQL statement: AnalysisException: Cannot modify the value of a Spark config: spark.serializer; Cause The SET command does not wor...
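You can check which configs are session-modifiable before trying to SET them; configs that report False, such as spark.serializer, have to be set in the cluster's Spark config instead. A small sketch using the standard RuntimeConfig API:
%python
# False: must be set in the cluster's Spark config before startup.
print(spark.conf.isModifiable("spark.serializer"))

# True: can be changed at runtime with SET or spark.conf.set().
print(spark.conf.isModifiable("spark.sql.shuffle.partitions"))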
Convert flattened DataFrame to nested JSON
This article explains how to convert a flattened DataFrame to a nested structure by nesting a case class within another case class. You can use this technique to build a JSON file that can then be sent to an external API. Define nested schema We’ll start with a flattened DataFrame. Using this example DataFrame, we define a custom nested schema usi...
Convert nested JSON to a flattened DataFrame
This article shows you how to flatten nested JSON, using only $"column.*" and explode methods. Sample JSON file Pass the sample JSON string to the reader. %scala val json =""" { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": ...
Create a DataFrame from a JSON string or Python dictionary
In this article, we review how to create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary. Create a Spark DataFrame from a JSON string Add the JSON content from the variable to a list. %scala import scala.collection.mutable.ListBuffer val json_content1 = "{'json_col1': 'hello', 'json_col2': 32...
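A Python-flavored sketch of the same idea (the payloads are made up): distribute the JSON string as a one-element RDD and let Spark infer the schema; a dictionary can be serialized first with json.dumps:
%python
import json

json_string = '{"json_col1": "hello", "json_col2": 32}'
python_dict = {"json_col1": "hello", "json_col2": 32}

# From a JSON string.
df_from_string = spark.read.json(sc.parallelize([json_string]))

# From a Python dictionary, via a JSON string.
df_from_dict = spark.read.json(sc.parallelize([json.dumps(python_dict)]))

display(df_from_string)
display(df_from_dict)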
Decimal$DecimalIsFractional assertion error
Problem You are running a job on Databricks Runtime 7.x or above when you get a java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional error message. Example stack trace: java.lang.AssertionError: assertion failed: Decimal$DecimalIsFractional while compiling: <notebook> during phase: globalPhase=terminal, enteringPhase=j...
from_json returns null in Apache Spark 3.0
Problem The from_json function is used to parse a JSON string and return a struct of values. For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return. %python from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(c...
Intermittent NullPointerException when AQE is enabled
Problem You get an intermittent NullPointerException error when saving your data. Py4JJavaError: An error occurred while calling o2892.save. : java.lang.NullPointerException at org.apache.spark.sql.execution.adaptive.OptimizeSkewedJoin.$anonfun$getMapSizesForReduceId$1(OptimizeSkewedJoin.scala:167) at org.apache.spark.sql.execution.adaptive....
Manage the size of Delta tables
Delta tables are different from traditional tables. Delta tables include ACID transactions and time travel features, which means they maintain transaction logs and stale data files. These additional features require storage space. In this article, we discuss recommendations that can help you manage the size of your Delta tables. Enable file system ve...
Trouble reading external JDBC tables after upgrading from Databricks Runtime 5.5
Problem Attempting to read external tables via JDBC works fine on Databricks Runtime 5.5, but the same table reads fail on Databricks Runtime 6.0 and above. You see an error similar to the following: com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisExc...
Select files using a pattern match
When selecting files, a common requirement is to only read specific files from a folder. For example, if you are processing logs, you may want to read files from a specific month. Instead of enumerating each file and folder to find the desired files, you can use a glob pattern to match multiple files with a single expression. This article uses examp...
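For instance, with hypothetical log paths, a single glob expression can stand in for many files:
%python
# Match every file for one month.
df_may = spark.read.json("dbfs:/mnt/logs/2021-05-*.json")

# Match several months with brace alternation.
df_q2 = spark.read.json("dbfs:/mnt/logs/2021-{04,05,06}-*.json")

display(df_may)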
Multiple Apache Spark JAR jobs fail when run concurrently
Problem If you run multiple Apache Spark JAR jobs concurrently, some of the runs might fail with the error: org.apache.spark.sql.AnalysisException: Table or view not found: xxxxxxx; line 1 pos 48 Cause This error occurs due to a bug in Scala. When an object extends App, its val fields are no longer immutable and they can be changed when the main met...
Write a DataFrame with missing columns to a Redshift table
Problem When writing to Redshift tables, if the target table has more columns than the source Apache Spark DataFrame you may get a copy error. The COPY failed with error: [Amazon][Amazon Redshift] (1203) Error occurred while trying to execute a query: ERROR: Load into table table-name failed. Check the 'stl_load_errors' system table for details. “12...
Job fails with ExecutorLostFailure due to “Out of memory” error
Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...
Job fails with ExecutorLostFailure because executor is busy
Problem Job fails with an ExecutorLostFailure error message. ExecutorLostFailure (executor <1> exited caused by one of the running tasks) Reason: Executor heartbeat timed out after <148564> ms Cause The ExecutorLostFailure error message means one of the executors in the Apache Spark cluster has been lost. This is a generic error message ...
Understanding speculative execution
Speculative execution Speculative execution can be used to automatically re-attempt a task that is not making progress compared to other tasks in the same stage. This means if one or more tasks are running slower in a stage, they will be re-launched. The task that completes first is marked as successful. The other attempt gets killed. Implementatio...
Use custom classes and objects in a schema
Problem You are trying to create a dataset using a schema that contains Scala enumeration fields (classes and objects). When you run your code in a notebook cell, you get a ClassNotFoundException error. Sample code %scala object TestEnum extends Enumeration { type TestEnum = Value val E1, E2, E3 = Value } import spark.implicits._ import TestEnum._ c...
Broadcast join exceeds threshold, returns out of memory error
Problem You are attempting to join two large tables, projecting selected columns from the first table and all columns from the second table. Despite the total size exceeding the limit set by spark.sql.autoBroadcastJoinThreshold, BroadcastHashJoin is used and Apache Spark returns an OutOfMemorySparkException error. org.apache.spark.sql.execution.OutO...
Cannot grow BufferHolder; exceeds size limitation
Problem Your Apache Spark job fails with an IllegalArgumentException: Cannot grow BufferHolder error. java.lang.IllegalArgumentException: Cannot grow BufferHolder by size XXXXXXXXX because the size after growing exceeds size limitation 2147483632 Cause BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB). If a column value exceed...
Date functions only accept int values in Apache Spark 3.0
Problem You are attempting to use the date_add() or date_sub() functions in Spark 3.0, but they are returning an Error in SQL statement: AnalysisException error message. In Spark 2.4 and below, both functions work as normal. %sql select date_add(cast('1964-05-23' as date), '12.34') Cause You are attempting to use a fractional or string value as the ...
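Passing a whole number of days keeps the same query working on Spark 3.0; for example:
%python
# Integer day offsets are accepted; fractional or non-numeric strings are rejected.
display(spark.sql("SELECT date_add(CAST('1964-05-23' AS DATE), 12)"))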
Disable broadcast when query plan has BroadcastNestedLoopJoin
This article explains how to disable broadcast when the query plan has BroadcastNestedLoopJoin in the physical plan. You expect the broadcast to stop after you disable the broadcast threshold, by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark tries to broadcast the bigger table and fails with a broadcast error. This behavior is...
Duplicate columns in the metadata error
Problem Your Apache Spark job is processing a Delta table when the job fails with an error message. org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata update: col1, col2... Cause There are duplicate column names in the Delta table. Column names that differ only by case are considered duplicate. Delta Lake is case prese...
Generate unique increasing numeric values
This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three different methods to use. You should select the method that works best with your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD) The zipWithIndex() function is only available within RDDs. You cannot...
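Two of the options, sketched with a tiny synthetic DataFrame: monotonically_increasing_id() gives unique, increasing (but not consecutive) values, while zipWithIndex() on the underlying RDD gives consecutive 0-based indexes:
%python
from pyspark.sql.functions import monotonically_increasing_id

df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"])

# Unique and increasing, but with gaps between partitions.
display(df.withColumn("id", monotonically_increasing_id()))

# Consecutive 0-based index via the RDD API.
indexed = (df.rdd.zipWithIndex()
             .map(lambda pair: pair[0] + (pair[1],))
             .toDF(df.columns + ["index"]))
display(indexed)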
Error in SQL statement: AnalysisException: Table or view not found
Problem When you try to query a table or view, you get this error: AnalysisException:Table or view not found when trying to query a global temp view Cause You typically create global temp views so they can be accessed from different sessions and kept alive until the application ends. You can create a global temp view with the following statement: %s...
Error when downloading full results after join
Problem You are working with two tables in a notebook. You perform a join. You can preview the output, but when you try to Download full results you get an error. Error in SQL statement: AnalysisException: Found duplicate column(s) when inserting into dbfs:/databricks-results/ Reproduce error Create two tables.%python from pyspark.sql.functions impo...
Error when running MSCK REPAIR TABLE in parallel
Problem You are trying to run MSCK REPAIR TABLE <table-name> commands for the same table in parallel and are getting java.net.SocketTimeoutException: Read timed out or out of memory error messages. Cause When you try to add a large number of new partitions to a table with MSCK REPAIR in parallel, the Hive metastore becomes a limiting factor, a...
Find the size of a table
This article explains how to find the size of a table. The command used depends on whether you are trying to find the size of a Delta table or a non-Delta table. Size of a Delta table To find the size of a Delta table, you can use an Apache Spark SQL command. %scala import com.databricks.sql.transaction.tahoe._ val deltaLog = DeltaLog.forTable(spark, "dbf...
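DESCRIBE DETAIL on a Delta table also reports its size; a brief sketch with a hypothetical table path:
%python
detail = spark.sql("DESCRIBE DETAIL delta.`dbfs:/mnt/delta/my_table`")
display(detail.select("sizeInBytes", "numFiles"))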
Inner join drops records in result
Problem You perform an inner join, but the resulting joined table is missing data. For example, assume you have two tables, orders and models. %python df_orders = spark.createDataFrame([('Nissan','Altima','2-door 2.5 S Coupe'), ('Nissan','Altima','4-door 3.5 SE Sedan'), ('Nissan','Altima',''), ('Nissan','Altima', None)], ["Company", "Model", "Info"]...
Data is incorrect when read from Snowflake
Problem You have a job that is using Apache Spark to read from a Snowflake table, but the time data that appears in the DataFrame is incorrect. If you run the same query directly on Snowflake, the correct time data is returned. Cause The time zone value was not correctly set. A mismatch between the time zone value of the Databricks cluster and Snowf...
JDBC write fails with a PrimaryKeyViolation error
Problem You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error. Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables. Cause When Apache Spark performs a JDBC write, one par...
Query does not skip header row on external table
Problem You are attempting to query an external Hive table, but it keeps failing to skip the header row, even though TBLPROPERTIES ('skip.header.line.count'='1') is set in the HiveContext. You can reproduce the issue by creating a table with this sample code. %sql CREATE EXTERNAL TABLE school_test_score ( `school` varchar(254), `student_id` varc...
SHOW DATABASES command returns unexpected column name
Problem You are using the SHOW DATABASES command and it returns an unexpected column name. Cause The column name returned by the SHOW DATABASES command changed in Databricks Runtime 7.0. Databricks Runtime 6.4 Extended Support and below: SHOW DATABASES returns namespace as the column name. Databricks Runtime 7.0 and above: SHOW DATABASES returns dat...
Cannot view table SerDe properties
Problem You are trying to view the SerDe properties on an Apache Hive table, but SHOW CREATE TABLE just returns the Apache Spark DDL. It does not show the SerDe properties. For example, given this sample code: %sql SHOW CREATE TABLE <table-identifier> You get a result that does not show the SerDe properties: Cause You are using Databricks Runt...
Parsing post meridiem time (PM) with to_timestamp() returns null
Problem You are trying to parse a 12-hour (AM/PM) time value with to_timestamp(), but instead of returning a 24-hour time value it returns null. For example, this sample code: %sql SELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd HH:mm:ss a'); Returns null when run: Cause to_timestamp() requires the hour format to be in lowercase. If the ho...
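Switching the hour field to the 12-hour pattern letter resolves it; for example:
%python
# 'hh' is the 12-hour clock field, so it pairs correctly with the AM/PM marker 'a'.
display(spark.sql("SELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd hh:mm:ss a')"))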
to_json() results in Cannot use null as map key error
Problem You are using to_json() to convert data to JSON and you get a Cannot use null as map key error: RuntimeException: Cannot use null as map key. Cause to_json() function does not support using null values as the input map keys. This example code causes the Cannot use null as map key error when run, because of the null value used as a map key in...
Set nullability when using SaveAsTable with Delta tables
When creating a Delta table with saveAsTable, the nullability of columns defaults to true (columns can contain null values). This is expected behavior. In some cases, you may want to create a Delta table with the nullability of columns set to false (columns cannot contain null values). Instructions Use the CREATE TABLE command to create the table an...
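A minimal sketch of the approach, using a hypothetical table name: create the Delta table with an explicit NOT NULL column first, then append the DataFrame to it:
%python
spark.sql("""
  CREATE TABLE IF NOT EXISTS default.events_not_null (
    id BIGINT NOT NULL,
    name STRING
  ) USING DELTA
""")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Appending to the pre-created table keeps the NOT NULL constraint on id.
df.write.format("delta").mode("append").saveAsTable("default.events_not_null")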
Ensure consistency in statistics functions between Spark 3.0 and Spark 3.1 and above
Problem The statistics functions covar_samp, kurtosis, skewness, std, stddev, stddev_samp, variance, and var_samp, return NaN when a divide by zero occurs during expression evaluation in Databricks Runtime 7.3 LTS. The same functions return null in Databricks Runtime 9.1 LTS and above, as well as Databricks SQL endpoints when a divide by zero occur...
Using datetime values in Spark 3.0 and above
Problem You are migrating jobs from unsupported clusters running Databricks Runtime 6.6 and below with Apache Spark 2.4.5 and below to clusters running a current version of the Databricks Runtime. If your jobs and/or notebooks process date conversions, they may fail with a SparkUpgradeException error message after running them on upgraded clusters. ...
ANSI compliant DECIMAL precision and scale
Problem You are trying to cast a value of one or greater as a DECIMAL using equal values for both precision and scale. A null value is returned instead of the expected value. This sample code: %sql SELECT CAST (5.345 AS DECIMAL(20,20)) Returns: Cause The DECIMAL type (AWS | Azure | GCP) is declared as DECIMAL(precision, scale), where precision and s...
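Because scale counts digits to the right of the decimal point, DECIMAL(20,20) leaves no room for integer digits; reserving some precision for them returns the expected value:
%python
# Precision 20 with scale 5 leaves 15 digits before the decimal point.
display(spark.sql("SELECT CAST(5.345 AS DECIMAL(20, 5))"))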
Error when creating a user, group, or service principal at the account level with Terraform
Problem Unity Catalog uses Databricks account identities to resolve users, service principals, and groups, and to enforce permissions. These identities can be managed using Terraform. You are trying to create users, service principals, or groups at the account level when your Terraform code fails with a set `host` property error message. 2022-10-06T...
Members not supported SCIM provisioning failure
Problem You are using SCIM to provision new users on your Databricks workspace when you get a Members attribute not supported for current workspace error. StatusCode: BadRequest Message: Processing of the HTTP request resulted in an exception. Please see the HTTP response returned by the 'Response' property of this exception for details. Web Response: ...
Cannot delete Unity Catalog metastore using Terraform
Problem You cannot delete the Unity Catalog metastore using Terraform. Cause The default catalog is auto-created with a metastore. As a result, you cannot delete the metastore without first wiping the catalog. Solution Set force_destroy = true in the databricks_metastore section of the Terraform configuration to delete the metastore and the correspo...
Permission denied error when creating external location
Problem An external location is a storage location, such as an S3 bucket, on which external tables or managed tables can be created. A user or group with permission to use an external location can access any storage path within the external location without direct access to the storage credential. Review the Manage external locations and storage cre...
Overlapping paths error when querying both Hive and Unity Catalog tables
Problem You are running queries when you get an overlapping paths error message. org.apache.spark.SparkException: Your query is attempting to access overlapping paths through multiple authorization mechanisms, which is not currently supported. Cause An overlapping paths error happens when a single cell in a notebook queries both an Apache Hive table...