Enable OpenJSSE and TLS 1.3
Queries and transformations are encrypted before being sent to your clusters. By default, the data exchanged between worker nodes in a cluster is not encrypted. If you require that data be encrypted at all times, you can encrypt traffic between cluster worker nodes using AES 128 over a TLS 1.2 connection. In some cases, you may want to use TLS 1.3 i...
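For Apache Spark itself, inter-node TLS is governed by the standard spark.ssl.* properties. A minimal sketch of an init script that enables them follows; the script path, spark-defaults.conf location, keystore path, and cipher suite are assumptions you must adapt to your workspace.

```scala
// Sketch only, not the article's exact script. spark.ssl.* are standard
// Apache Spark properties; the conf path and keystore details are assumptions.
dbutils.fs.put("dbfs:/databricks/scripts/enable-inter-node-tls.sh", """#!/bin/bash
cat >> /databricks/spark/conf/spark-defaults.conf <<EOF
spark.ssl.enabled true
spark.ssl.protocol TLSv1.2
spark.ssl.enabledAlgorithms TLS_RSA_WITH_AES_128_CBC_SHA
spark.ssl.keyStore /local_disk0/keystore.jks
spark.ssl.keyStorePassword changeit
EOF
""", true)
```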
How to calculate the number of cores in a cluster
You can view the number of cores in a Databricks cluster in the Workspace UI using the Metrics tab on the cluster details page. Note Azure Databricks cluster nodes must have a metrics service installed. If the driver and executors are of the same node type, you can also determine the number of cores available in a cluster programmatically, using Sca...
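For example, a minimal sketch of the programmatic approach, assuming (per the note above) that the driver and executors use the same node type:

```scala
// Cores on this (driver) node; with a uniform node type this matches the workers.
val coresPerNode = java.lang.Runtime.getRuntime.availableProcessors
// getExecutorInfos includes the driver, so subtract one to count worker nodes.
val workerNodes = sc.statusTracker.getExecutorInfos.length - 1
val totalWorkerCores = coresPerNode * workerNodes
println(s"$workerNodes workers x $coresPerNode cores = $totalWorkerCores cores")
```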
Install a private PyPI repo
Certain use cases may require you to install libraries from private PyPI repositories. If you are installing from a public repository, you should review the library documentation. This article shows you how to configure an example init script that authenticates and downloads a PyPI library from a private repository. Create init script Create (or ver...
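A sketch of the init-script setup: write a cluster-scoped init script that points pip at a private index. The script path, repository URL, and credential placeholders below are assumptions; store real secrets in a secret scope rather than in the script.

```scala
// Sketch: pip reads /etc/pip.conf, so pointing index-url at the private
// repository makes every library install authenticate against it.
dbutils.fs.put("dbfs:/databricks/scripts/private-pypi.sh", """#!/bin/bash
cat > /etc/pip.conf <<EOF
[global]
index-url = https://<username>:<token>@pypi.example.com/simple
EOF
""", true)
```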
IP access list update returns INVALID_STATE
Problem You are trying to update an IP access list and you get an INVALID_STATE error message. {"error_code":"INVALID_STATE","message":"Your current IP 3.3.3.3 will not be allowed to access the workspace under current configuration"} Cause The IP access list update that you are trying to commit does not include your current public IP address. If you...
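A hedged sketch of the fix: include your current public IP when you PATCH the list, since the API rejects any update that would lock you out. The workspace host, token, and list ID below are placeholders.

```scala
import scala.sys.process._

// Update the list so it still contains your current IP (3.3.3.3 in the
// error above) alongside the entries you want to add.
val update = Seq("curl", "-X", "PATCH",
  "https://<workspace-host>/api/2.0/ip-access-lists/<list-id>",
  "-H", "Authorization: Bearer <token>",
  "-H", "Content-Type: application/json",
  "-d", """{"ip_addresses": ["1.2.3.4/32", "3.3.3.3/32"], "enabled": true}""")
println(update.!!)
```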
Launch fails with Client.InternalError
Problem You deploy a new E2 workspace, but you get cluster launch failures with the message Client.InternalError. Cause You have encryption of the EBS volumes at the AWS account level or you are using a custom KMS key for EBS encryption. Either one of these scenarios can result in a Client.InternalError when you try to create a cluster in an E2 work...
Cannot apply updated cluster policy
Problem You are attempting to update an existing cluster policy, but the update does not apply to the cluster associated with the policy. If you attempt to edit a cluster that is managed by a policy, the changes are not applied or saved. Cause This is a known issue that is being addressed. Solution You can use a workaround until a permanent fix ...
Cluster Apache Spark configuration not applied
Problem Your cluster’s Spark configuration values are not applied. Cause This happens when the Spark config values are declared in the cluster configuration as well as in an init script. When Spark config values are located in more than one place, the configuration in the init script takes precedence and the cluster ignores the configuration setting...
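To see which value actually won, read the effective configuration on the running cluster; a quick check (the property name is only an example):

```scala
// Compare the running value against what you set in the cluster configuration;
// if they differ, look for the same property in your init scripts.
println(spark.conf.get("spark.executor.memory"))
spark.conf.getAll.filter(_._1.startsWith("spark.executor")).foreach(println)
```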
Cluster failed to launch
This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs. Cluster timeout Error messages: Driver failed to start in time INTERNAL_ERROR: The Spark driver failed to start within 300 seconds Cluster failed to be healthy within 200 seconds Cau...
Custom Docker image requires root
Problem You are trying to launch a Databricks cluster with a custom Docker container, but cluster creation fails with an error. { "reason": { "code": "CONTAINER_LAUNCH_FAILURE", "type": "SERVICE_FAULT", "parameters": { "instance_id": "i-xxxxxxx", "databricks_error_message": "Failed to launch spark container on instance i-xxxx. Exception: Could not a...
Job fails due to cluster manager core instance request limit
Problem A Databricks Notebook or Job API returns the following error: Unexpected failure while creating the cluster for the job. Cause REQUEST_LIMIT_EXCEEDED: Your request was rejected due to API rate limit. Please retry your request later, or choose a larger node type instead. Cause The error indicates the Cluster Manager Service core instance requ...
Admin user cannot restart cluster to run job
Problem When a user who has permission to start a cluster, such as a Databricks Admin user, submits a job that is owned by a different user, the job fails with the following message: Message: Run executed on existing cluster ID <cluster id> failed because of insufficient permissions. The error received from the cluster manager was: 'You are no...
Cluster fails to start with dummy does not exist error
Problem You try to start a cluster, but it fails to start. You get an Apache Spark error message. Internal error message: Spark error: Driver down You review the cluster driver and worker logs and see an error message containing java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist. 21/07/14 21:44:06 ERROR DriverDaemon$: X...
Cluster slowdown due to Ganglia metrics filling root partition
Note This article applies to Databricks Runtime 7.3 LTS and below. Problem Clusters start slowing down and may show a combination of the following symptoms: Unhealthy cluster events are reported: Request timed out. Driver is temporarily unavailable. Metastore is down. DBFS is down. You do not see any high GC events or memory utilization associated w...
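One possible mitigation, sketched under the assumption that Ganglia keeps its round-robin database files under /var/lib/ganglia/rrds: an init script that installs a daily cron job to prune old files before they fill the root partition.

```scala
// Sketch only: delete Ganglia RRD files older than one day. Verify the RRD
// path on your runtime image before relying on this.
dbutils.fs.put("dbfs:/databricks/scripts/prune-ganglia.sh", """#!/bin/bash
cat > /etc/cron.daily/prune-ganglia <<'CRON'
#!/bin/bash
find /var/lib/ganglia/rrds -name '*.rrd' -mmin +1440 -delete
CRON
chmod +x /etc/cron.daily/prune-ganglia
""", true)
```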
Failed to create cluster with invalid tag value
Problem You are trying to create a cluster, but it is failing with an invalid tag value error message. System.Exception: Content={"error_code":"INVALID_PARAMETER_VALUE","message":"\nInvalid tag value (<<<<TAG-VALUE>>>>) - the length cannot exceed 256\nUnicode characters in UTF-8.\n "} Cause Limitations on tag Key and Value ar...
Failed to expand the EBS volume
Problem Databricks jobs fail due to a lack of space on the disk, even though storage auto-scaling is enabled. When you review the cluster event log, you see a message stating that the instance failed to expand disk due to an authorization error. Instance i-xxxxxxxxx failed to expand disk because: You are not authorized to perform this operation. En...
EBS leaked volumes
Problem After a cluster is terminated on AWS, some EBS volumes are not deleted automatically. These stray, unattached EBS volumes are often referred to as “leaked” volumes. Cause Databricks always sets DeleteOnTermination=true for the EBS volumes it creates when it launches clusters. Therefore, whenever a cluster instance is terminated, AWS should...
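To find candidates for cleanup, you can list unattached volumes; a sketch using the AWS SDK v1, assuming it is on the cluster classpath and the instance profile allows ec2:DescribeVolumes:

```scala
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{DescribeVolumesRequest, Filter}
import scala.collection.JavaConverters._

// Volumes in "available" status are unattached; review them before deleting,
// since not every unattached volume in the account is a leaked Databricks one.
val ec2 = AmazonEC2ClientBuilder.defaultClient()
val request = new DescribeVolumesRequest().withFilters(
  new Filter("status", List("available").asJava))
ec2.describeVolumes(request).getVolumes.asScala
  .foreach(v => println(s"${v.getVolumeId} ${v.getSize} GiB"))
```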
Log delivery fails with AssumeRole
Problem You are using AssumeRole to send cluster logs to an S3 bucket in another account and you get an access denied error. Cause AssumeRole does not allow you to send cluster logs to an S3 bucket in another account. This is because the log daemon runs on the host machine. It does not run inside the container. Only items that run inside the container...
Multi-part upload failure
Problem You observe a job failure with the exception: com.amazonaws.SdkClientException: Unable to complete multi-part upload. Individual part upload failed : Unable to execute HTTP request: Timeout waiting for connection from pool org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool ... com.amazonaws.http.Ama...
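If your job reads and writes S3 through the S3A connector (an assumption about your setup, not a universal fix), one common mitigation is to enlarge the HTTP connection pool so uploads stop waiting on connections:

```scala
// fs.s3a.connection.maximum is a standard hadoop-aws property; its default
// is small relative to heavily parallel multi-part uploads.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.maximum", "200")
```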
Persist Apache Spark CSV metrics to a DBFS location
Spark has a configurable metrics system that supports a number of sinks, including CSV files. In this article, we are going to show you how to configure a Databricks cluster to use a CSV sink and persist those metrics to a DBFS location. Create an init script All of the configuration is done in an init script. The init script does the following thre...
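A sketch of the approach: an init script that writes a metrics.properties enabling Spark's built-in CsvSink. The conf path and the DBFS-backed target directory are assumptions to adapt.

```scala
// The *.sink.csv.* keys are standard Spark metrics-system properties;
// every component then writes CSV metrics into the chosen directory.
dbutils.fs.put("dbfs:/databricks/scripts/csv-metrics.sh", """#!/bin/bash
mkdir -p /dbfs/cluster-metrics
cat > /databricks/spark/conf/metrics.properties <<EOF
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/dbfs/cluster-metrics
EOF
""", true)
```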
Replay Apache Spark events in a cluster
The Spark UI is commonly used as a debugging tool for Spark jobs. If the Spark UI is inaccessible, you can load the event logs in another cluster and use the Event Log Replay notebook to replay the Spark events. Warning Cluster log delivery is not enabled by default. You must enable cluster log delivery before starting your cluster, otherwise there ...
S3 connection fails with "No role specified and no roles available"
Problem You are using Databricks Utilities (dbutils) to access an S3 bucket, but it fails with a No role specified and no roles available error. You have confirmed that the instance profile associated with the cluster has the permissions needed to access the S3 bucket. Unable to load AWS credentials from any provider in the chain: [com.databricks.bac...
Set Apache Hadoop core-site.xml properties
You have a scenario that requires Apache Hadoop properties to be set. You would normally do this in the core-site.xml file. In this article, we explain how you can set core-site.xml in a cluster. Create the core-site.xml file in DBFS You need to create a core-site.xml file and save it to DBFS on your cluster. An easy way to create this file is via a...
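One easy way, as a sketch: write the file from a notebook with dbutils, then have an init script copy it into the Hadoop conf directory. The property shown is only an illustration.

```scala
// Writes a minimal core-site.xml to DBFS; an init script can then copy it
// into place on each node at cluster start.
dbutils.fs.put("dbfs:/databricks/conf/core-site.xml", """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.connection.timeout</name>
    <value>50000</value>
  </property>
</configuration>
""", true)
```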
Set executor log level
Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may be impacted by these vulnerabilitie...
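The usual pattern, sketched here assuming Log4j 1.x per the warning above, is to run a small job across the cluster and set the level inside each executor JVM:

```scala
import org.apache.log4j.{Level, LogManager}

// Run many small tasks so every executor executes at least one; each task
// sets the root logger level in its executor's JVM.
sc.parallelize(1 to 100, 100).foreachPartition { _ =>
  LogManager.getRootLogger.setLevel(Level.DEBUG)
}
```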
Set instance_profile_arn as optional with a cluster policy
In this article, we review the steps to create a cluster policy for the AWS attribute instance_profile_arn and define it as optional. This allows you to start a cluster with a specific AWS instance profile. You can also start a cluster without an instance profile. Note You must be an admin user in order to manage cluster policies. Create a new clust...
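A sketch of the policy definition: the isOptional flag is what lets clusters launch without a profile. The attribute path and type follow the cluster policies API.

```json
{
  "aws_attributes.instance_profile_arn": {
    "type": "unlimited",
    "isOptional": true
  }
}
```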
Apache Spark job doesn’t start
Problem No Spark jobs start, and the driver logs contain the following error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources Cause This error can occur when the executor memory and number of executor cores are set explicitly on the Spark Config tab. Here is a samp...
Auto termination is disabled when starting a job cluster
Problem You are trying to start a job cluster, but the job creation fails with an error message. Error creating job Cluster autotermination is currently disabled. Cause Job clusters auto terminate once the job is completed. As a result, they do not support explicit autotermination policies. If you include autotermination_minutes in your cluster poli...
Unexpected cluster termination
Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation. Databricks ini...
How to configure single-core executors to run JNI libraries
When you create a cluster, Databricks launches one Apache Spark executor instance per worker node, and the executor uses all of the cores on the node. In certain situations, such as if you want to run non-thread-safe JNI libraries, you might need an executor that has only one core or task slot, and does not attempt to run concurrent tasks. In this c...
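In outline, you pin each executor to a single task slot through the cluster's Spark config; a sketch assuming a 4-core worker type (the memory value is illustrative):

```
spark.executor.cores 1
spark.executor.memory 2g
```

With one core per executor, each worker node hosts several executors, and no executor ever runs concurrent tasks, which is what non-thread-safe JNI libraries need.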
How to overwrite log4j configurations on Databricks clusters
Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may be impacted by these vulnerabilitie...
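A sketch of the init-script approach: append a logger override to the driver's Log4j 1.x properties file. The path below is the conventional location on affected runtimes; verify it before relying on this.

```scala
// Appends rather than overwrites, so the stock configuration stays intact.
dbutils.fs.put("dbfs:/databricks/scripts/overwrite-log4j.sh", """#!/bin/bash
echo "log4j.logger.org.apache.spark=DEBUG" >> /databricks/spark/dbconf/log4j/driver/log4j.properties
""", true)
```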
Apache Spark executor memory allocation
By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM) memory heap. This is controlled by the spark.executor.memory property. However, some unexpected behaviors were observed on instances with a large amount of memory allocated. As JVMs scale up in memory size, issues with the garbage collecto...
Apache Spark UI shows less than total node memory
Problem The Executors tab in the Spark UI shows less memory than is actually available on the node: AWS An m4.xlarge instance (16 GB RAM, 4 cores) for the driver node shows 4.5 GB memory on the Executors tab. An m4.large instance (8 GB RAM, 2 cores) for the driver node shows 710 MB memory on the Executors tab. Azure An F8s instance (16 GB, 4 core) f...
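The gap is expected: the Executors tab reports only Spark's unified memory region, not the node's physical memory. A rough back-of-envelope using standard Apache Spark accounting (the heap size below is an example, not the exact Databricks allocation):

```scala
// Spark shows (heap - 300 MB reserved) * spark.memory.fraction (default 0.6),
// and the JVM heap is itself only part of the node's physical memory.
val heapMb = 8192L                          // example executor/driver heap
val shownMb = ((heapMb - 300) * 0.6).toLong // ≈ 4735 MB reported in the UI
```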
Configure a cluster to use a custom NTP server
By default, Databricks clusters use public NTP servers. This is sufficient for most use cases; however, you can configure a cluster to use a custom NTP server. This does not have to be a public NTP server. It can be a private NTP server under your control. A common use case is to minimize the amount of Internet traffic from your cluster. Update the NT...
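A sketch of the approach: a cluster-scoped init script that points the node's NTP daemon at your own server. The server name is a placeholder, and the config path and restart command vary with the OS image.

```scala
// Replaces the NTP server list and restarts the daemon on each node at start.
dbutils.fs.put("dbfs:/databricks/scripts/custom-ntp.sh", """#!/bin/bash
echo "server ntp.example.com iburst" > /etc/ntp.conf
service ntp restart
""", true)
```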
Enable GCM cipher suites
Databricks clusters do not have GCM (Galois/Counter Mode) cipher suites enabled by default. You must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher suites. Verify required cipher suites Use the nmap utility to verify which cipher suites are required by the external server. %sh nmap --script ssl-enu...
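GCM suites are typically disabled through the JVM's jdk.tls.disabledAlgorithms list in java.security, so one sketch of a fix is an init script that strips GCM from that list. The java.security path glob below is an assumption; confirm the location and the exact entry on your runtime before using this.

```scala
// Sketch only: removes ", GCM" from jdk.tls.disabledAlgorithms in java.security.
dbutils.fs.put("dbfs:/databricks/scripts/enable-gcm.sh", """#!/bin/bash
sed -i 's/, GCM//' /usr/lib/jvm/*/jre/lib/security/java.security
""", true)
```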
Enable retries in init script
Init scripts are commonly used to configure Databricks clusters. There are some scenarios where you may want to implement retries in an init script. Example init script This sample init script shows you how to implement a retry for a basic copy operation. You can use this sample code as a base for implementing retries in your own init script. %scala...
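A sketch in the same spirit: write an init script whose copy step retries up to five times with a pause between attempts. The source and target paths are placeholders.

```scala
// The until loop exits early on the first successful cp; otherwise it sleeps
// and retries, giving transient storage hiccups time to clear.
dbutils.fs.put("dbfs:/databricks/scripts/copy-with-retry.sh", """#!/bin/bash
n=0
until [ $n -ge 5 ]; do
  cp /dbfs/init-data/myfile.conf /usr/local/myfile.conf && break
  n=$((n+1))
  sleep 10
done
""", true)
```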