Databricks Knowledge Base

Clusters

These articles can help you manage your Apache Spark clusters.

34 Articles in this category

Enable OpenJSSE and TLS 1.3

Queries and transformations are encrypted before being sent to your clusters. By default, the data exchanged between worker nodes in a cluster is not encrypted. If you require that data be encrypted at all times, you can encrypt traffic between cluster worker nodes using AES 128 over a TLS 1.2 connection. In some cases, you may want to use TLS 1.3 i...
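
The excerpt above is truncated; as a quick companion check, this minimal sketch confirms whether the driver's OpenSSL build can negotiate TLS 1.3. It only tests Python's SSL stack, not the JVM, which on Java 8 needs the OpenJSSE provider the article covers.

```python
# Minimal sketch: check TLS 1.3 support in the driver's Python/OpenSSL stack.
# This does NOT verify JVM support; on Java 8 the JVM requires the OpenJSSE
# provider to be enabled separately, as the article describes.
import ssl

print(ssl.OPENSSL_VERSION)  # e.g. "OpenSSL 1.1.1..."
print(ssl.HAS_TLSv1_3)      # True when the linked OpenSSL supports TLS 1.3
```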

Last updated: March 2nd, 2022 by Adam Pavlacka

How to calculate the number of cores in a cluster

You can view the number of cores in a Databricks cluster in the Workspace UI using the Metrics tab on the cluster details page. Note Azure Databricks cluster nodes must have a metrics service installed. If the driver and executors are of the same node type, you can also determine the number of cores available in a cluster programmatically, using Sca...
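
For the programmatic route the excerpt mentions, a minimal PySpark sketch: on a running cluster, `defaultParallelism` reports the total task slots across all executors, which approximates the worker core count.

```python
# `sc` is the SparkContext predefined in Databricks notebooks. On a running
# cluster, defaultParallelism equals the total task slots (cores) across
# all executors; treat it as an approximation, not an exact hardware count.
print(sc.defaultParallelism)
```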

Last updated: March 2nd, 2022 by Adam Pavlacka

Install a private PyPI repo

Certain use cases may require you to install libraries from private PyPI repositories. If you are installing from a public repository, you should review the library documentation. This article shows you how to configure an example init script that authenticates and downloads a PyPI library from a private repository. Create init script Create (or ver...
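
A hedged sketch of the pattern the article describes: write an init script to DBFS that points pip at a private index. The repository URL, credentials, and paths below are placeholders, not values from the article.

```python
# Hypothetical example: stage an init script that configures pip to use a
# private PyPI index. Replace the placeholder URL and credentials.
dbutils.fs.put(
    "dbfs:/databricks/scripts/private-pypi-init.sh",
    """#!/bin/bash
mkdir -p /root/.pip
cat > /root/.pip/pip.conf <<EOF
[global]
index-url = https://<user>:<token>@pypi.example.com/simple/
EOF
""",
    True,  # overwrite if the script already exists
)
```

Attach the script as a cluster-scoped init script so libraries installed on the cluster resolve against the private index.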

Last updated: March 4th, 2022 by darshan.bargal

IP access list update returns INVALID_STATE

Problem You are trying to update an IP access list and you get an INVALID_STATE error message. {"error_code":"INVALID_STATE","message":"Your current IP 3.3.3.3 will not be allowed to access the workspace under current configuration"} Cause The IP access list update that you are trying to commit does not include your current public IP address. If you...
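
To commit the change, the update you send must include your current public IP. A hedged sketch against the IP Access List API (workspace URL, token, and list ID are placeholders):

```python
import requests

# PATCH updates an existing IP access list; make sure ip_addresses includes
# your current public IP (3.3.3.3 here, matching the error above).
resp = requests.patch(
    "https://<workspace-url>/api/2.0/ip-access-lists/<list-id>",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "label": "allowed-ips",
        "list_type": "ALLOW",
        "ip_addresses": ["3.3.3.3/32", "1.2.3.0/24"],
        "enabled": True,
    },
)
print(resp.status_code, resp.text)
```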

Last updated: March 4th, 2022 by Gobinath.Viswanathan

Launch fails with Client.InternalError

Problem You deploy a new E2 workspace, but you get cluster launch failures with the message Client.InternalError. Cause You have encryption of the EBS volumes at the AWS account level or you are using a custom KMS key for EBS encryption. Either one of these scenarios can result in a Client.InternalError when you try to create a cluster in an E2 work...

Last updated: March 4th, 2022 by satyadeepak.bollineni

Cannot apply updated cluster policy

Problem You are attempting to update an existing cluster policy; however, the update does not apply to the cluster associated with the policy. If you attempt to edit a cluster that is managed by a policy, the changes are not applied or saved. Cause This is a known issue that is being addressed. Solution You can use a workaround until a permanent fix ...

Last updated: March 4th, 2022 by jordan.hicks

Cluster Apache Spark configuration not applied

Problem Your cluster’s Spark configuration values are not applied. Cause This happens when the Spark config values are declared in the cluster configuration as well as in an init script. When Spark config values are located in more than one place, the configuration in the init script takes precedence and the cluster ignores the configuration setting...
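
After editing either location, it is worth confirming which value actually took effect. A minimal check from a notebook (the key is just an example):

```python
# `spark` is the SparkSession predefined in Databricks notebooks. If an init
# script wrote a different value for this key, that value is the one returned.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```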

Last updated: March 4th, 2022 by Gobinath.Viswanathan

Cluster failed to launch

This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs. Cluster timeout Error messages: Driver failed to start in time INTERNAL_ERROR: The Spark driver failed to start within 300 seconds Cluster failed to be healthy within 200 seconds Cau...

Last updated: March 4th, 2022 by Adam Pavlacka

Custom Docker image requires root

Problem You are trying to launch a Databricks cluster with a custom Docker container, but cluster creation fails with an error. { "reason": { "code": "CONTAINER_LAUNCH_FAILURE", "type": "SERVICE_FAULT", "parameters": { "instance_id": "i-xxxxxxx", "databricks_error_message": "Failed to launch spark container on instance i-xxxx. Exception: Could not a...

Last updated: March 4th, 2022 by dayanand.devarapalli

Job fails due to cluster manager core instance request limit

Problem A Databricks Notebook or Job API returns the following error: Unexpected failure while creating the cluster for the job. Cause REQUEST_LIMIT_EXCEEDED: Your request was rejected due to API rate limit. Please retry your request later, or choose a larger node type instead. Cause The error indicates the Cluster Manager Service core instance requ...

Last updated: March 4th, 2022 by Adam Pavlacka

Admin user cannot restart cluster to run job

Problem When a user who has permission to start a cluster, such as a Databricks Admin user, submits a job that is owned by a different user, the job fails with the following message: Message: Run executed on existing cluster ID <cluster id> failed because of insufficient permissions. The error received from the cluster manager was: 'You are no...

Last updated: March 4th, 2022 by Adam Pavlacka

Cluster fails to start with dummy does not exist error

Problem You try to start a cluster, but it fails to start. You get an Apache Spark error message. Internal error message: Spark error: Driver down You review the cluster driver and worker logs and see an error message containing java.io.FileNotFoundException: File file:/databricks/driver/dummy does not exist. 21/07/14 21:44:06 ERROR DriverDaemon$: X...

Last updated: March 4th, 2022 by arvind.ravish

Cluster slowdown due to Ganglia metrics filling root partition

Note This article applies to Databricks Runtime 7.3 LTS and below. Problem Clusters start slowing down and may show a combination of the following symptoms: Unhealthy cluster events are reported: Request timed out. Driver is temporarily unavailable. Metastore is down. DBFS is down. You do not see any high GC events or memory utilization associated w...
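
A hypothetical workaround sketch, not the article's exact fix: an init script that installs a cron job pruning old Ganglia .rrd files so they cannot fill the root partition. The path and retention window are assumptions; verify them in your environment before use.

```python
dbutils.fs.put(
    "dbfs:/databricks/scripts/prune-ganglia-rrds.sh",
    """#!/bin/bash
# Every 30 minutes, delete .rrd files untouched for 2+ hours (assumed policy).
cat > /etc/cron.d/prune-ganglia-rrds <<EOF
*/30 * * * * root find /var/lib/ganglia/rrds -name '*.rrd' -mmin +120 -delete
EOF
""",
    True,
)
```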

Last updated: March 4th, 2022 by arjun.kaimaparambilrajan

Failed to create cluster with invalid tag value

Problem You are trying to create a cluster, but it is failing with an invalid tag value error message. System.Exception: Content={"error_code":"INVALID_PARAMETER_VALUE","message":"\nInvalid tag value (<<<<TAG-VALUE>>>>) - the length cannot exceed 256\nUnicode characters in UTF-8.\n "} Cause Limitations on tag Key and Value ar...

Last updated: March 4th, 2022 by kavya.parag

Failed to expand the EBS volume

Problem Databricks jobs fail due to a lack of disk space, even though storage auto-scaling is enabled. When you review the cluster event log, you see a message stating that the instance failed to expand disk due to an authorization error. Instance i-xxxxxxxxx failed to expand disk because: You are not authorized to perform this operation. En...

Last updated: March 4th, 2022 by Adam Pavlacka

EBS leaked volumes

Problem After a cluster is terminated on AWS, some EBS volumes are not deleted automatically. These stray, unattached EBS volumes are often referred to as “leaked” volumes. Cause Databricks always sets DeleteOnTermination=true for the EBS volumes it creates when it launches clusters. Therefore, whenever a cluster instance is terminated, AWS should...
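
To find candidates for cleanup, a hedged boto3 sketch that lists unattached volumes; the Vendor=Databricks tag filter is an assumption, so review each volume before deleting anything.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # adjust region
resp = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]  # unattached only
)
for vol in resp["Volumes"]:
    tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
    if tags.get("Vendor") == "Databricks":  # assumed Databricks-created tag
        print(vol["VolumeId"], vol["Size"], vol["CreateTime"])
```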

Last updated: March 4th, 2022 by Adam Pavlacka

Log delivery fails with AssumeRole

Problem You are using AssumeRole to send cluster logs to an S3 bucket in another account and you get an access denied error. Cause AssumeRole does not allow you to send cluster logs to an S3 bucket in another account. This is because the log daemon runs on the host machine. It does not run inside the container. Only items that run inside the container...

Last updated: March 4th, 2022 by dayanand.devarapalli

Multi-part upload failure

Problem You observe a job failure with the exception: com.amazonaws.SdkClientException: Unable to complete multi-part upload. Individual part upload failed : Unable to execute HTTP request: Timeout waiting for connection from pool org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool ... com.amazonaws.http.Ama...

Last updated: March 4th, 2022 by Adam Pavlacka

Persist Apache Spark CSV metrics to a DBFS location

Spark has a configurable metrics system that supports a number of sinks, including CSV files. In this article, we are going to show you how to configure a Databricks cluster to use a CSV sink and persist those metrics to a DBFS location. Create an init script All of the configuration is done in an init script. The init script does the following thre...
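
A minimal sketch of the init-script approach: append a CsvSink configuration to the cluster's metrics.properties. The sink property names are standard Spark; the metrics.properties path and output directory are assumptions to adapt.

```python
dbutils.fs.put(
    "dbfs:/databricks/scripts/csv-metrics-init.sh",
    """#!/bin/bash
mkdir -p /dbfs/cluster-metrics
cat >> /databricks/spark/conf/metrics.properties <<EOF
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/dbfs/cluster-metrics
EOF
""",
    True,
)
```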

Last updated: March 4th, 2022 by Adam Pavlacka

Replay Apache Spark events in a cluster

The Spark UI is commonly used as a debugging tool for Spark jobs. If the Spark UI is inaccessible, you can load the event logs in another cluster and use the Event Log Replay notebook to replay the Spark events. Warning Cluster log delivery is not enabled by default. You must enable cluster log delivery before starting your cluster, otherwise there ...

Last updated: March 4th, 2022 by arjun.kaimaparambilrajan

S3 connection fails with "No role specified and no roles available"

Problem You are using Databricks Utilities (dbutils) to access an S3 bucket, but it fails with a No role specified and no roles available error. You have confirmed that the instance profile associated with the cluster has the permissions needed to access the S3 bucket. Unable to load AWS credentials from any provider in the chain: [com.databricks.bac...

Last updated: March 4th, 2022 by pavan.kumarchalamcharla

Set Apache Hadoop core-site.xml properties

You have a scenario that requires Apache Hadoop properties to be set. You would normally do this in the core-site.xml file. In this article, we explain how you can set core-site.xml in a cluster. Create the core-site.xml file in DBFS You need to create a core-site.xml file and save it to DBFS on your cluster. An easy way to create this file is via a...
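
A hedged sketch of that pattern: stage the file in DBFS, then have an init script copy it into place. The property shown and the target directory are illustrative assumptions; also note that properties Spark reads can often be set more simply with the spark.hadoop.* prefix in the cluster's Spark config.

```python
# Stage an example core-site.xml in DBFS (property value is illustrative).
dbutils.fs.put(
    "dbfs:/databricks/scripts/core-site.xml",
    """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>200</value>
  </property>
</configuration>
""",
    True,
)

# Init script that copies it into the Hadoop conf directory; the target
# path is a placeholder that depends on your Databricks Runtime.
dbutils.fs.put(
    "dbfs:/databricks/scripts/core-site-init.sh",
    """#!/bin/bash
cp /dbfs/databricks/scripts/core-site.xml <hadoop-conf-dir>/core-site.xml
""",
    True,
)
```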

Last updated: March 4th, 2022 by arjun.kaimaparambilrajan

Set executor log level

Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...

Last updated: March 4th, 2022 by Adam Pavlacka

Set instance_profile_arn as optional with a cluster policy

In this article, we review the steps to create a cluster policy for the AWS attribute instance_profile_arn and define it as optional. This allows you to start a cluster with a specific AWS instance profile. You can also start a cluster without an instance profile. Note You must be an admin user in order to manage cluster policies. Create a new clust...
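
A hedged sketch of such a policy definition: an allowlist for aws_attributes.instance_profile_arn with isOptional set, so users can either pick the listed profile or start the cluster without one. The ARN is a placeholder.

```python
import json

policy = {
    "aws_attributes.instance_profile_arn": {
        "type": "allowlist",
        "values": ["arn:aws:iam::<account-id>:instance-profile/<profile-name>"],
        "isOptional": True,   # makes the attribute optional for users
        "defaultValue": "",   # no instance profile unless one is chosen
    }
}
print(json.dumps(policy, indent=2))  # paste into the cluster policy editor
```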

Last updated: March 4th, 2022 by ravirahul.padmanabhan

Apache Spark job doesn’t start

Problem No Spark jobs start, and the driver logs contain the following error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources Cause This error can occur when the executor memory and number of executor cores are set explicitly on the Spark Config tab. Here is a samp...

Last updated: March 4th, 2022 by Adam Pavlacka

Auto termination is disabled when starting a job cluster

Problem You are trying to start a job cluster, but the job creation fails with an error message. Error creating job Cluster autotermination is currently disabled. Cause Job clusters auto-terminate once the job is completed. As a result, they do not support explicit autotermination policies. If you include autotermination_minutes in your cluster poli...

Last updated: March 4th, 2022 by navya.athiraram

Unexpected cluster termination

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation. Databricks ini...

Last updated: March 4th, 2022 by Adam Pavlacka

How to configure single-core executors to run JNI libraries

When you create a cluster, Databricks launches one Apache Spark executor instance per worker node, and the executor uses all of the cores on the node. In certain situations, such as if you want to run non-thread-safe JNI libraries, you might need an executor that has only one core or task slot, and does not attempt to run concurrent tasks. In this c...
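
Assuming you follow the article and set spark.executor.cores to 1 in the cluster's Spark config, a quick notebook check that the setting took effect:

```python
# Read the cluster's SparkConf; keys return the fallback value if absent.
conf = spark.sparkContext.getConf()
print(conf.get("spark.executor.cores", "not set"))  # expect "1"
print(conf.get("spark.task.cpus", "1"))             # default is 1 CPU per task
```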

Last updated: March 4th, 2022 by Adam Pavlacka

How to overwrite log4j configurations on Databricks clusters

Warning This article describes steps related to customer use of Log4j 1.x within a Databricks cluster. Log4j 1.x is no longer maintained and has three known CVEs (CVE-2021-4104, CVE-2020-9488, and CVE-2019-17571). If your code uses one of the affected classes (JMSAppender or SocketServer), your use may potentially be impacted by these vulnerabilitie...

Last updated: March 4th, 2022 by Adam Pavlacka

Apache Spark executor memory allocation

By default, the amount of memory available for each executor is allocated within the Java Virtual Machine (JVM) memory heap. This is controlled by the spark.executor.memory property. However, some unexpected behaviors were observed on instances with a large amount of memory allocated. As JVMs scale up in memory size, issues with the garbage collecto...

Last updated: March 4th, 2022 by Adam Pavlacka

Apache Spark UI shows less than total node memory

Problem The Executors tab in the Spark UI shows less memory than is actually available on the node: AWS An m4.xlarge instance (16 GB RAM, 4 cores) for the driver node shows 4.5 GB memory on the Executors tab. An m4.large instance (8 GB RAM, 2 cores) for the driver node shows 710 MB memory on the Executors tab: Azure An F8s instance (16 GB, 4 core) f...
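
The number shown is Spark's unified memory region, not the node's physical RAM. A back-of-the-envelope sketch with illustrative defaults (not exact Databricks settings):

```python
heap_gb = 8.0      # executor JVM heap, itself only part of the node's RAM
reserved_gb = 0.3  # fixed reservation in Spark's unified memory manager
fraction = 0.6     # spark.memory.fraction default
print((heap_gb - reserved_gb) * fraction)  # ~4.6 GB reported in the UI
```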

Last updated: March 4th, 2022 by Adam Pavlacka

Configure a cluster to use a custom NTP server

By default, Databricks clusters use public NTP servers. This is sufficient for most use cases; however, you can configure a cluster to use a custom NTP server. This does not have to be a public NTP server; it can be a private NTP server under your control. A common use case is to minimize the amount of Internet traffic from your cluster. Update the NT...
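
A hedged sketch of the init script: rewrite ntp.conf to point at your private server. The hostname is a placeholder, and the service restart command may differ by runtime image.

```python
dbutils.fs.put(
    "dbfs:/databricks/scripts/custom-ntp-init.sh",
    """#!/bin/bash
echo "server ntp.internal.example.com iburst" > /etc/ntp.conf
service ntp restart || true
""",
    True,
)
```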

Last updated: March 4th, 2022 by Adam Pavlacka

Enable GCM cipher suites

Databricks clusters do not have GCM (Galois/Counter Mode) cipher suites enabled by default. You must enable GCM cipher suites on your cluster to connect to an external server that requires GCM cipher suites. Verify required cipher suites Use the nmap utility to verify which cipher suites are required by the external server. %sh nmap --script ssl-enu...
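
As a hedged complement to the nmap check, this sketch connects from the driver and prints the cipher suite the server actually negotiates; it exercises Python's OpenSSL stack, not the JVM configuration the article modifies. The hostname is a placeholder.

```python
import socket
import ssl

host = "server.example.com"
ctx = ssl.create_default_context()
with ctx.wrap_socket(socket.create_connection((host, 443)),
                     server_hostname=host) as s:
    print(s.cipher())  # e.g. ('ECDHE-RSA-AES256-GCM-SHA384', 'TLSv1.2', 256)
```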

Last updated: March 4th, 2022 by Adam Pavlacka

Enable retries in init script

Init scripts are commonly used to configure Databricks clusters. There are some scenarios where you may want to implement retries in an init script. Example init script This sample init script shows you how to implement a retry for a basic copy operation. You can use this sample code as a base for implementing retries in your own init script. %scala...
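
The article's sample is Scala; an equivalent hypothetical sketch in Python stages an init script whose copy step retries with a short backoff (paths are placeholders):

```python
dbutils.fs.put(
    "dbfs:/databricks/scripts/retry-copy-init.sh",
    """#!/bin/bash
for attempt in 1 2 3 4 5; do
  if cp /dbfs/configs/my.conf /etc/my.conf; then
    echo "copy succeeded on attempt $attempt"
    break
  fi
  echo "copy failed on attempt $attempt, retrying..."
  sleep 5
done
""",
    True,
)
```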

Last updated: March 4th, 2022 by arjun.kaimaparambilrajan


© Databricks 2022. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
