Databricks Knowledge Base


Jobs (AWS)

These articles can help you with your Databricks jobs.

21 Articles in this category


Distinguish active and dead jobs

Problem On clusters with too many concurrent jobs, you often see some jobs stuck in the Spark UI without any progress. This makes it difficult to identify which jobs/stages are active and which are dead. Cause Whenever there are too many concurrent jobs running on a cluster, there is a chance that the Spark internal eventListenerBus...

Last updated: May 10th, 2022 by Adam Pavlacka

Spark job fails with Driver is temporarily unavailable

Problem A Databricks notebook returns the following error: Driver is temporarily unavailable This issue can be intermittent or persistent. A related error message is: Lost connection to cluster. The notebook may have been detached. Cause One common cause for this error is that the driver is undergoing a memory bottleneck. When this happens, the driver cras...

Last updated: May 10th, 2022 by Adam Pavlacka

How to delete all jobs using the REST API

Run the following commands to delete all jobs in a Databricks workspace. Identify the jobs to delete and list them in a text file: %sh curl -X GET -H "Authorization: Bearer <token>" https://<databricks-instance>/api/2.0/jobs/list | grep -o -P 'job_id.{0,6}' | awk -F':' '{print $2}' >> job_id.txt Run the curl command in a loop to delete the identif...
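As a hedged alternative to the shell snippet above, the same clean-up can be sketched in a Python notebook cell using the Jobs API 2.0 list and delete endpoints. The workspace URL and personal access token below are placeholders, and workspaces with many jobs may need to page through the list results.

%python
import requests

# Placeholders: replace with your workspace URL and a personal access token.
HOST = "https://<databricks-instance>"
HEADERS = {"Authorization": "Bearer <token>"}

# List every job in the workspace (Jobs API 2.0).
jobs = requests.get(f"{HOST}/api/2.0/jobs/list", headers=HEADERS).json().get("jobs", [])

# Delete each job by ID.
for job in jobs:
    requests.post(f"{HOST}/api/2.0/jobs/delete", headers=HEADERS, json={"job_id": job["job_id"]})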

Last updated: May 10th, 2022 by Adam Pavlacka

Identify less used jobs

The workspace has a limit on the number of jobs that can be shown in the UI. The current job limit is 1000. If you exceed the job limit, you receive a QUOTA_EXCEEDED error message. 'error_code':'QUOTA_EXCEEDED','message':'The quota for the number of jobs has been reached. The current quota is 1000. This quota is only applied to jobs created through ...

Last updated: May 10th, 2022 by Adam Pavlacka

Job cluster limits on notebook output

Problem You are running a notebook on a job cluster and you get an error message indicating that the output is too large. The output of the notebook is too large. Cause: rpc response (of 20975548 bytes) exceeds limit of 20971520 bytes Cause This error message can occur in a job cluster whenever the notebook output is greater than 20 MB. If you are u...
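A hedged workaround sketch, assuming the large output comes from displaying a full result set in the notebook: persist the result to a table and show only a small preview, so the job never has to return more than a few rows of output. The table name is a hypothetical example.

%python
# Notebook cell: keep the full result in storage, preview only a sample.
df = spark.range(0, 10_000_000)                            # stand-in for a large result
df.write.mode("overwrite").saveAsTable("example_results")  # hypothetical table name
display(df.limit(100))                                     # small preview stays well under 20 MB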

Last updated: May 10th, 2022 by Jose Gonzalez

Job fails, but Apache Spark tasks finish

Problem Your Databricks job reports a failed status, but all Spark jobs and tasks have successfully completed. Cause You have explicitly called spark.stop() or System.exit(0) in your code. If either of these is called, the Spark context is stopped, but the graceful shutdown and handshake with the Databricks job service does not happen. Solution Do ...

Last updated: May 10th, 2022 by harikrishnan.kunhumveettil

Job fails due to job rate limit

Problem A Databricks notebook or Jobs API request returns the following error: Error : {"error_code":"INVALID_STATE","message":"There were already 1000 jobs created in past 3600 seconds, exceeding rate limit: 1000 job creations per 3600 seconds."} Cause This error occurs because the number of jobs per hour exceeds the limit of 1000 established by Da...
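One hedged mitigation, assuming the limit is being hit by creating a new job for every run: reuse an existing job definition where possible, and back off and retry when the API returns INVALID_STATE. A minimal Python sketch with placeholder host, token, and job settings:

%python
import time
import requests

HOST = "https://<databricks-instance>"                        # placeholder
HEADERS = {"Authorization": "Bearer <token>"}
payload = {"name": "example-job", "max_concurrent_runs": 1}   # illustrative job settings

# Back off and retry when job creation hits the hourly rate limit.
for attempt in range(5):
    resp = requests.post(f"{HOST}/api/2.0/jobs/create", headers=HEADERS, json=payload)
    if resp.ok:
        break
    if resp.json().get("error_code") == "INVALID_STATE":
        time.sleep(60 * (attempt + 1))   # wait for the rate-limit window to clear
    else:
        resp.raise_for_status()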

Last updated: May 10th, 2022 by Adam Pavlacka

Create table in overwrite mode fails when interrupted

Problem When you attempt to rerun an Apache Spark write operation by cancelling the currently running job, the following error occurs: Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.`testtable`'). The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable) already exists.; Caus...
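A hedged clean-up sketch, assuming the leftover managed-table location simply needs to be removed before the write is rerun. The database, table, and path below mirror the error message but are hypothetical for your workspace.

%python
# Notebook cell: drop any partial metadata and remove the orphaned directory.
spark.sql("DROP TABLE IF EXISTS testdb.testtable")
dbutils.fs.rm("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable", True)   # recursive delete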

Last updated: May 10th, 2022 by Adam Pavlacka

Apache Spark Jobs hang due to non-deterministic custom UDF

Problem Sometimes Apache Spark jobs hang indefinitely due to the non-deterministic behavior of a Spark User-Defined Function (UDF). Here is an example of such a function: %scala val convertorUDF = (commentCol: String) => { // UDF definition } val translateColumn = udf(convertorUDF) If you call this UDF using the withColumn() A...
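A hedged PySpark equivalent of the pattern above: marking the UDF as non-deterministic with asNondeterministic() tells the optimizer not to assume it can freely re-execute or reorder the function. The function body and column names here are hypothetical stand-ins.

%python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def convert_comment(comment):
    # Hypothetical stand-in for the real conversion logic.
    return comment.strip().lower() if comment else None

convertor_udf = udf(convert_comment, StringType()).asNondeterministic()

df = spark.createDataFrame([("Hello World",)], ["comment"])
df.withColumn("converted", convertor_udf("comment")).show()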

Last updated: May 10th, 2022 by Adam Pavlacka

Apache Spark job fails with Failed to parse byte string

Problem Spark-submit jobs fail with a Failed to parse byte string: -1 error message. java.util.concurrent.ExecutionException: java.lang.NumberFormatException: Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m. Failed to parse byte string: -1 at java.util.concurre...
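The message usually indicates that a size-typed Spark setting was given a bare number such as -1 where a byte string is expected. A hedged sketch with an illustrative configuration and value:

%python
from pyspark.sql import SparkSession

# Size-typed settings take byte strings such as "4g" or "512m" (or 0 where
# "unlimited" is supported); a bare -1 fails with "Failed to parse byte string".
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "4g")   # illustrative value, not -1
    .getOrCreate()
)
print(spark.conf.get("spark.driver.maxResultSize"))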

Last updated: May 10th, 2022 by noopur.nigam

Apache Spark UI shows wrong number of jobs

Problem You are reviewing the number of active Apache Spark jobs on a cluster in the Spark UI, but the number is too high to be accurate. If you restart the cluster, the number of jobs shown in the Spark UI is correct at first, but over time it grows abnormally high. Cause The Spark UI is not always accurate for large, or long-running, clusters due ...

Last updated: May 11th, 2022 by ashish

Apache Spark job fails with a Connection pool shut down error

Problem A Spark job fails with the error message java.lang.IllegalStateException: Connection pool shut down when attempting to write data into a Delta table on S3. Cause Spark jobs writing to S3 are limited to a maximum number of simultaneous connections. The java.lang.IllegalStateException: Connection pool shut down occurs when this connection pool...
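A hedged mitigation sketch: enlarge the S3A connection pool so concurrent Delta writes to S3 do not exhaust it. fs.s3a.connection.maximum is the standard Hadoop S3A setting; the value is an illustrative example, not a tuned recommendation.

%python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")   # example pool size
    .getOrCreate()
)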

Last updated: May 11th, 2022 by noopur.nigam

Job fails with atypical errors message

Problem Your job run fails with a throttled due to observing atypical errors error message. Cluster became unreachable during run Cause: xxx-xxxxxx-xxxxxxx is throttled due to observing atypical errors Cause The jobs on this cluster have returned too many large results to the Apache Spark driver node. As a result, the chauffeur service runs out of m...

Last updated: May 11th, 2022 by Adam Pavlacka

Apache Spark job fails with maxResultSize exception

Problem A Spark job fails with a maxResultSize exception: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB) Cause This error occurs because the configured size limit was exceeded. The size limit applies to the total serialized ...
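Two hedged options, depending on the workload: raise spark.driver.maxResultSize, or avoid pulling large results back to the driver at all. The values and output path below are illustrative examples.

%python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "8g")   # example limit; 0 means unlimited
    .getOrCreate()
)

df = spark.range(0, 100_000_000)
df.write.mode("overwrite").parquet("/tmp/large_output")   # write out instead of collect()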

Last updated: May 11th, 2022 by Adam Pavlacka

Databricks job fails because library is not installed

Problem A Databricks job fails because the job requires a library that is not yet installed, causing Import errors. Cause The error occurs because the job starts running before required libraries install. If you run a job on a cluster in either of the following situations, the cluster can experience a delay in installing libraries: When you start an...
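A hedged defensive sketch for the start of the job notebook: wait briefly for a cluster library to become importable before the rest of the job runs. The module name, retry count, and delay are hypothetical examples.

%python
import importlib
import time

def wait_for_module(name, retries=10, delay=30):
    # Retry the import while the cluster library may still be installing.
    for _ in range(retries):
        try:
            return importlib.import_module(name)
        except ImportError:
            time.sleep(delay)
    raise ImportError(f"{name} was not available after {retries * delay} seconds")

some_library = wait_for_module("some_library")   # hypothetical module name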

Last updated: May 11th, 2022 by Adam Pavlacka

Job failure due to Azure Data Lake Storage (ADLS) CREATE limits

Problem When you run a job that involves creating files in Azure Data Lake Storage (ADLS), either Gen1 or Gen2, the following exception occurs: Caused by: java.io.IOException: CREATE failed with error 0x83090c25 (Files and folders are being created at too high a rate). [745c5836-264e-470c-9c90-c605f1c100f5] failed with error 0x83090c25 (Files and fo...
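A hedged mitigation sketch: reduce how many files the job creates at once, for example by coalescing the output partitions before writing to ADLS. The partition count, container, and path are illustrative examples.

%python
# Notebook cell: fewer output partitions means fewer simultaneous CREATE calls.
df = spark.range(0, 1_000_000)   # stand-in for the job's output DataFrame
(
    df.coalesce(32)
      .write.mode("overwrite")
      .parquet("abfss://container@account.dfs.core.windows.net/output/")
)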

Last updated: May 11th, 2022 by Adam Pavlacka

Job fails with invalid access token

Problem Long running jobs, such as streaming jobs, fail after 48 hours when using dbutils.secrets.get() (AWS | Azure | GCP). For example: %python streamingInputDF1 = ( spark .readStream .format("delta") .table("default.delta_sorce") ) def writeIntodelta(batchDF, batchId): table_name = dbutil...

Last updated: May 11th, 2022 by manjunath.swamy

How to ensure idempotency for jobs

When you submit jobs through the Databricks Jobs REST API, idempotency is not guaranteed. If the client request is timed out and the client resubmits the same request, you may end up with duplicate jobs running. To ensure job idempotency when you submit jobs through the Jobs API, you can use an idempotency token to define a unique value for a specif...
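A hedged sketch of an idempotent one-time run submission: pass the same idempotency_token on every retry so a timed-out request cannot spawn a duplicate run. The host, token, cluster spec, and notebook path are hypothetical placeholders.

%python
import uuid
import requests

HOST = "https://<databricks-instance>"        # placeholder
HEADERS = {"Authorization": "Bearer <token>"}
token = str(uuid.uuid4())                     # generate once per logical run, reuse on retries

payload = {
    "run_name": "idempotent-example",
    "idempotency_token": token,
    "new_cluster": {"spark_version": "10.4.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2},
    "notebook_task": {"notebook_path": "/Users/someone@example.com/my_notebook"},
}

resp = requests.post(f"{HOST}/api/2.0/jobs/runs/submit", headers=HEADERS, json=payload)
resp.raise_for_status()
print(resp.json())   # retrying with the same token returns the existing run instead of a new one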

Last updated: May 11th, 2022 by Adam Pavlacka

Monitor running jobs with a Job Run dashboard

The Job Run dashboard is a notebook that displays information about all of the jobs currently running in your workspace. To configure the dashboard, you must have permission to attach a notebook to an all-purpose cluster in the workspace you want to monitor. If an all-purpose cluster does not exist, you must have permission to create one. Once the d...

Last updated: May 11th, 2022 by Adam Pavlacka

Streaming job has degraded performance

Problem You have a streaming job whose performance degrades over time. You start a new streaming job with the same configuration and the same source, and it performs better than the existing job. Cause Issues with old checkpoints can result in performance degradation in long-running streaming jobs. This can happen if the job was intermittently ha...
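A hedged restart sketch, assuming the fix is to abandon the problematic checkpoint: point the restarted query at a fresh checkpoint location (which also resets streaming state, so plan for reprocessing). Table names and paths are hypothetical.

%python
# Notebook cell: restart the stream with a new checkpoint directory.
stream_df = spark.readStream.format("delta").table("source_table")

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "dbfs:/checkpoints/my_stream_v2")   # fresh location
    .toTable("target_table")
)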

Last updated: May 11th, 2022 by ashish

Task deserialization time is high

Problem Your tasks are running slower than expected. You review the stage details in the Spark UI on your cluster and see that task deserialization time is high. Cause Cluster-installed libraries (AWS | Azure | GCP) are only installed on the driver when the cluster is started. These libraries are only installed on the executors when the first tasks ...

Last updated: May 11th, 2022 by Adam Pavlacka


© Databricks 2022. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
