Common errors using Azure Data Factory

Learn about solutions and explanations for common errors when using Azure Data Factory with Azure Databricks.

Written by Adam Pavlacka

Last published at: February 23rd, 2023


Azure Data Factory is a managed service that lets you author data pipelines using Azure Databricks notebooks, JARs, and Python scripts. This article describes common issues and solutions.

Cluster could not be created

When you create a data pipeline in Azure Data Factory that uses an Azure Databricks-related activity such as Notebook Activity, you can ask for a new cluster to be created. In Azure, cluster creation can fail for a variety of reasons:

  • Your Azure subscription is limited in the number of virtual machines that can be provisioned.
  • Failed to create cluster because of Azure quota indicates that the subscription you are using does not have enough quota to create the needed resources. For example, if you request 500 cores but your quota is 50 cores, the request will fail. Contact Azure Support to request a quota increase; a quota-check sketch follows this list.
  • Azure resource provider is currently under high load and requests are being throttled. This error indicates that your Azure subscription or perhaps even the region is being throttled. Simply retrying the data pipeline may not help. Learn more about this issue at Troubleshooting API throttling errors.
  • Could not launch cluster due to cloud provider failures indicates a generic failure to provision one or more virtual machines for the cluster. Wait and try again later.
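If the failure is quota related, you can check how much regional vCPU quota is free before the pipeline runs. The following is a minimal sketch using the Azure SDK for Python; it assumes the azure-identity and azure-mgmt-compute packages are installed and you are authenticated (for example via az login), and the subscription ID, region, and core count are placeholders to replace.

```python
# Minimal sketch: check regional vCPU quota before asking Azure Data Factory
# to create a large Databricks job cluster.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<your-subscription-id>"   # placeholder
REGION = "eastus"                            # region where the cluster will be created
CORES_NEEDED = 500                           # total cores the job cluster will request

client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

for usage in client.usage.list(REGION):
    if usage.name.value == "cores":          # regional total vCPU quota
        available = usage.limit - usage.current_value
        print(f"vCPU quota: {usage.limit}, in use: {usage.current_value}, free: {available}")
        if available < CORES_NEEDED:
            print("Not enough quota - request an increase before running the pipeline.")
```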

Cluster ran into issues during data pipeline execution

Azure Databricks includes a variety of mechanisms that increase the resilience of your Apache Spark cluster. That said, it cannot recover from every failure, and you may see errors like these:

  • Connection refused
  • RPC timed out
  • Exchange times out after X seconds
  • Cluster became unreachable during run
  • Too many execution contexts are open right now
  • Driver was restarted during run
  • Context ExecutionContextId is disconnected
  • Could not reach driver of cluster for X seconds

Most of the time, these errors do not indicate an issue with the underlying infrastructure of Azure. Instead, it is quite likely that the cluster has too many jobs running on it, which can overload the cluster and cause timeouts.

As a general rule, you should move heavier data pipelines to run on their own Azure Databricks clusters. Integrating with Azure Monitor and observing execution metrics with Grafana can provide insight into clusters that are getting overloaded.
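One way to give a heavy workload its own cluster is to run it on an ephemeral job cluster that exists only for that run; in Azure Data Factory, the equivalent is selecting a new job cluster in the Azure Databricks linked service. The sketch below shows the same idea directly against the Databricks Jobs API. The workspace URL, notebook path, and cluster sizing are assumptions, and it expects a personal access token in the DATABRICKS_TOKEN environment variable.

```python
# Minimal sketch: run a heavy notebook on its own ephemeral job cluster instead of
# a shared all-purpose cluster. The cluster is created for the run and torn down after.
import os
import requests

WORKSPACE = "https://<your-workspace>.azuredatabricks.net"   # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]

payload = {
    "run_name": "heavy-etl-dedicated-cluster",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/heavy_notebook"},  # placeholder path
            "new_cluster": {                       # dedicated cluster for this run only
                "spark_version": "12.2.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 8,
            },
        }
    ],
}

resp = requests.post(
    f"{WORKSPACE}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print("Submitted run:", resp.json()["run_id"])
```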

Azure Databricks service is experiencing high load

You may notice that certain data pipelines fail with errors like these:

  • The service at {API} is temporarily unavailable
  • Jobs is not fully initialized yet. Please retry later
  • Failed or timeout processing HTTP request
  • No webapps are available to handle your request

These errors indicate that the Azure Databricks service is under heavy load. If this happens, try limiting the number of concurrent data pipelines that include an Azure Databricks activity. For example, if you are performing ETL with 1,000 tables from source to destination, instead of launching a data pipeline per table, either combine multiple tables in one data pipeline or stagger their execution so they don’t all trigger at once.
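The sketch below illustrates the staggering approach: tables are processed in fixed-size batches, with a pause between batches and a cap on concurrent submissions. The run_etl_for_table function is hypothetical; it stands in for whatever submits one table’s ETL run.

```python
# Minimal sketch: stagger ETL for many tables instead of triggering 1,000 runs at once.
from concurrent.futures import ThreadPoolExecutor
import time

def run_etl_for_table(table: str) -> None:
    ...  # hypothetical: submit a single table's ETL run

tables = [f"table_{i}" for i in range(1000)]
BATCH_SIZE = 50          # tables per batch
PAUSE_SECONDS = 300      # gap between batches to spread the load

with ThreadPoolExecutor(max_workers=10) as pool:      # cap concurrent submissions
    for start in range(0, len(tables), BATCH_SIZE):
        batch = tables[start:start + BATCH_SIZE]
        list(pool.map(run_etl_for_table, batch))      # wait for this batch to finish
        time.sleep(PAUSE_SECONDS)
```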

Info

Azure Databricks will not allow you to create more than 1,000 jobs in a 3,600-second window. If you try to exceed this limit with Azure Data Factory, your data pipeline will fail.

These errors can also occur if you poll the Databricks Jobs API for job run status too frequently (for example, every 5 seconds). The remedy is to reduce the polling frequency.
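A minimal polling sketch with a longer interval and exponential backoff might look like the following. It assumes the requests package, a personal access token in the DATABRICKS_TOKEN environment variable, and a placeholder workspace URL.

```python
# Minimal sketch: poll a job run's status with a modest interval and exponential
# backoff instead of hitting the Jobs API every few seconds.
import os
import time
import requests

WORKSPACE = "https://<your-workspace>.azuredatabricks.net"   # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]

def wait_for_run(run_id: int, initial_interval: float = 30.0, max_interval: float = 300.0) -> str:
    interval = initial_interval
    while True:
        resp = requests.get(
            f"{WORKSPACE}/api/2.1/jobs/runs/get",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"run_id": run_id},
            timeout=30,
        )
        resp.raise_for_status()
        state = resp.json()["state"]
        if state.get("life_cycle_state") in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return state.get("result_state", state["life_cycle_state"])
        time.sleep(interval)
        interval = min(interval * 2, max_interval)   # back off instead of polling faster
```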

Library installation timeout

Azure Databricks includes robust support for installing third-party libraries. Unfortunately, you may see issues like this:

  • Failed or timed out installing libraries

This happens because every time you start a cluster with a library attached, Azure Databricks downloads the library from the appropriate repository (such as PyPI). This operation can time out, causing your cluster to fail to start.

There is no simple solution for this problem, other than limiting the number of libraries you attach to clusters.
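If you are unsure which libraries are attached to a cluster, or which one failed to install, the Libraries API can report per-library status. The sketch below assumes the requests package, a personal access token in the DATABRICKS_TOKEN environment variable, and placeholder workspace URL and cluster ID.

```python
# Minimal sketch: list the libraries attached to a cluster and their install status,
# to see which ones failed or could be trimmed.
import os
import requests

WORKSPACE = "https://<your-workspace>.azuredatabricks.net"   # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]
CLUSTER_ID = "<your-cluster-id>"                              # placeholder

resp = requests.get(
    f"{WORKSPACE}/api/2.0/libraries/cluster-status",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"cluster_id": CLUSTER_ID},
    timeout=30,
)
resp.raise_for_status()

for lib in resp.json().get("library_statuses", []):
    print(lib["status"], lib["library"])   # e.g. INSTALLED, FAILED, PENDING
```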