Your all-purpose clusters are failing to launch, and your jobs are failing to run. You are seeing error messages related to the Apache Spark image not downloading or not existing.
Spark image download failure.
Spark image failed to download or does not exist.
A Spark image download failure error usually indicates network latency or configuration problems that are blocking traffic to the storage account hosting the Spark image.
Check your Databricks VNet configuration to make sure traffic to your storage is not blocked by any of the following:
- DNS - By default, resources deployed to a VNet use Azure DNS for domain name resolution. If you are using a custom DNS server, you must configure it to forward these requests to the Azure recursive resolver (168.63.129.16) so that the IP addresses for Azure artifacts can be resolved. Review the Configure custom DNS documentation for more information.
You can check your DNS settings in the Azure portal. From there, navigate to your Databricks VNet and select "DNS servers" from the "Settings" menu. If the VNet uses custom DNS servers, you can add the Azure recursive resolver IP address to the list of DNS servers there.
- Firewall - If you have a firewall enabled on the VNet, review the settings and ensure it is not blocking traffic to storage.
You can check your firewall settings in the Azure portal. From there, navigate to your Databricks VNet and select "Firewall" from the "Settings" menu. If firewall rules have been created, you can view them there.
- Network Security Group (NSG) - Verify the network security group includes all required NSG rules.
You can check your NSG rules by going to the Azure portal and navigating to your VNet. From there, select "Subnets" from the "Settings" menu, and choose the security group for both subnets. Review both the inbound and outbound rules, and confirm that all required rules are present and that traffic to storage is not blocked.
- User-defined routes and service endpoints - Verify that the route table includes all required user-defined routes. If you use service endpoints rather than user-defined routes for Blob storage, check those endpoints as well.
You can check your UDR settings by going to the Azure portal and navigating to your VNet. From there, select "Subnets" from the "Settings" menu, and choose the route table for both subnets. Review the routes, and confirm that all required routes are present and that traffic to storage is not blocked.
For service endpoints, from the VNet page, select "Service endpoints" from the "Settings" menu. If service endpoints have been created, you can view them there.
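The checks above all come down to two questions: can the storage hostname be resolved, and can a connection to it be opened? The following is a minimal sketch, not an official Databricks or Azure tool, that tests both using only the Python standard library. The names `EndpointCheck` and `check_endpoint` are hypothetical helpers introduced here for illustration. The sketch assumes you run it from a machine inside the same VNet (for example, a test VM in one of the Databricks subnets) and that you supply the hostname of the storage account yourself, since the article does not name it.

```python
import socket
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EndpointCheck:
    """Outcome of a DNS lookup followed by a TCP connection attempt."""
    host: str
    port: int
    addresses: List[str] = field(default_factory=list)
    dns_ok: bool = False
    tcp_ok: bool = False
    error: Optional[str] = None


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> EndpointCheck:
    """Resolve `host`, then try to open a TCP connection to `host:port`.

    A DNS failure points at the DNS settings; a TCP failure after a
    successful lookup points at a firewall, NSG rule, or route.
    """
    result = EndpointCheck(host=host, port=port)

    # Step 1: name resolution using the resolver configured on this machine.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result.error = f"DNS resolution failed: {exc}"
        return result
    result.dns_ok = True
    result.addresses = sorted({info[4][0] for info in infos})

    # Step 2: TCP reachability. This only proves that a connection can be
    # opened; it does not validate TLS or storage permissions.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result.tcp_ok = True
    except OSError as exc:
        result.error = f"TCP connection failed: {exc}"
    return result
```

To use it, call `check_endpoint("<storage-account-hostname>")`, replacing the placeholder with the hostname you need to reach, and inspect `dns_ok`, `tcp_ok`, and `error` on the result. If `dns_ok` is false, start with the DNS settings; if `dns_ok` is true but `tcp_ok` is false, review the firewall, NSG rules, and routes.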
In addition, as a best practice, consider setting up a disaster recovery solution to minimize the impact of any service issues. Ensure the right people in your organization are notified about service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more.