Jobs failing with shuffle fetch failures

Shuffle fetch failures can happen if you have modified the Azure Databricks subnet CIDR range after deployment.

Written by arjun.kaimaparambilrajan

Last published at: February 23rd, 2023

Problem

You are seeing intermittent Apache Spark job failures on jobs that perform shuffle fetches.

21/02/01 05:59:55 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, 10.79.1.45, executor 0): FetchFailed(BlockManagerId(1, 10.79.1.134, 4048, None), shuffleId=1, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.79.1.134:4048
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:553)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:484)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:63)
... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.79.1.134:4048
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

Cause

This can happen if you have modified the Azure Databricks subnet CIDR range after deployment. This behavior is not supported.
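
If you want to confirm the address prefixes currently configured on the workspace subnets, the minimal sketch below uses the Azure SDK for Python; the subscription ID, resource group, and VNet name are placeholders for your own values, and the listing call assumes the identity you run it with can read the workspace VNet.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# Placeholder values; replace with the details of your own workspace VNet.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
VNET_NAME = "<databricks-vnet>"

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Print the CIDR prefix of each subnet in the workspace VNet.
for subnet in client.subnets.list(RESOURCE_GROUP, VNET_NAME):
    print(subnet.name, subnet.address_prefix)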

Assume the following two subnet configurations:

Original Azure Databricks subnet CIDR

  • Private subnet: 10.10.0.0/24 (10.10.0.0 - 10.10.0.255)
  • Public subnet: 10.10.1.0/24 (10.10.1.0 - 10.10.1.255)

Modified Azure Databricks subnet CIDR

  • Private subnet: 10.10.0.0/18 (10.10.0.0 - 10.10.63.255)
  • Public subnet: 10.10.64.0/18 (10.10.64.0 - 10.10.127.255)

With the original settings, everything works as intended.

With the modified settings, if the executors are assigned IP addresses in the range 10.10.1.0 - 10.10.63.255 and the driver is assigned an IP address in the range 10.10.0.0 - 10.10.0.255, communication between the executors is blocked by a firewall rule that limits communication to the original CIDR range of 10.10.0.0/24.

If the executors and driver are both assigned IP addresses in 10.10.0.0/24, no communication is blocked and the job runs as intended. However, this assignment is not guaranteed under the modified settings.
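
You can reproduce the effect of the change with Python's standard ipaddress module; the driver and executor addresses below are illustrative values picked from the ranges above, not addresses taken from a real cluster.

import ipaddress

# The firewall rule still permits traffic only within the original private subnet.
original_private = ipaddress.ip_network("10.10.0.0/24")
# The modified private subnet is much larger, so nodes can land outside that rule.
modified_private = ipaddress.ip_network("10.10.0.0/18")

driver = ipaddress.ip_address("10.10.0.12")      # illustrative driver IP
executor = ipaddress.ip_address("10.10.37.201")  # illustrative executor IP

for name, addr in [("driver", driver), ("executor", executor)]:
    in_new = addr in modified_private
    in_old = addr in original_private
    print(f"{name} {addr}: in modified subnet={in_new}, allowed by old /24 rule={in_old}")

# The executor is inside the new /18 but outside the original /24, so traffic
# to or from it is dropped by the rule that still assumes 10.10.0.0/24.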

Solution

  1. Revert any subnet CIDR changes and restore the original VNet configuration that you used to create the Azure Databricks workspace.
  2. Restart your cluster. (A programmatic option is sketched after these steps.)
  3. Resubmit your jobs.
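
If you prefer to restart the cluster programmatically rather than from the UI, the minimal sketch below calls the Clusters REST API (POST /api/2.0/clusters/restart); the workspace URL, personal access token, and cluster ID are placeholders you need to supply.

import requests

# Placeholder values; replace with your workspace URL, token, and cluster ID.
WORKSPACE_URL = "https://<workspace-instance>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/restart",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID},
)
resp.raise_for_status()
print("Restart requested for cluster", CLUSTER_ID)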