Problem
You are seeing intermittent Apache Spark job failures on jobs that use shuffle fetch, with log messages like the following:
21/02/01 05:59:55 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, 10.79.1.45, executor 0): FetchFailed(BlockManagerId(1, 10.79.1.134, 4048, None), shuffleId=1, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.79.1.134:4048
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:553)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:484)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:63)
    ... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /10.79.1.134:4048
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
Cause
This can happen if you have modified the Azure Databricks subnet CIDR range after deployment. Changing the subnet CIDR range after deployment is not supported.
Consider the following two configurations:
Original Azure Databricks subnet CIDR
- Private subnet: 10.10.0.0/24 (10.10.0.0 - 10.10.0.255)
- Public subnet: 10.10.1.0/24 (10.10.1.0 - 10.10.1.255)
Modified Azure Databricks subnet CIDR
- Private subnet: 10.10.0.0/18 (10.10.0.0 - 10.10.63.255)
- Public subnet: 10.10.64.0/18 (10.10.64.0 - 10.10.127.255)
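The address ranges shown in parentheses can be checked with Python's standard ipaddress module. This is a minimal sketch for verification only. The helper name cidr_range is hypothetical, and the modified public subnet is entered with an /18 prefix on the assumption that it matches the listed range of 10.10.64.0 - 10.10.127.255.

```python
import ipaddress


def cidr_range(cidr):
    """Return the first and last address covered by a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net.network_address), str(net.broadcast_address)


# Example values taken from the scenario above.
subnets = [
    ("original private", "10.10.0.0/24"),
    ("original public", "10.10.1.0/24"),
    ("modified private", "10.10.0.0/18"),
    # Assumption: an /18 prefix, which matches the range 10.10.64.0 - 10.10.127.255.
    ("modified public", "10.10.64.0/18"),
]

for label, cidr in subnets:
    first, last = cidr_range(cidr)
    print(f"{label}: {cidr} covers {first} - {last}")
```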
With the original settings, everything works as intended.
With the modified settings, if executors are assigned IP addresses in the range 10.10.1.0 - 10.10.63.255 and the driver is assigned an IP address in the range 10.10.0.0 - 10.10.0.255, communication between executors is blocked by a firewall rule that limits communication to the original CIDR range of 10.10.0.0/24.
If the executors and driver are both assigned IP addresses in 10.10.0.0/24, no communication is blocked and the job runs as intended. However, this assignment is not guaranteed under the modified settings.
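The reasoning above can be illustrated with a short Python sketch. It is a simplified model, not the actual firewall implementation: it assumes the rule permits traffic only when both endpoints fall inside the original private subnet 10.10.0.0/24. The function names and the sample IP addresses are hypothetical.

```python
import ipaddress

# Assumption: the firewall rule only permits traffic between addresses
# inside the original private subnet from the example scenario.
ORIGINAL_PRIVATE = ipaddress.ip_network("10.10.0.0/24")


def in_original_range(ip):
    """Return True if the address lies inside the original CIDR range."""
    return ipaddress.ip_address(ip) in ORIGINAL_PRIVATE


def can_communicate(ip_a, ip_b):
    """Simplified model: both endpoints must be inside the original range."""
    return in_original_range(ip_a) and in_original_range(ip_b)


# Both nodes inside 10.10.0.0/24: traffic is allowed in this model.
print(can_communicate("10.10.0.12", "10.10.0.200"))

# One node assigned an address from the expanded /18 range: traffic is blocked.
print(can_communicate("10.10.0.12", "10.10.5.17"))
```

Because address assignment under the modified settings can place any node outside 10.10.0.0/24, the second case can occur on some cluster launches and not others, which is consistent with the failures being intermittent.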
Solution
- Revert any subnet CIDR changes and restore the original VNet configuration that you used to create the Azure Databricks workspace.
- Restart your cluster.
- Resubmit your jobs.