Connection retries take a long time to fail

The default Apache Hadoop values for connection timeout and retry are high. Reduce the values for quicker failures.

Written by sivaprasad.cs

Last published at: December 21st, 2022

Problem

You are trying to access a table on a remote HDFS location or an object store that you do not have permission to access. The SELECT command should fail, and it does, but it does not fail quickly. It can take up to ten minutes, sometimes more, to return a ConnectTimeoutException error message.

The error message you eventually receive is:

Error in SQL statement: ConnectTimeoutException: Call From 1006-163012-faded894-10-133-241-86/127.0.1.1 to analytics.aws.healthverity.com:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=analytics.aws.healthverity.com/10.24.12.199:8020]; For more details see: SocketTimeout - HADOOP2 - Apache Software Foundation

Cause

Everything is working as designed; however, the default Apache Hadoop values for connection timeout and retry are high, which is why the connection does not fail quickly.

ipc.client.connect.timeout 20000
ipc.client.connect.max.retries.on.timeouts 45

With a 20000 millisecond (20 second) timeout and up to 45 retries, a connection can take as long as 15 minutes to fail.

Review the complete list of Hadoop common core-default.xml values.

Review the SocketTimeout documentation for more details.
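
To confirm the values in effect on a running cluster, you can read them back from the Hadoop configuration. A minimal Python sketch, assuming a Databricks notebook where spark is the active SparkSession (the _jsc handle is internal to PySpark, so treat this as a diagnostic aid rather than a supported API):

# Read the effective IPC client settings from the cluster's Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("ipc.client.connect.timeout"))                  # 20000 (milliseconds) by default
print(hadoop_conf.get("ipc.client.connect.max.retries.on.timeouts"))  # 45 by default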

Solution

You can resolve the issue by reducing the values for connection timeout and retry.

  • The ipc.client.connect.timeout value is in milliseconds.
  • The ipc.client.connect.max.retries.on.timeouts value is the number of times to retry before failing.

Set these values in your cluster's Spark config (AWS | Azure).
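
Hadoop properties set through the Spark config are normally forwarded with the spark.hadoop. prefix; this prefix convention is standard Spark behavior for passing options through to the Hadoop configuration, so confirm it against your workspace documentation. For example, an entry for the timeout would take this form (the value here is a placeholder):

spark.hadoop.ipc.client.connect.timeout <timeout-in-milliseconds>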

If you are not sure what values to use, these are the Databricks recommended values:

ipc.client.connect.timeout 5000
ipc.client.connect.max.retries.on.timeouts 3
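
With the spark.hadoop. prefix described above, the recommended values become the following Spark config entries:

spark.hadoop.ipc.client.connect.timeout 5000
spark.hadoop.ipc.client.connect.max.retries.on.timeouts 3

If you want to verify the behavior from a notebook before editing the cluster config, a minimal Python sketch, again assuming spark is the active SparkSession and using PySpark's internal _jsc handle (connections that are already open may keep the old settings):

# Apply the recommended values to the running cluster's Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("ipc.client.connect.timeout", "5000")               # 5 seconds, in milliseconds
hadoop_conf.set("ipc.client.connect.max.retries.on.timeouts", "3")  # give up after 3 retries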