Problem
Your Apache Spark job fails with an error message such as the following.
shaded.databricks.org.apache.hadoop.fs.s3a.DatabricksThrottledException: Instantiate shaded.databricks.org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider on : com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: Rate exceeded (Service: AWSSecurityTokenService; Status Code: 400; Error Code: Throttling; Request ID: XXXXXXXX; Proxy: null)
Cause
An excessive number of AssumeRole API calls is being made from the same instance type within a short timeframe, causing AWS STS to throttle the requests.
This can happen when Spark configurations that should not be modified are overridden, such as the following (a quick way to check them is sketched after this list):
spark.hadoop.fs.s3.impl
spark.hadoop.fs.s3n.impl
spark.hadoop.fs.s3a.impl
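If you want to confirm whether any of these implementation classes have been overridden on a running cluster, you can read them back from a notebook. The following Python sketch assumes a Databricks notebook where the spark session is already available; the <not set> placeholder is only a display default used here, not a real configuration value.
# Minimal sketch: check whether the fs.s3* implementation classes listed
# above have been overridden on the current cluster.
keys = [
    "spark.hadoop.fs.s3.impl",
    "spark.hadoop.fs.s3n.impl",
    "spark.hadoop.fs.s3a.impl",
]
for key in keys:
    # spark.conf.get returns the supplied default when the key is absent.
    print(f"{key} = {spark.conf.get(key, '<not set>')}")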
Additionally, you may have an STS endpoint set at the global level, which can contribute to the issue and is more costly. For example, using the global endpoint sts.amazonaws.com instead of the regional endpoint sts.<region>.amazonaws.com can lead to failures in the Spark runtime.
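As an illustration only, one place where an STS endpoint can be pinned is the Hadoop S3A assumed-role configuration. The exact properties depend on your Hadoop/S3A version and credential provider setup, so treat the keys below as an example rather than a prescribed setting, and note that us-west-2 is a placeholder region. A global override in a cluster's Spark config might look like:
spark.hadoop.fs.s3a.assumed.role.sts.endpoint sts.amazonaws.com
whereas the regional form would look like:
spark.hadoop.fs.s3a.assumed.role.sts.endpoint sts.us-west-2.amazonaws.com
spark.hadoop.fs.s3a.assumed.role.sts.endpoint.region us-west-2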
Solution
- Remove the following Spark configurations from all clusters, including interactive and job clusters.
spark.hadoop.fs.s3.impl
spark.hadoop.fs.s3n.impl
spark.hadoop.fs.s3a.impl
- Ensure that the STS endpoint is set correctly at the regional level. Use the regional endpoint format:
sts.<region>.amazonaws.com
- Monitor the STS calls to confirm that the changes have resolved the throttling issue (see the sketch after this list).
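To monitor AssumeRole traffic after making these changes, one option is to count recent AssumeRole events recorded by CloudTrail. The following Python sketch uses boto3 and assumes CloudTrail is enabled in the account and that AWS credentials are configured in your environment; us-west-2 is a placeholder region.
import boto3
from datetime import datetime, timedelta, timezone

# Count AssumeRole events recorded by CloudTrail over the last hour to
# verify that call volume has dropped after the configuration changes.
cloudtrail = boto3.client("cloudtrail", region_name="us-west-2")  # placeholder region
start = datetime.now(timezone.utc) - timedelta(hours=1)

count = 0
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "AssumeRole"}],
    StartTime=start,
):
    count += len(page["Events"])

print(f"AssumeRole events in the last hour: {count}")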
For further information, refer to the Configure a customer-managed VPC documentation.