Problem
Your Apache Spark job, which involves several metastore-related operations such as ALTER or MSCK REPAIR, runs longer after migrating from all-purpose compute to a job cluster. You also notice a gap between Spark job executions. When you analyze the thread dump, you find threads stuck at HiveClientImpl.
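Before looking at the thread dump, the statements below show the kinds of metastore-related operations meant here; they are hypothetical examples with placeholder table and partition names, and each one triggers one or more Hive metastore calls through the pooled Hive client.

// Hypothetical examples only; table and partition names are placeholders.
spark.sql("ALTER TABLE sales ADD IF NOT EXISTS PARTITION (event_date = '2024-01-01')")
spark.sql("MSCK REPAIR TABLE sales")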
Sample thread
Thread-119" #277 daemon prio=5 os_prio=0 tid=XXXX nid=XXX waiting on condition [XXX]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <XXXX> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
at org.spark_project.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:1323)
at org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)
at org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.super$borrowObject(LocalHiveClientImpl.scala:124)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.$anonfun$borrowObject$1(LocalHiveClientImpl.scala:124)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool$$Lambda$5190/556896297.apply(Unknown Source)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:410)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:396)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.borrowObject(LocalHiveClientImpl.scala:122)
at org.apache.spark.sql.hive.client.PoolingHiveClient.retain(PoolingHiveClient.scala:181)
at org.apache.spark.sql.hive.HiveExternalCatalog.maybeSynchronized(HiveExternalCatalog.scala:113)
…
Cause
The Hive client pool size on a job cluster is limited to 1, compared to 20 on all-purpose compute. With only one pooled client, each thread that needs the metastore must wait to borrow the single HiveClientImpl instance, so metastore calls execute one at a time. This becomes a bottleneck when the job performs many metastore operations.
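To illustrate the blocking behavior, here is a minimal, self-contained Scala sketch using Apache Commons Pool 2, the pooling library visible in the stack trace. It is not Databricks code: FakeHiveClient and the sleep are stand-ins showing how a pool capped at one object forces the second borrower to park inside borrowObject until the first caller returns its client.

import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}

object PoolBottleneckSketch extends App {
  // Stand-in for the pooled Hive metastore client (the real pool holds HiveClientImpl instances).
  final class FakeHiveClient

  val factory = new BasePooledObjectFactory[FakeHiveClient] {
    override def create(): FakeHiveClient = new FakeHiveClient
    override def wrap(c: FakeHiveClient): PooledObject[FakeHiveClient] = new DefaultPooledObject(c)
  }

  val pool = new GenericObjectPool(factory)
  pool.setMaxTotal(1) // the job cluster default described above

  val slowCaller = new Thread(() => {
    val client = pool.borrowObject()
    try Thread.sleep(2000) // simulate a slow metastore call holding the only client
    finally pool.returnObject(client)
  })

  val waitingCaller = new Thread(() => {
    val start = System.nanoTime()
    val client = pool.borrowObject() // parks here, like the WAITING thread in the dump
    pool.returnObject(client)
    println(f"second caller waited ${(System.nanoTime() - start) / 1e9}%.1f s for a client")
  })

  slowCaller.start(); Thread.sleep(100); waitingCaller.start()
  slowCaller.join(); waitingCaller.join()
  pool.close()
}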
Solution
In your job cluster settings, expand the Advanced options accordion, select the Spark tab, and add the following configurations to the Spark config box. This increases the Hive client pool size to match the previous all-purpose compute setting.
spark.databricks.hive.metastore.client.pool.size 20
spark.databricks.clusterSource API
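These values take effect when the job cluster starts. As a quick sanity check, a task running on the cluster can read them back from the Spark configuration; the snippet below assumes the standard spark session available in Databricks jobs and notebooks.

// Read the cluster-level Spark configs back; the second argument is a fallback
// returned when the key is not set on the cluster.
val poolSize = spark.sparkContext.getConf.get("spark.databricks.hive.metastore.client.pool.size", "<not set>")
val clusterSource = spark.sparkContext.getConf.get("spark.databricks.clusterSource", "<not set>")
println(s"hive client pool size = $poolSize, clusterSource = $clusterSource")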