Jobs running longer than expected with 'Metastore_Down' events in event log

Run the VACUUM command to remove stale files, adjust the catalog update thread pool size in Databricks Runtime 14.3 LTS and above, or, for read-only metastore databases, disable Delta catalog updates.

Written by manikandan.ganesan

Last published at: January 29th, 2025

Problem

You have Databricks jobs that run longer than expected. When you check the event log, you see a Metastore_Down event. This happens when you use the Hive metastore or an external metastore such as AWS Glue.

When you analyze the thread dump, you find threads stuck at delta-catalog-update.

 

Sample thread

 

delta-catalog-update-8" #518 daemon prio=5 os_prio=0 tid=xxx nid=xxx waiting on condition [xxx]
  java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <xxx> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
at org.spark_project.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:590)
at org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:432)
at org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:349)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.super$borrowObject(LocalHiveClientImpl.scala:124)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.$anonfun$borrowObject$1(LocalHiveClientImpl.scala:124)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool$$Lambda$5460/xxx.apply(Unknown Source)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:394)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.borrowObject(LocalHiveClientImpl.scala:122)
at org.apache.spark.sql.hive.client.PoolingHiveClient.retain(PoolingHiveClient.scala:181)
at org.apache.spark.sql.hive.HiveExternalCatalog.maybeSynchronized(HiveExternalCatalog.scala:110)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$withClient$1(HiveExternalCatalog.scala:150)
at org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$5186/xxx.apply(Unknown Source)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:394)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:149)
at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:1027)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:154)
at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.tableExists(SessionCatalog.scala:936)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.tableExists(ManagedCatalogSessionCatalog.scala:763)
at com.databricks.sql.transaction.tahoe.hooks.UpdateCatalog.tableStillExists$1(UpdateCatalog.scala:112)

 

Cause

This happens when catalog update operations saturate the Hive client thread pool. The delta-catalog-update threads can exhaust all available Hive client connections, which blocks other query operations and causes jobs to hang. This usually occurs when table metadata in the catalog is updated, for example through an ALTER TABLE command.

 

Solution

There are three options to try, depending on your situation.

 

Run the VACUUM command 

  1. Check whether the table has a large number of files.
  2. Periodically run a vacuum on Delta tables to remove stale and unreferenced files, which can help reduce the load on the metastore, as shown in the example below.
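
As a minimal sketch, assuming a Delta table named my_db.my_table (a hypothetical name used only for illustration), you can check the file count and then run VACUUM from a notebook. DESCRIBE DETAIL reports the number of data files in the table, and VACUUM removes unreferenced files older than the retention threshold (seven days by default).

# Check how many data files the Delta table currently has (my_db.my_table is a placeholder)
spark.sql("DESCRIBE DETAIL my_db.my_table").select("numFiles").show()

# Remove files that are no longer referenced by the table and are older than the default retention period
spark.sql("VACUUM my_db.my_table")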

 

Adjust catalog update thread pool size

In Databricks Runtime 14.3 LTS and above, you can control the size of the thread pool used to update the catalog. To set this configuration, adjust spark.databricks.delta.catalog.update.threadPoolSize to a value less than the default of 20.

 

spark.databricks.delta.catalog.update.threadPoolSize <value-less-than-20>
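
For example, a value of 10 (illustrative only) can be set from a notebook as sketched below, assuming the configuration accepts a session-level override; otherwise, add the same key-value pair to your cluster's Spark config so it is applied at cluster startup.

# Sketch: lower the Delta catalog update thread pool size below the default of 20.
# The value 10 is illustrative only; choose a value that fits your workload.
spark.conf.set("spark.databricks.delta.catalog.update.threadPoolSize", "10")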

 

Disable Delta catalog updates

If you’re using a read-only metastore database, Databricks recommends setting the following configuration on your clusters. This configuration controls whether the most recent schema and table properties of a Delta table are synced to the Hive metastore (or any other external catalog) so that the two stay consistent.

 

spark.databricks.delta.catalog.update.enabled false
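
As a rough sketch, the same setting can be applied or verified from a notebook; for it to apply cluster-wide, add the key-value pair above to your cluster's Spark config instead.

# Sketch: disable Delta catalog updates for the current session (read-only metastores only)
spark.conf.set("spark.databricks.delta.catalog.update.enabled", "false")

# Verify the current value of the flag
print(spark.conf.get("spark.databricks.delta.catalog.update.enabled"))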

 

Important

If other systems access your external metastore for this table’s schema or table properties, do not use this option. Keep spark.databricks.delta.catalog.update.enabled set to true so the metadata stays in sync.