Error when running MSCK REPAIR TABLE in parallel

Do not run `MSCK REPAIR` commands in parallel. It results in a read timed out or out of memory error message.

Written by ashritha.laxminarayana

Last published at: May 23rd, 2022

Problem

You are trying to run MSCK REPAIR TABLE <table-name> commands for the same table in parallel and are getting java.net.SocketTimeoutException: Read timed out or out of memory error messages.

Cause

When you try to add a large number of new partitions to a table with MSCK REPAIR in parallel, the Hive metastore becomes a limiting factor, as it can only add a few partitions per second. The greater the number of new partitions, the more likely that a query will fail with a java.net.SocketTimeoutException: Read timed out error or an out of memory error message.

Solution

You should not attempt to run multiple MSCK REPAIR TABLE <table-name> commands in parallel.

Databricks uses multiple threads for a single MSCK REPAIR by default, which splits createPartitions() into batches. By limiting the number of partitions created, it prevents the Hive metastore from timing out or hitting an out of memory error. It also gathers the fast stats (number of files and the total size of files) in parallel, which avoids the bottleneck of listing the metastore files sequentially. This is controlled by spark.sql.gatherFastStats, which is enabled by default.