Autoscaling is slow with an external metastore

Improve autoscaling performance by installing the metastore jars only on the driver.

Written by Gobinath.Viswanathan

Last published at: May 16th, 2022

Problem

You have an external metastore configured on your cluster and autoscaling is enabled, but the cluster is not autoscaling effectively.

Cause

You are copying the metastore jars to every executor, when they are only needed on the driver.

It takes time to copy and initialize the jars every time a new executor spins up. As a result, adding executors takes longer than it should.

Solution

You should configure your cluster so the metastore jars are only copied to the driver.

Option 1: Use an init script to copy the metastore jars.

  1. Create a cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version set to match the version of your metastore (a sample Spark config is sketched after these steps).
  2. Start the cluster and search the driver logs for a line that includes Downloaded metastore jars to.
    17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
    <path> is the location of the downloaded jars in the driver node of the cluster.
  3. Copy the jars to a DBFS location.
    %sh
    
    cp -r <path> /dbfs/ExternalMetaStore_jar_location
  4. Create the init script.
    %python
    
    dbutils.fs.put("dbfs:/databricks/<init-script-folder>/external-metastore-jars-to-driver.sh",
    """
    #!/bin/bash
    # Copy the metastore jars to local disk on the driver node only
    if [[ "$DB_IS_DRIVER" = "TRUE" ]]; then
      mkdir -p /databricks/metastorejars/
      cp -r /dbfs/ExternalMetaStore_jar_location/* /databricks/metastorejars/
    fi""", True)
  5. Install the init script that you just created as a cluster-scoped init script (AWS | Azure | GCP).
  6. When configuring the init script, provide the full path to the script (dbfs:/databricks/<init-script-folder>/external-metastore-jars-to-driver.sh).
  7. Restart the cluster.
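
A sketch of the corresponding cluster Spark config, assuming the init script copies the jars to /databricks/metastorejars/ as shown above and a Hive 2.3.7 metastore (substitute your own metastore version):
    spark.sql.hive.metastore.version 2.3.7
    spark.sql.hive.metastore.jars /databricks/metastorejars/*
Because spark.sql.hive.metastore.jars points at a local path on the driver, only the driver loads the jars; new executors no longer copy or initialize them.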

Option 2: Use the Apache Spark configuration settings to copy the metastore jars to the driver.

  • Enter the following settings into your Spark config (AWS | Azure | GCP):
    spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<mysql-host>:<mysql-port>/<metastore-db>
    spark.hadoop.javax.jdo.option.ConnectionDriverName <driver>
    spark.hadoop.javax.jdo.option.ConnectionUserName <mysql-username>
    spark.hadoop.javax.jdo.option.ConnectionPassword <mysql-password>
    spark.sql.hive.metastore.version <hive-version>
    spark.sql.hive.metastore.jars /dbfs/metastore/jars/*
  • The source path can be external mounted storage or DBFS.
  • The metastore configuration can be applied globally within the workspace by using cluster policies (AWS | Azure | GCP).
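    As a minimal sketch, a cluster policy that pins these settings could look like the following (the values are placeholders; cluster policy definitions prefix Spark settings with spark_conf.):
    {
      "spark_conf.spark.sql.hive.metastore.version": { "type": "fixed", "value": "<hive-version>" },
      "spark_conf.spark.sql.hive.metastore.jars": { "type": "fixed", "value": "/dbfs/metastore/jars/*" }
    }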

Option 3: Build a custom Databricks container with preloaded jars on AWS or Azure.

Review the documentation on customizing containers with Databricks Container Services.
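
As an illustrative sketch only (the base image tag and jar locations below are assumptions, not taken from that documentation), the Dockerfile for such a container preloads the metastore jars into the image so nothing has to be copied when the cluster starts:
    # Hypothetical sketch: base image published for Databricks Container Services
    FROM databricksruntime/standard:latest
    # Preload the metastore jars so no copy is needed at cluster startup
    COPY metastore-jars/ /databricks/metastorejars/
The cluster Spark config would then point spark.sql.hive.metastore.jars at /databricks/metastorejars/*, as in the options above.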
