Persist Apache Spark CSV metrics to a DBFS location

Configure a CSV metrics sink and persist Apache Spark metrics to a DBFS location.

Written by Adam Pavlacka

Last published at: March 4th, 2022

Spark has a configurable metrics system that supports a number of sinks, including CSV files.

This article shows you how to configure a Databricks cluster to use a CSV sink and persist those metrics to a DBFS location.

Create an init script

All of the configuration is done in an init script.

The init script does the following three things:

  1. Configures the cluster to generate CSV metrics on both the driver and the workers.
  2. Writes the CSV metrics to a temporary, local folder.
  3. Uploads the CSV metrics from the temporary, local folder to the chosen DBFS location.

Note

The CSV metrics are saved locally before being uploaded to the DBFS location because DBFS is not designed for a large number of random writes.

Customize the sample code and then run it in a notebook to create an init script on your cluster.

Sample code to create an init script:

%python

dbutils.fs.put("/<init-path>/metrics.sh","""
#!/bin/bash
# Create the temporary local folder that the CSV sink writes to.
mkdir /tmp/csv

# Configure the CSV sink for master/worker metrics.
sudo bash -c "cat <<EOF >> /databricks/spark/dbconf/log4j/master-worker/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
master.source.jvm.class org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"

# Configure the CSV sink for driver and executor metrics.
sudo bash -c "cat <<EOF >> /databricks/spark/conf/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
driver.source.jvm.class org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"

# Write a helper script that periodically copies the local CSV metrics to DBFS.
cat <<'EOF' >> /tmp/asynccode.sh
#!/bin/bash
# Derive the cluster ID from the hostname and record this node's IP address.
DB_CLUSTER_ID=$(echo $HOSTNAME | awk -F '-' '{print$1"-"$2"-"$3}')
MYIP=$(hostname -I)
# Create a per-node folder in the DBFS metrics location if it does not already exist.
if [[ ! -d /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP} ]] ; then
  sudo mkdir -p /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP}
fi
# Copy the local CSV metrics to the DBFS location every 5 seconds.
while true; do
  if [ -d "/tmp/csv" ]; then
    sudo cp -r /tmp/csv/* /dbfs/<metrics-path>/$DB_CLUSTER_ID/metrics-$MYIP
  fi
  sleep 5
done
EOF

# Run the helper script in the background so it keeps uploading after the init script exits.
chmod a+x /tmp/asynccode.sh
/tmp/asynccode.sh & disown
""", True)

Replace <init-path> with the DBFS location you want to use to save the init script.

Replace <metrics-path> with the DBFS location you want to use to save the CSV metrics.
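
For example, if you use the hypothetical locations /databricks/init-scripts for the init script and /cluster-metrics for the metrics output, you can confirm that the init script was written with a quick check in a notebook:

%python

# Hypothetical path -- use the <init-path> value you chose above.
init_path = "/databricks/init-scripts"

# List the folder and print the script contents to confirm the upload.
display(dbutils.fs.ls(init_path))
print(dbutils.fs.head(init_path + "/metrics.sh"))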

Cluster-scoped init script

Once you have created the init script on your cluster, you must configure it as a cluster-scoped init script.
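
If you manage clusters with the Clusters API instead of the UI, the init script is referenced in the cluster specification. The fragment below is a sketch only; the DBFS destination path is a hypothetical example based on the <init-path> you chose.

%python

# Sketch of the init_scripts fragment in a Clusters API cluster specification.
# The destination is a hypothetical example -- use dbfs:/<init-path>/metrics.sh.
init_scripts_fragment = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/init-scripts/metrics.sh"}}
    ]
}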

Verify that CSV metrics are correctly written

Restart your cluster and run a sample job.

Check the DBFS location that you configured for CSV metrics and verify that they were correctly written.
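
As a quick spot check from a notebook, you can list the per-node metrics folder and load one of the CSV files. The path below is a hypothetical example; substitute your <metrics-path>, cluster ID, and node IP.

%python

# Hypothetical values -- substitute your <metrics-path>, cluster ID, and node IP.
metrics_path = "/cluster-metrics/0123-456789-abcde1/metrics-10.0.0.5"

# List the CSV files that the init script copied to DBFS.
files = dbutils.fs.ls(metrics_path)
display(files)

# Load the first CSV file found into a DataFrame for a quick look.
df = spark.read.option("header", "true").csv(files[0].path)
display(df)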