Spark has a configurable metrics system that supports a number of sinks, including CSV files.
This article shows you how to configure a Databricks cluster to use a CSV sink and persist those metrics to a DBFS location.
Create an init script
All of the configuration is done in an init script.
The init script does the following three things:
- Configures the cluster to generate CSV metrics on both the driver and the workers.
- Writes the CSV metrics to a temporary, local folder.
- Uploads the CSV metrics from the temporary, local folder to the chosen DBFS location.
Customize the sample code and then run it in a notebook to create an init script on your cluster.
Sample code to create an init script:
```
%python
dbutils.fs.put("/<init-path>/metrics.sh","""
#!/bin/bash

# Local folder where the CSV sink writes metrics before they are copied to DBFS.
mkdir /tmp/csv

# Configure the CSV sink and JVM source for the master and worker daemons.
sudo bash -c "cat <<EOF >> /databricks/spark/dbconf/log4j/master-worker/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
master.source.jvm.class org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"

# Configure the CSV sink and JVM source for the driver and executors.
sudo bash -c "cat <<EOF >> /databricks/spark/conf/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
driver.source.jvm.class org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"

# Background script that copies the local CSV metrics to DBFS every 5 seconds.
cat <<'EOF' >> /tmp/asynccode.sh
#!/bin/bash
DB_CLUSTER_ID=$(echo $HOSTNAME | awk -F '-' '{print$1"-"$2"-"$3}')
MYIP=$(hostname -I)
if [[ ! -d /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP} ]] ; then
  sudo mkdir -p /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP}
fi
while true; do
  if [ -d "/tmp/csv" ]; then
    sudo cp -r /tmp/csv/* /dbfs/<metrics-path>/$DB_CLUSTER_ID/metrics-$MYIP
  fi
  sleep 5
done
EOF

chmod a+x /tmp/asynccode.sh
/tmp/asynccode.sh & disown
""", True)
```
Replace <init-path> with the DBFS location you want to use to save the init script.
Replace <metrics-path> with the DBFS location you want to use to save the CSV metrics.
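After running the sample code, you can confirm that the init script was written to DBFS by printing its contents from the notebook. This is a minimal sketch that assumes you have already substituted your own `<init-path>`:

```
%python
# Replace <init-path> with the DBFS location you used when creating the init script.
# Prints the first bytes of the generated init script so you can confirm it exists.
print(dbutils.fs.head("/<init-path>/metrics.sh"))
```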
Cluster-scoped init script
Once you have created the init script on your cluster, you must configure it as a cluster-scoped init script.
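If you create or edit the cluster with the Clusters API rather than the UI, the cluster-scoped init script is declared in the cluster specification. The following is a minimal sketch of the relevant fragment only, assuming a DBFS-backed init script location; the rest of the cluster spec is omitted:

```
# Sketch of the init_scripts fragment of a cluster spec for a Clusters API create/edit request.
# Replace <init-path> with the DBFS location where metrics.sh was saved.
init_scripts_fragment = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/<init-path>/metrics.sh"}}
    ]
}
```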
Verify that CSV metrics are correctly written
Restart your cluster and run a sample job.
Check the DBFS location that you configured for CSV metrics and verify that they were correctly written.
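For a quick check from a notebook, you can list the metrics folders and load the CSV files into a DataFrame. This is a minimal sketch, assuming the `<metrics-path>` location configured in the init script; `<cluster-id>` and `<node-ip>` stand in for the folder names the background script creates on each node:

```
%python
# Replace <metrics-path> with the DBFS location configured in the init script.
# The background script creates one folder per node: <metrics-path>/<cluster-id>/metrics-<node-ip>/
display(dbutils.fs.ls("dbfs:/<metrics-path>/"))

# Each metric is written to its own CSV file, and the columns vary by metric type,
# so adjust the path after listing the folder to inspect the files you care about.
df = (spark.read
      .option("header", "true")
      .csv("dbfs:/<metrics-path>/<cluster-id>/metrics-<node-ip>/*.csv"))
display(df)
```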