Apache Spark is configured to suppress INFO statements, but they overwhelm logs anyway

Modify your log4j2 configuration file directly within the Databricks environment.

Written by raahat.varma

Last published at: September 12th, 2024

Problem

You receive INFO statements despite configuring the Apache Spark settings to suppress INFO and emit only specific WARN statements. This issue occurs even after setting the 'py4j' logger to WARN and setting the logging level to WARN in the Spark config in the Databricks UI.

The problem persists, leading to a flood of INFO logs, which can be problematic when integrating with monitoring tools like DataDog.

Cause

Configuring Spark settings to suppress INFO logs does not override the default log4j2 settings on the Databricks cluster, which control logging behavior at a more granular level. These default log4j2 settings may still allow INFO logs to be generated.
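
For context, log4j2 derives log levels from logger definitions in the cluster's log4j2.xml file. The snippet below is a simplified illustration of that file's structure, not the exact Databricks file; the level="INFO" attribute on the root logger is what permits INFO output, and it is this attribute that the init script in the Solution rewrites.

<Configuration status="WARN">
  <Appenders>
    <Console name="console" target="SYSTEM_ERR">
      <PatternLayout pattern="%d{yy/MM/dd HH:mm:ss} %p %c: %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- level="INFO" here permits INFO logging; the init script changes it to WARN -->
    <Root level="INFO">
      <AppenderRef ref="console"/>
    </Root>
  </Loggers>
</Configuration>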

Additionally, the integration with DataDog may not respect the Spark configuration settings, leading to the continued generation of INFO logs.

Solution

Modify the log4j2 configuration file directly within the Databricks environment. 

1. Use an init script that updates the log4j2.xml file to suppress INFO logs, such as the following:

#!/bin/bash
set -e  # Exit script on any error

# Define the log4j2 configuration file path (modify if needed)
LOG4J2_PATH="/databricks/spark/dbconf/log4j/driver/log4j2.xml"

# Replace every level="INFO" with level="WARN" in the log4j2 configuration file
echo "Updating log4j2.xml to suppress INFO logs"
sed -i 's/level="INFO"/level="WARN"/g' "$LOG4J2_PATH"
echo "Completed log4j2 config changes at $(date)"
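
After the cluster starts with this script, you can optionally confirm the substitution from a notebook %sh cell or the web terminal. This is a sketch assuming the same driver path as LOG4J2_PATH above:

# Count level attributes after the init script has run.
# Expect 0 for INFO (grep exits nonzero when the count is 0) and a positive count for WARN.
grep -c 'level="INFO"' /databricks/spark/dbconf/log4j/driver/log4j2.xml
grep -c 'level="WARN"' /databricks/spark/dbconf/log4j/driver/log4j2.xml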


2. Upload the init script to Workspace Files. (Create a .sh file in your workspace files folder, add the contents of the script to it, and use it as the init script on the cluster. You can also upload it from a terminal, as shown below.)
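
If you prefer a terminal over the UI, you can upload the file with the Databricks CLI. This is a sketch assuming the CLI is installed and authenticated; flag names vary between CLI versions, and <your-workspace-folder> is the same placeholder used in the next step:

# Import the local script into Workspace Files (adjust flags to your CLI version)
databricks workspace import /Users/<your-workspace-folder>/log4j_warn.sh \
  --file ./log4j_warn.sh --format AUTO --overwrite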


3. Configure the cluster to use the init script by setting it in the Init Scripts tab under the cluster's Advanced options:

"destination": "Workspace"
"/Users/<your-workspace-folder>/log4j_warn.sh"


4. Restart the cluster to apply the changes.
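
Restarting can also be done from a terminal with the Databricks CLI. This is a sketch where <cluster-id> is a placeholder for your cluster's ID; the exact invocation depends on your CLI version:

# Restart the cluster so the init script runs and the new log4j2 settings take effect
databricks clusters restart --cluster-id <cluster-id>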