Set Apache Hadoop core-site.xml properties

Set Apache Hadoop core-site.xml properties in a Databricks cluster.

Written by arjun.kaimaparambilrajan

Last published at: March 4th, 2022

You have a scenario that requires Apache Hadoop properties to be set.

You would normally do this in the core-site.xml file.

In this article, we explain how you can set core-site.xml properties in a cluster.

Create the core-site.xml file in DBFS

You need to create a core-site.xml file and save it to DBFS, where your cluster can access it.

An easy way to create this file is via a bash script in a notebook.

This example code creates a hadoop-configs folder in DBFS and then writes a core-site.xml file containing a single property to that folder.

%sh

mkdir -p /dbfs/hadoop-configs/
cat << 'EOF' > /dbfs/hadoop-configs/core-site.xml
<property>
    <name><property-name-here></name>
    <value><property-value-here></value>
</property>
EOF

You can add multiple properties to the file by adding additional name/value pairs to the script.
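If you prefer to build the file from a notebook in Python instead of a bash heredoc, you can generate the name/value pairs from a dictionary. This is a sketch; the property names below are real Hadoop S3A settings used purely as illustrations, and the commented `dbutils.fs.put` call shows how you would write the result out on Databricks.

```python
# Illustrative Hadoop properties; replace with the properties you actually need.
props = {
    "fs.s3a.connection.maximum": "200",
    "fs.s3a.connection.timeout": "5000",
}

# Build one <property> block per name/value pair.
entries = "\n".join(
    f"<property>\n    <name>{name}</name>\n    <value>{value}</value>\n</property>"
    for name, value in props.items()
)

# On Databricks you would then write the file to DBFS, for example:
# dbutils.fs.put("dbfs:/hadoop-configs/core-site.xml", entries, True)
print(entries)
```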

You can also create this file locally, and then upload it to your cluster.
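Before uploading a hand-written file, it is worth checking that the fragment is well-formed XML, since a malformed fragment would corrupt the cluster's core-site.xml when the init script splices it in. A minimal sketch using the standard library (the wrapping `<configuration>` tags mirror the file the fragment is inserted into; the property name is illustrative):

```python
import xml.etree.ElementTree as ET

# The fragment you intend to upload (illustrative property name).
fragment = """
<property>
    <name>fs.s3a.connection.timeout</name>
    <value>5000</value>
</property>
"""

# Wrap in <configuration> tags, as it will appear in the final core-site.xml,
# and parse it; ET.fromstring raises ParseError if the XML is malformed.
root = ET.fromstring("<configuration>" + fragment + "</configuration>")
props = {p.findtext("name"): p.findtext("value") for p in root.findall("property")}
print(props)
```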

Create an init script that loads core-site.xml

This example code creates an init script called set-core-site-configs.sh that uses the core-site.xml file you just created.

If you manually uploaded a core-site.xml file and stored it elsewhere, you should update the config_xml value in the example code.

%python

dbutils.fs.put("/databricks/scripts/set-core-site-configs.sh", """
#!/bin/bash
   
echo "Setting core-site.xml configs at `date`"
 
START_DRIVER_SCRIPT=/databricks/spark/scripts/start_driver.sh
START_WORKER_SCRIPT=/databricks/spark/scripts/start_spark_slave.sh
 
TMP_DRIVER_SCRIPT=/tmp/start_driver_temp.sh
TMP_WORKER_SCRIPT=/tmp/start_spark_slave_temp.sh
 
TMP_SCRIPT=/tmp/set_core-site_configs.sh
 
config_xml="/dbfs/hadoop-configs/core-site.xml"

cat >"$TMP_SCRIPT" <<EOL
#!/bin/bash
## Setting core-site.xml configs

sed -i '/<\/configuration>/{
    r $config_xml
    a \</configuration>
    d
}' /databricks/spark/dbconf/hadoop/core-site.xml
 
EOL
cat "$TMP_SCRIPT" > "$TMP_DRIVER_SCRIPT"
cat "$TMP_SCRIPT" > "$TMP_WORKER_SCRIPT"
 
cat "$START_DRIVER_SCRIPT" >> "$TMP_DRIVER_SCRIPT"
mv "$TMP_DRIVER_SCRIPT" "$START_DRIVER_SCRIPT"
 
cat "$START_WORKER_SCRIPT" >> "$TMP_WORKER_SCRIPT"
mv "$TMP_WORKER_SCRIPT" "$START_WORKER_SCRIPT"
 
echo "Completed core-site.xml config changes `date`" 
 
""", True)

Attach the init script to your cluster

You need to configure the newly created init script as a cluster-scoped init script.

If you used the example code, your Destination is DBFS and the Init Script Path is dbfs:/databricks/scripts/set-core-site-configs.sh.

If you customized the example code, ensure that you enter the correct path and name of the init script when you attach it to the cluster.
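If you manage clusters through the Clusters API rather than the UI, the equivalent configuration is an `init_scripts` entry in the cluster spec. This is a sketch of the relevant excerpt only (merge it into your full cluster specification); the shape follows the Databricks Clusters API's DBFS init script destination:

```python
# Excerpt of a cluster spec showing only the init_scripts field.
cluster_spec = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/scripts/set-core-site-configs.sh"}}
    ]
}
print(cluster_spec)
```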