Enable s3cmd for notebooks

Use an init script to enable s3cmd for use in notebooks.

Written by pavan.kumarchalamcharla

Last published at: May 16th, 2022

s3cmd is a command-line client that lets you perform AWS S3 operations from any machine.

s3cmd is not installed on Databricks clusters by default. You must install it via a cluster-scoped init script before it can be used.

The sample init script stores the path to a secret in an environment variable. You should store secrets in this fashion because these environment variables are not accessible from other programs running in Apache Spark.

Create the init script

Run this sample code in a notebook to create the init script in DBFS.

%python

dbutils.fs.put("dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh", """
#!/bin/bash
# Purpose: s3cmd installation and configuration

# Install the s3cmd client.
sudo apt-get -y install s3cmd

# Write the s3cmd configuration using the secret-backed environment
# variables (ACCESS_KEY and SECRET_KEY) configured on the cluster.
cat > /root/.s3cfg <<EOF
access_key = $ACCESS_KEY
secret_key = $SECRET_KEY
EOF

# Verify the installation by listing the buckets the credentials can access.
s3cmd ls

""", True)

Remember the path to the init script. You will need it when configuring your cluster.
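
Optionally, you can confirm that the file was written before attaching it to a cluster. This is a minimal check; the path placeholder is the same one used in the sample script above.

%python

# Confirm the init script exists and inspect its contents.
display(dbutils.fs.ls("dbfs:/databricks/<path-to-init-script>/"))
print(dbutils.fs.head("dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh"))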

Configure the init script

Follow the documentation to configure a cluster-scoped init script.

Specify the path to the init script. Use the same path that you used in the sample script (dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh).
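
If you define the cluster with a JSON specification (for example, via the Clusters API) instead of the UI, the init script reference is a sketch along these lines; the destination uses the same placeholder path:

"init_scripts": [
    {
        "dbfs": {
            "destination": "dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh"
        }
    }
]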

Add secret environment variables

Avoid storing secrets directly in your init script. Instead, store the path to a secret in a cluster environment variable (under Advanced options > Spark > Environment Variables in the cluster configuration).

ACCESS_KEY={{secrets/<scope-name>/<secret-name>}}
SECRET_KEY={{secrets/<scope-name>/<secret-name>}}
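
If the secret scope and secrets do not exist yet, you can create them from a machine with the Databricks CLI installed and configured. This is a sketch using the legacy CLI syntax; the scope name, key names, and values are placeholders, and the access key and secret key should be stored under separate secret names:

databricks secrets create-scope --scope <scope-name>
databricks secrets put --scope <scope-name> --key <access-key-name> --string-value <aws-access-key-id>
databricks secrets put --scope <scope-name> --key <secret-key-name> --string-value <aws-secret-access-key>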

After you have configured the environment variables, your init script can use them.

Restart the cluster

After configuring the init script, restart the cluster.

You can now use s3cmd in notebooks with the %sh magic command.
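
For example, run a quick check from a notebook cell; the bucket name is a placeholder:

%sh

# List all buckets the configured credentials can access.
s3cmd ls

# List the contents of a specific bucket.
s3cmd ls s3://<bucket-name>/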