s3cmd is a command-line client that lets you perform AWS S3 operations, such as listing, uploading, and downloading objects, from any machine.
s3cmd is not installed on Databricks clusters by default. You must install it via a cluster-scoped init script before it can be used.
Create the init script
Run this sample code in a notebook cell to create the init script in DBFS.
```
%python

dbutils.fs.put("dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh","""
#!/bin/bash

# Purpose: s3cmd installation and configuration

# Install s3cmd on the cluster node.
sudo apt-get -y install s3cmd

# Write the s3cmd configuration file. $ACCESS_KEY and $SECRET_KEY are
# expanded at cluster startup from the environment variables you
# configure in a later step.
cat > /root/.s3cfg <<EOF
access_key = $ACCESS_KEY
secret_key = $SECRET_KEY
EOF

# Verify that s3cmd is installed and the credentials work.
s3cmd ls
""",True)
```
Remember the path to the init script. You will need it when configuring your cluster.
Configure the init script
Follow the documentation to configure a cluster-scoped init script.
Specify the path to the init script. Use the same path that you used in the sample script (dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh).
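If you manage clusters programmatically, you can attach the same init script through the Clusters API instead of the UI. The following is a minimal sketch, not a complete request; `<databricks-instance>` and `<cluster-id>` are placeholders, and a real `clusters/edit` call must also include the cluster's full existing configuration.

```
curl -n -X POST https://<databricks-instance>/api/2.0/clusters/edit \
  -d '{
    "cluster_id": "<cluster-id>",
    "init_scripts": [
      { "dbfs": { "destination": "dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh" } }
    ]
  }'
```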
Add secret environment variables
Avoid storing secrets directly in your init script. Instead, store the path to a secret in a cluster environment variable:
```
ACCESS_KEY={{secrets/<scope-name>/<secret-name>}}
SECRET_KEY={{secrets/<scope-name>/<secret-name>}}
```
After you have configured the environment variables, your init script can use them.
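If the secret scope and secrets do not exist yet, you can create them with the Databricks CLI. This is a minimal sketch, assuming the legacy Databricks CLI is installed and authenticated; the scope and key names are placeholders.

```
# Create a scope to hold the AWS credentials
databricks secrets create-scope --scope <scope-name>

# Store the access key and secret key as separate secrets
databricks secrets put --scope <scope-name> --key <access-key-name> --string-value "<aws-access-key>"
databricks secrets put --scope <scope-name> --key <secret-key-name> --string-value "<aws-secret-key>"
```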
Restart the cluster
After configuring the init script and environment variables, restart the cluster so the script runs at startup.
You can now use s3cmd in notebooks with the %sh magic command.
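For example, this notebook cell lists all buckets visible to the configured credentials and then the contents of one bucket; the bucket name is a placeholder.

```
%sh
# Confirm s3cmd is configured by listing all visible buckets
s3cmd ls

# List the contents of a specific bucket
s3cmd ls s3://<bucket-name>
```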