s3cmd is a command-line client that allows you to perform AWS S3 operations from any machine.
s3cmd is not installed on Databricks clusters by default. You must install it via a cluster-scoped init script before it can be used.
The sample init script reads its credentials from environment variables that hold the path to a secret. You should store secrets in this fashion because these environment variables are not accessible from other programs running in Apache Spark.
Run this sample script in a notebook to create the init script on your cluster.
dbutils.fs.put("dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh",""" #!/bin/bash # Purpose: s3cmd installation and configuration sudo apt-get -y install s3cmd cat > /root/.s3cfg <<EOF access_key = $ACCESS_KEY secret_key = $SECRET_KEY EOF s3cmd ls """,True)
Remember the path to the init script. You will need it when configuring your cluster.
Follow the documentation to configure a cluster-scoped init script.
Specify the path to the init script. Use the same path that you used in the sample script (dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh).
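If you manage clusters through the Clusters API rather than the UI, the same path goes in the init_scripts field of the cluster specification. A minimal fragment, assuming a DBFS-backed script at the path above, looks roughly like this:

"init_scripts": [
  { "dbfs": { "destination": "dbfs:/databricks/<path-to-init-script>/s3cmd-init.sh" } }
]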
Avoid storing secrets directly in your init script. Instead, store the path to a secret in an environment variable.
After you have configured the environment variables, your init script can use them.
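For example, if the AWS keys are stored in a Databricks secret scope, the cluster's environment variables can reference them with the {{secrets/<scope>/<key>}} syntax. The scope and key names below are placeholders; the variable names match the ones the sample init script reads ($ACCESS_KEY and $SECRET_KEY):

ACCESS_KEY={{secrets/<scope-name>/<access-key-name>}}
SECRET_KEY={{secrets/<scope-name>/<secret-key-name>}}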
After configuring the init script, restart the cluster.
You can now use s3cmd in notebooks with the %sh magic command.
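For example, a cell like the following lists your buckets and downloads an object to local storage on the driver; the bucket and object names are placeholders:

%sh
s3cmd ls
s3cmd get s3://<bucket-name>/<object-name> /tmp/<object-name>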