Introduction
Depending on your use case, you may want to use both Docker Container Services (DCS) and Databricks Repos (AWS | Azure | GCP) at the same time. DCS does not work with Databricks Repos by default, but you can enable both with a custom init script.
If you have not installed an init script that configures DCS to work with Databricks Repos, you may see an error message when you try to start your cluster. This happens because the underlying filesystem is not accessible.
Without the init script, using a repo can fail with an error like the following:
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/databricks/python_shell/scripts/PythonShellImpl.py", line 935, in initStartingDirectory
    os.chdir(directory)
FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Repos/<username>/hello_world'
Instructions
You can use the example init script in this article to get DCS working with Databricks Repos.
This init script ensures that the goofy-dbr process is running, which keeps the filesystem accessible. The goofy-dbr process is a Databricks internal fork of goofys. In addition to ensuring that goofys can run on Databricks clusters, goofy-dbr adds support for Azure Data Lake Storage (ADLS) and Azure Blob Storage.
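If you want to confirm that the mounts and processes created by the init script are up, you can run a check along these lines from a notebook %sh cell or the web terminal on a running cluster. This is only an illustrative sketch; the mount points and log paths come from the init script shown below.

# Confirm the FUSE mounts created by the init script are active.
mountpoint /Workspace
mountpoint /dbfs

# Confirm the wsfs and goofys-dbr processes are running.
ps aux | grep -E '[w]sfs|[g]oofys-dbr'

# Inspect the logs written by the init script if something looks wrong.
tail -n 20 /databricks/data/logs/wsfs.log
tail -n 20 /databricks/data/logs/dbfs_fuse_stderr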
Create the init script
- Use the workspace file browser to create a new file (AWS | Azure | GCP) in your home directory. Call it repos.sh.
- Open the repos.sh file.
- Copy and paste this init script into repos.sh.
#!/bin/bash
set -o xtrace

source /databricks/spark/conf/spark-env.sh
export WSFS_ENABLE_DEBUG_LOG

# Mount the workspace filesystem (wsfs) at /Workspace in the background.
mkdir -p /Workspace
mkdir -p /databricks/data/logs/
nohup /databricks/spark/scripts/fuse/wsfs /Workspace > /databricks/data/logs/wsfs.log 2>&1 &

# Wait up to WAIT_TIMEOUT seconds for /Workspace to become a mountpoint.
WAIT_TIMEOUT=5
CHECK_INTERVAL=0.1
WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
until mountpoint -q /Workspace || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
  sleep $CHECK_INTERVAL
done

# Mount DBFS at /dbfs with goofys-dbr in the background.
mkdir -p /dbfs
nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other \
  --file-mode=0777 --dir-mode=0777 -o bg --http-timeout 120s \
  /: /dbfs > /databricks/data/logs/dbfs_fuse_stderr 2>&1 &

# Wait up to WAIT_TIMEOUT seconds for /dbfs to become a mountpoint.
WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
until mountpoint -q /dbfs || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
  sleep $CHECK_INTERVAL
done
- Close the file.
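If you prefer the command line to the workspace file browser, you can also upload repos.sh as a workspace file with the Databricks CLI. The exact subcommand and flags depend on your CLI version, so treat this as a sketch and check databricks workspace import --help first.

# Sketch only: upload repos.sh to your home directory as a workspace file.
# Assumes a newer unified Databricks CLI; flag names vary between versions.
databricks workspace import /Users/<your-username>/repos.sh --file ./repos.sh --format AUTO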
Configure the init script
Follow the documentation to configure a cluster-scoped init script (AWS | Azure | GCP) as a workspace file.
Specify the path to the init script. Since you created repos.sh in your home directory, the path should look like /Users/<your-username>/repos.sh.
After configuring the init script, restart the cluster.
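Once the cluster is back up, you can sanity-check the setup from a notebook attached to it. The repo path below is an illustrative placeholder; substitute your own username and repo name.

# Run in a notebook %sh cell on the restarted cluster (illustrative).
mountpoint /Workspace && ls /Workspace/Repos/<your-username>/

If the directory is not accessible, review the wsfs and goofys-dbr logs described earlier.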