Depending on your use case, you may want to use both Docker Container Services (DCS) and Databricks Repos (AWS | Azure | GCP) at the same time. DCS does not work with Databricks Repos by default; however, you can use a custom init script to enable both.
If you have not installed an init script to configure DCS with Databricks Repos, you may see an error message when you try to start your cluster. This happens because the underlying filesystem becomes inaccessible.

For example, if you try to use a repo without the init script in place, you may see an error like this:
```
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message:
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/databricks/python_shell/scripts/PythonShellImpl.py", line 935, in initStartingDirectory
    os.chdir(directory)
FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Repos/<username>/hello_world'
```
You can use the example init script in this article to get DCS working with Databricks Repos.
This init script ensures that the goofy-dbr process is running correctly, which keeps the filesystem accessible. The goofy-dbr process is a Databricks internal fork of goofys. Databricks' goofy-dbr adds support for Azure Data Lake Storage (ADLS) and Azure Blob Storage to goofys, and ensures that goofys can run on Databricks clusters.
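After starting each FUSE process, the init script in this article does not assume the mount is immediately available; it polls the mount point until it appears or a timeout expires. As a rough, self-contained sketch of that poll-with-timeout pattern (a temporary file stands in for the `mountpoint -q` check, and the background `touch` simulates the mount appearing asynchronously):

```shell
#!/bin/sh
# Stand-in for a mount point appearing: a temp file path that a
# background job creates shortly after we start waiting.
TARGET=$(mktemp -u)
( sleep 0.3; touch "$TARGET" ) &

WAIT_TIMEOUT=5        # give up after this many seconds
CHECK_INTERVAL=0.1    # re-check this often
WAIT_UNTIL=$(($(date +%s) + WAIT_TIMEOUT))

# Poll until the target exists or the deadline passes,
# exactly as the init script does with `mountpoint -q`.
until [ -e "$TARGET" ] || [ "$(date +%s)" -ge "$WAIT_UNTIL" ]; do
  sleep "$CHECK_INTERVAL"
done

if [ -e "$TARGET" ]; then STATUS=ready; else STATUS=timeout; fi
echo "$STATUS"
rm -f "$TARGET"
```

The deadline check means the script never blocks cluster startup for more than `WAIT_TIMEOUT` seconds, even if the mount never appears.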
Create the init script
- Ensure that you have a directory to store your init scripts. If you do not have one, create one.
- Use this sample code to create an init script called repo.sh on your cluster. Replace `<init-script-folder>` with the location of your init scripts.
```scala
%scala
dbutils.fs.put("dbfs:/databricks/<init-script-folder>/repo.sh", """
#!/bin/bash
set -o xtrace

source /databricks/spark/conf/spark-env.sh
export WSFS_ENABLE_DEBUG_LOG

mkdir -p /Workspace
mkdir -p /databricks/data/logs/
nohup /databricks/spark/scripts/fuse/wsfs /Workspace > /databricks/data/logs/wsfs.log 2>&1 &

WAIT_TIMEOUT=5
CHECK_INTERVAL=0.1
WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
until mountpoint -q /Workspace || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
  sleep $CHECK_INTERVAL
done

mkdir -p /dbfs
nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other \
  --file-mode=0777 --dir-mode=0777 -o bg --http-timeout 120s \
  /: /dbfs > /databricks/data/logs/dbfs_fuse_stderr 2>&1 &

WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
until mountpoint -q /dbfs || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
  sleep $CHECK_INTERVAL
done
""", true)
```
- Verify that the init script was successfully created on your cluster.
- Make sure to record the full path to the init script. You will need it when you configure the init script.
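One quick way to verify is from a notebook cell once the cluster is up. The commented commands below use the example path from this article; the final lines are a runnable stand-in that just demonstrates the `mountpoint` check itself against `/proc`, which is mounted on any Linux system:

```shell
#!/bin/sh
# In a %sh notebook cell after the cluster starts, you could run checks such as:
#   ls /dbfs/databricks/<init-script-folder>/repo.sh    # init script exists
#   mountpoint /Workspace && mountpoint /dbfs           # both FUSE mounts are live
# Runnable stand-in for the mount check: /proc is always a mount point on Linux.
if mountpoint -q /proc; then STATUS=mounted; else STATUS=missing; fi
echo "$STATUS"
```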
Configure the init script
In your cluster configuration, specify the path to the init script. Use the same path (dbfs:/databricks/<init-script-folder>/repo.sh) that you used in the sample script.
After configuring the init script, restart the cluster.