Use Databricks Repos with Docker container services

Configure your cluster with a custom init script to use Databricks Repos with Docker container services.

Written by darshan.bargal

Last published at: September 28th, 2022

Introduction

Depending on your use case, you may want to use both Docker Container Services (DCS) and Databricks Repos (AWS | Azure | GCP) at the same time. DCS does not work with Databricks Repos by default, but you can use a custom init script to enable both.

If you have not installed an init script to configure DCS with Databricks Repos, you may see an error message when you try to start your cluster. This happens because the underlying filesystem becomes inaccessible.

For example, using a repo without the init script can produce an error like this:

py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
File "/databricks/python_shell/scripts/PythonShellImpl.py", line 935, in initStartingDirectory
os.chdir(directory)
FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Repos/<username>/hello_world'

Instructions

You can use the example init script in this article to get DCS working with Databricks Repos.

This init script ensures that the goofys-dbr process is running, which keeps the filesystem accessible. The goofys-dbr process is a Databricks internal fork of goofys. It adds support for Azure Data Lake Storage (ADLS) and Azure Blob Storage, and it ensures that goofys can run on Databricks clusters.
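
Once a cluster starts with this init script attached, you can confirm that both FUSE mounts are in place. A minimal check, assuming you run it in a %sh notebook cell on the cluster:

    %sh

    # Check the FUSE mounts created by the init script.
    mountpoint /Workspace
    mountpoint /dbfs

    # Check that the wsfs and goofys-dbr processes are running.
    ps -ef | grep -E "wsfs|goofys-dbr" | grep -v grep

If either path is not reported as a mountpoint, review /databricks/data/logs/wsfs.log and /databricks/data/logs/dbfs_fuse_stderr, the log files the script below writes to.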

Create the init script

  1. Ensure that you have a directory to store your init scripts. If you do not have one, create one.
    %scala
    
    dbutils.fs.mkdirs("dbfs:/databricks/<init-script-folder>/")
  2. Use this sample code to create an init script called repo.sh on your cluster. Replace <init-script-folder> with the location of your init scripts.
    %scala
    
    dbutils.fs.put("dbfs:/databricks/<init-script-folder>/repo.sh", """
    #!/bin/bash
    
    set -o xtrace
    
    source /databricks/spark/conf/spark-env.sh
    
    export WSFS_ENABLE_DEBUG_LOG
    
    mkdir -p /Workspace
    mkdir -p /databricks/data/logs/
    
    nohup /databricks/spark/scripts/fuse/wsfs /Workspace > /databricks/data/logs/wsfs.log 2>&1 &
    
    WAIT_TIMEOUT=5
    CHECK_INTERVAL=0.1
    WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
    
    until mountpoint -q /Workspace || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
      sleep $CHECK_INTERVAL
    done
    
    mkdir -p /dbfs
    nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other \
                  --file-mode=0777 --dir-mode=0777 -o bg --http-timeout 120s \
                  /: /dbfs > /databricks/data/logs/dbfs_fuse_stderr 2>&1 &
    
    WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
    
    until mountpoint -q /dbfs || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
      sleep $CHECK_INTERVAL
    done
    
    """,true)
  3. Verify that the init script was successfully created on your cluster.
    %scala
    
    display(dbutils.fs.ls("dbfs:/databricks/<init-script-folder>/repo.sh"))
  4. Make sure to record the full path to the init script. You will need it when you configure the init script.

Configure the init script

Follow the documentation to configure a cluster-scoped init script (AWS | Azure | GCP).

Specify the path to the init script. Use the same path where you created repo.sh in the sample code (dbfs:/databricks/<init-script-folder>/repo.sh).
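
For example, if you configure the cluster through its JSON definition (the UI's JSON view or the Clusters API), the init script entry might look like the following sketch; the exact schema is described in the linked documentation:

    {
      "init_scripts": [
        { "dbfs": { "destination": "dbfs:/databricks/<init-script-folder>/repo.sh" } }
      ]
    }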

After configuring the init script, restart the cluster.
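
After the cluster restarts, you can verify that the repo path from the error message above is reachable again. A minimal check in a %sh notebook cell, replacing <username> with your username:

    %sh

    # The workspace filesystem is mounted at /Workspace, so repo folders should be visible.
    ls /Workspace/Repos/<username>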