Use Databricks Repos with Docker container services

Configure your cluster with a custom init script to use Databricks Repos with Docker container services.

Written by darshan.bargal

Last published at: September 28th, 2022

Introduction

Depending on your use case, you may want to use both Docker Container Services (DCS) and Databricks Repos (AWS | Azure | GCP) at the same time. DCS does not work with Databricks Repos by default, but you can use a custom init script to enable both.

If you have not installed an init script to configure DCS with Databricks Repos, you may see an error message when you try to start your cluster. This happens because the underlying filesystem becomes inaccessible.

For example, using a repo without the init script can produce an error like this:

py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
File "/databricks/python_shell/scripts/PythonShellImpl.py", line 935, in initStartingDirectory
os.chdir(directory)
FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Repos/<username>/hello_world'

Instructions

You can use the example init script in this article to get DCS working with Databricks Repos.

This init script ensures that the goofys-dbr process is running, which keeps the filesystem accessible. The goofys-dbr process is a Databricks internal fork of goofys. It adds support for Azure Data Lake Storage (ADLS) and Azure Blob Storage, and it ensures that goofys can run on Databricks clusters.
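
Once a cluster starts with this init script attached, you can confirm that both FUSE mounts are in place. A minimal check, assuming you run it in a %sh notebook cell on the cluster:

    %sh

    # Check the FUSE mounts created by the init script.
    mountpoint /Workspace
    mountpoint /dbfs

    # Check that the wsfs and goofys-dbr processes are running.
    ps -ef | grep -E "wsfs|goofys-dbr" | grep -v grep

If either path is not reported as a mountpoint, review /databricks/data/logs/wsfs.log and /databricks/data/logs/dbfs_fuse_stderr, the log files the script below writes to.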

Create the init script

  1. Ensure that you have a directory to store your init scripts. If you do not have one, create one.
    %scala
    
    dbutils.fs.mkdirs("dbfs:/databricks/<init-script-folder>/")
  2. Use this sample code to create an init script called repo.sh on your cluster. Replace <init-script-folder> with the location of your init scripts.
    %scala
    
    dbutils.fs.put("dbfs:/databricks/<init-script-folder>/repo.sh", """
    #!/bin/bash
    
    set -o xtrace
    
    source /databricks/spark/conf/spark-env.sh
    
    export WSFS_ENABLE_DEBUG_LOG
    
    mkdir -p /Workspace
    mkdir -p /databricks/data/logs/
    
    nohup /databricks/spark/scripts/fuse/wsfs /Workspace > /databricks/data/logs/wsfs.log 2>&1 &
    
    WAIT_TIMEOUT=5
    CHECK_INTERVAL=0.1
    WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
    
    until mountpoint -q /Workspace || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
      sleep $CHECK_INTERVAL
    done
    
    mkdir -p /dbfs
    nohup /databricks/spark/scripts/fuse/goofys-dbr -f -o allow_other \
                  --file-mode=0777 --dir-mode=0777 -o bg --http-timeout 120s \
                  /: /dbfs > /databricks/data/logs/dbfs_fuse_stderr 2>&1 &
    
    WAIT_UNTIL=$(($(date +%s) + $WAIT_TIMEOUT))
    
    until mountpoint -q /dbfs || [[ $(date +%s) -ge $WAIT_UNTIL ]]; do
      sleep $CHECK_INTERVAL
    done
    
    """,true)
  3. Verify that the init script was successfully created on your cluster.
    %scala
    
    display(dbutils.fs.ls("dbfs:/databricks/<init-script-folder>/repo.sh"))
  4. Make sure to record the full path to the init script. You will need it when you configure the init script.

Configure the init script

Follow the documentation to configure a cluster-scoped init script (AWS | Azure | GCP).

Specify the path to the init script. Use the same path where you created repo.sh in the sample code (dbfs:/databricks/<init-script-folder>/repo.sh).
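
For example, if you configure the cluster through its JSON definition (the UI's JSON view or the Clusters API), the init script entry might look like the following sketch; the exact schema is described in the linked documentation:

    {
      "init_scripts": [
        { "dbfs": { "destination": "dbfs:/databricks/<init-script-folder>/repo.sh" } }
      ]
    }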

After configuring the init script, restart the cluster.
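
After the cluster restarts, you can verify that the repo path from the error message above is reachable again. A minimal check in a %sh notebook cell, replacing <username> with your username:

    %sh

    # The workspace filesystem is mounted at /Workspace, so repo folders should be visible.
    ls /Workspace/Repos/<username>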