Shorten cluster provisioning time by using Docker containers

Learn how to speed up cluster provisioning by using Docker container services

Written by Adam Pavlacka

Last published at: November 30th, 2023

Problem:

You are running model predictions as scheduled jobs. However, every time you make a small change to your model package, the package must be reinstalled on the cluster, which slows down the provisioning process.

Solution:

To shorten cluster provisioning time, you can use Docker container services:

  1. Create a golden container environment with your required libraries pre-installed.
  2. Use that Docker image as the base image for your cluster.
  3. Modify the container to install any additional libraries specific to your project (see the build sketch after this list).
  4. Provision the cluster using the modified container (see the provisioning sketch further below).
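
Steps 2 and 3 can be scripted once the golden image exists in your registry. The following is a minimal sketch using the Docker SDK for Python (the `docker` package); the registry URL, image names and tags, package name, and pip path are illustrative assumptions rather than values from this article, so adjust them to your environment.

```python
import io
import docker

# Dockerfile for the project image: start from the "golden" image that already
# contains the common libraries, then add only the project-specific package.
# Registry URL, image names/tags, and package name are placeholders.
dockerfile = io.BytesIO(b"""
FROM myregistry.example.com/ml/golden-image:latest
# The pip path may differ depending on your base image; Databricks runtime
# base images typically ship Python under /databricks/python3.
RUN /databricks/python3/bin/pip install my-model-package==1.2.3
""")

# Assumes a local Docker daemon and that you are already logged in to the
# registry (docker login myregistry.example.com).
client = docker.from_env()

# Build the project image on top of the golden image (steps 2 and 3).
image, _ = client.images.build(
    fileobj=dockerfile,
    tag="myregistry.example.com/ml/model-scoring:1.2.3",
    rm=True,
)

# Push it so Databricks can pull it when the cluster starts.
for line in client.images.push(
    "myregistry.example.com/ml/model-scoring", tag="1.2.3", stream=True, decode=True
):
    print(line)
```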

By using Docker containers, the libraries are baked into the image, so each node no longer has to install its own copy of the libraries at startup, resulting in faster cluster provisioning.
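
For step 4, the custom image is referenced when the cluster is created, for example through the Clusters API (`clusters/create`) and its `docker_image` field. The sketch below assumes Databricks Container Services is enabled on the workspace; the workspace URL, token, node type, runtime version, and registry credentials are placeholders, and the same `docker_image` block can also be used in a job cluster definition.

```python
import os
import requests

# Provision a cluster from the custom image (step 4) using the Clusters API.
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "model-scoring",
    "spark_version": "13.3.x-scala2.12",   # a runtime version that supports Container Services
    "node_type_id": "i3.xlarge",           # cloud specific; use an Azure VM type on Azure
    "num_workers": 2,
    "docker_image": {
        "url": "myregistry.example.com/ml/model-scoring:1.2.3",
        "basic_auth": {                     # omit for a public registry
            "username": os.environ["REGISTRY_USER"],
            "password": os.environ["REGISTRY_PASSWORD"],
        },
    },
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```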

For more information, refer to the Databricks documentation on custom containers (AWS | Azure).

Additionally, you can explore the Databricks GitHub repository for containers, which provides base container examples you can customize.