Problem:
You are running model predictions on scheduled jobs. However, every time you make a small change to your model package, the package must be reinstalled on the cluster, which slows down provisioning.
Solution:
To shorten cluster provisioning time, you can use Databricks Container Services to launch clusters from a custom Docker image.
- Create a golden container environment with your required libraries pre-installed.
- Use the Docker container as the base for your cluster.
- Modify the container to install any additional libraries specific to your project.
- Provision the cluster using the modified container.
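The golden-container approach above can be sketched as a Dockerfile. This is illustrative only: the base image tag and the package names are assumptions you should replace with your own golden image and project dependencies.

```dockerfile
# Start from a golden base image with common libraries pre-installed.
# databricksruntime/standard is Databricks' public base image; the tag
# here is an assumption -- use the runtime version you actually target.
FROM databricksruntime/standard:12.2-LTS

# Layer on only the libraries specific to this project. Installing into
# the /databricks/python3 environment makes the packages visible to the
# cluster's Python. Package names below are placeholders.
RUN /databricks/python3/bin/pip install --no-cache-dir \
    my-model-package==1.4.2 \
    scikit-learn==1.3.0
```

Because the golden base layer is cached, rebuilding after a small change to your model package only re-runs the final `pip install` layer, and cluster nodes pull a ready-made image instead of installing libraries at start-up.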
By using Docker containers, you eliminate the need for each node to install a separate copy of the libraries, resulting in faster cluster provisioning.
For more information, refer to the Databricks documentation on custom containers (AWS | Azure).
Additionally, you can explore the Databricks GitHub repository for containers, which provides base container examples you can customize.
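As a sketch of the final provisioning step, a cluster created through the Clusters API can reference the modified container via the `docker_image` field. The registry URL, image name, and secret paths below are placeholders, and the node type and runtime version are assumptions for your workspace:

```json
{
  "cluster_name": "model-scoring",
  "spark_version": "12.2.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 2,
  "docker_image": {
    "url": "myregistry.example.com/model-scoring:1.4.2",
    "basic_auth": {
      "username": "{{secrets/docker/username}}",
      "password": "{{secrets/docker/password}}"
    }
  }
}
```

Every node in the cluster starts from this image, so no per-node library installation occurs during provisioning.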