Problem: Library Unavailability Causing Job Failures

This topic explains the ImportError you may encounter when launching jobs that import external libraries.

Problem

When a job triggers a node restart, the job fails with the following error message:

ImportError: No module named XXX

Cause

The Cluster Manager is the component of the Databricks service that manages customer Apache Spark clusters. When it restarts each node, it sends commands to install the Python and R libraries attached to the cluster. Installing a library, or downloading its artifacts from the internet, can sometimes take longer than expected, either because of network latency or because the library being attached to the cluster has many dependencies.

The library installation mechanism is designed so that when a notebook attaches to a cluster, it can import the installed libraries. However, when library installation through PyPI takes too long, the notebook can attach to the cluster before the installation completes. In this case, the notebook is unable to import the library.
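If you cannot avoid the race, a notebook can defend against it by polling for the import rather than failing on the first attempt. The following is a minimal sketch of such a guard; the wait_for_import helper, module name, timeout, and polling interval are all illustrative assumptions, not Databricks APIs:

import importlib
import time

def wait_for_import(module_name, timeout_s=300, poll_s=10):
    # Retry the import until the cluster-attached library finishes
    # installing or the timeout expires. invalidate_caches() forces
    # Python to re-scan sys.path for newly installed packages.
    deadline = time.time() + timeout_s
    while True:
        try:
            importlib.invalidate_caches()
            return importlib.import_module(module_name)
        except ImportError:
            if time.time() >= deadline:
                raise
            time.sleep(poll_s)

mlflow = wait_for_import("mlflow")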

Solution

Method 1

Use notebook-scoped library installation commands in the notebook. Enter the following commands in a single cell; this ensures that the specified libraries are installed before any other code in the notebook runs.

dbutils.library.installPyPI("mlflow")  # install the library for this notebook session
dbutils.library.restartPython()        # restart Python so the library becomes importable
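On recent Databricks Runtime versions, dbutils.library.installPyPI has been removed in favor of the %pip magic command. If that applies to your cluster, a cell like the following provides the equivalent notebook-scoped installation, assuming the same example library:

%pip install mlflow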

Method 2

To avoid the delay of downloading libraries from internet repositories, cache the libraries in DBFS or S3.

For example, you can download the wheel or egg file for a Python library to a DBFS or S3 location, and then use the REST API or cluster-scoped init scripts to install it from there.

First, download the wheel or egg file from the internet to a DBFS or S3 location. You can do this in a notebook as follows:

%sh
cd /dbfs/mnt/library
wget <whl/egg file location from the PyPI repository>

After the wheel or egg file download completes, you can install the library on the cluster using the REST API, the UI, or cluster-scoped init scripts, as sketched below.
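For reference, here is a sketch of the REST API route, using the Libraries API install endpoint (POST /api/2.0/libraries/install) to attach the cached wheel to a running cluster. The environment variable names, cluster ID, and wheel filename are illustrative assumptions:

import os
import requests

host = os.environ["DATABRICKS_HOST"]    # for example, https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # a personal access token

# Install the cached wheel on a running cluster.
resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "1234-567890-abcde123",  # illustrative cluster ID
        "libraries": [{"whl": "dbfs:/mnt/library/mypkg-1.0-py3-none-any.whl"}],
    },
)
resp.raise_for_status()

Because installation is asynchronous, you can poll the cluster status endpoint (GET /api/2.0/libraries/cluster-status) to confirm the library reaches the INSTALLED state before starting jobs.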