Problem
You have a Python function that is defined in a custom egg or wheel file and has dependencies that are satisfied by another custom package installed on the cluster.
When you call this function, it returns an error that says the requirement cannot be satisfied.
org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnvDirs/virtualEnv-d82b31df-1da3-4ee9-864d-8d1fce09c09b/bin/python, /local_disk0/pythonVirtualEnvDirs/virtualEnv-d82b31df-1da3-4ee9-864d-8d1fce09c09b/bin/pip, install, fractal==0.1.0, --disable-pip-version-check) exited with code 1. Could not find a version that satisfies the requirement fractal==0.1.0 (from versions: 0.1.1, 0.1.2, 0.2.1, 0.2.2, 0.2.3, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0)
As an example, imagine that you have both wheel A and wheel B installed, either on the cluster via the UI or as notebook-scoped libraries. Assume that wheel A has a dependency on wheel B.
- dbutils.library.install("/path_to_wheel/A.whl")
- dbutils.library.install("/path_to_wheel/B.whl")
When you try to make a call using one of these libraries, you get a "requirement cannot be satisfied" error.
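
For illustration, a minimal, hypothetical setup.py for wheel A might look like the following; the fractal==0.1.0 pin is borrowed from the error above and stands in for whatever distribution wheel B actually provides. The point is that the pin lives in A's metadata, so pip must resolve it when A.whl is installed.

from setuptools import setup, find_packages

setup(
    name="A",
    version="1.0.0",
    packages=find_packages(),
    # pip resolves this pin when A.whl is installed; if the matching
    # distribution from B.whl is not already present on the node,
    # pip looks for it on PyPI instead.
    install_requires=["fractal==0.1.0"],
)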
Cause
Even though the requirements have been met by installing the required dependencies via the cluster UI or as notebook-scoped libraries, Databricks cannot guarantee the order in which specific libraries are installed on the cluster. If a library is referenced before it has been distributed to the executor nodes, the installation falls back to PyPI and tries to satisfy the requirement from there. If the pinned version exists only in your local wheel (as with fractal==0.1.0 in the error above, where PyPI offers only 0.1.1 and later), the installation fails.
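
The following sketch (package names, paths, and versions are hypothetical, reusing fractal==0.1.0 from the error above) shows why the ordering matters:

# If B's distribution were guaranteed to be present on every node before A
# installs, pip could satisfy the pin from the local wheel:
dbutils.library.install("/path_to_wheel/B.whl")  # provides fractal==0.1.0
dbutils.library.install("/path_to_wheel/A.whl")  # declares fractal==0.1.0

# Because that ordering is not guaranteed across the cluster, A's install can
# run before B.whl is available on a node. pip then cannot find fractal==0.1.0
# locally, falls back to PyPI, sees only 0.1.1 and later, and fails with the
# error shown above.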
Solution
You should package all required code and dependencies into a single egg or wheel file. This ensures that your code has the correct libraries loaded and available at run time.
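
One way to do this, sketched here with setuptools and a hypothetical project layout, is to keep your own code and the code that previously shipped as wheel B in a single source tree and build one wheel from it:

# setup.py for a single wheel that bundles your code and its dependency code.
from setuptools import setup, find_packages

setup(
    name="A-bundled",  # hypothetical distribution name
    version="1.0.0",
    # find_packages() picks up every package in the source tree, for example
    # both your own package and the bundled fractal package in this sketch.
    packages=find_packages(),
)

Build the wheel (for example with python setup.py bdist_wheel) and install only that single .whl file on the cluster.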