Cannot import module in egg library
Problem You try to install an egg library to your cluster and it fails with a message that a module in the library cannot be imported. Even a simple import fails. import sys egg_path='/dbfs/<path-to-egg-file>/<egg-file>.egg' sys.path.append(egg_path) import shap_master Cause This error message occurs due to the way the library is pac...
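When an import from an egg fails, it can help to check which top-level names the archive actually exposes. The following is a minimal sketch, assuming the egg is a zip archive (eggs can also be plain directories); the helper name top_level_modules is hypothetical and is not part of any Databricks or setuptools API.

```python
import zipfile


def top_level_modules(egg_path):
    """Return the sorted top-level package and module names in a zipped egg."""
    names = set()
    with zipfile.ZipFile(egg_path) as egg:
        for entry in egg.namelist():
            head = entry.split("/", 1)[0]
            if head == "EGG-INFO":
                # Metadata directory, not an importable package.
                continue
            if "/" in entry:
                # Anything inside a directory: record the directory name.
                names.add(head)
            elif head.endswith(".py"):
                # A single-file module at the archive root.
                names.add(head[:-3])
    return sorted(names)
```

If the name you are importing does not appear in the returned list, appending the egg to sys.path will not make it importable.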
Cannot import TabularPrediction from AutoGluon
Problem You are trying to import TabularPrediction from AutoGluon, but are getting an error message. ImportError: cannot import name 'TabularPrediction' from 'autogluon' (unknown location) This happens when AutoGluon is installed via a notebook or as a cluster-installed library (AWS | Azure | GCP). You can reproduce the error by running the import c...
Latest PyStan fails to install on Databricks Runtime 6.4
Problem You are trying to install the PyStan PyPI package on a Databricks Runtime 6.4 Extended Support cluster and get a ManagedLibraryInstallFailed error message. java.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, pystan, --disable-pip-version-check) exited wit...
Library unavailability causing job failures
Problem You are launching jobs that import external libraries and get an Import Error. When a job causes a node to restart, the job fails with the following error message: ImportError: No module named XXX Cause The Cluster Manager is part of the Databricks service that manages customer Apache Spark clusters. It sends commands to install Python and R...
How to correctly update a Maven library in Databricks
Problem You make a minor update to a library in the repository, but you don’t want to change the version number because it is a small change for testing purposes. When you attach the library to your cluster again, your code changes are not included in the library. Cause One strength of Databricks is the ability to install third-party or custom libra...
Init script fails to download Maven JAR
Problem You have an init script that is attempting to install a library via Maven, but it fails when trying to download a JAR. https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.4.1/rapids-4-spark_2.12-0.4.1.jar%0D Resolving repo1.maven.org (repo1.maven.org)... 151.101.248.209 Connecting to repo1.maven.org (repo1.maven.org)|151.101.248....
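The %0D at the end of the requested URL is a percent-encoded carriage return. One plausible reading, which the truncated excerpt above does not confirm, is that the script was saved with Windows-style (CRLF) line endings. The sketch below shows how such line endings could be detected and normalized; the function name and the sample script contents are hypothetical.

```python
from urllib.parse import unquote


def strip_carriage_returns(script_bytes):
    """Convert CRLF line endings to LF; return the cleaned bytes and the CRLF count."""
    crlf_count = script_bytes.count(b"\r\n")
    return script_bytes.replace(b"\r\n", b"\n"), crlf_count


# %0D decodes to a carriage return character.
decoded = unquote("%0D")

# Hypothetical script body with CRLF line endings, for illustration only.
sample = b"#!/bin/bash\r\nwget https://example.invalid/lib.jar\r\n"
cleaned, fixed = strip_carriage_returns(sample)
```

A nonzero count means each affected line ends with a stray carriage return, which a shell passes along as part of the last argument on that line.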
Install package using previous CRAN snapshot
Problem You are trying to install a library package via CRAN, and are getting a Library installation failed for library due to infra fault error message. Library installation failed for library due to infra fault for Some(cran { package: "<name-of-package>" } ). Error messages: java.lang.RuntimeException: Installation failed with message: Erro...
Install PyGraphViz
PyGraphViz Python libraries are used to plot causal inference networks. If you try to install PyGraphViz as a standard library, it fails due to dependency errors. PyGraphViz has the following dependencies: python3-dev graphviz libgraphviz-dev pkg-config Install via notebook Install the dependencies with apt-get. %sh sudo apt-get install -y python3-de...
Install Turbodbc via init script
Turbodbc is a Python module that uses the ODBC interface to access relational databases. It has dependencies on libboost-all-dev, unixodbc-dev, and python-dev packages, which need to be installed in order. You can install these manually, or you can use an init script to automate the install. Create the init script Run this sample script in a noteboo...
Cannot uninstall library from UI
Problem Usually, libraries can be uninstalled in the Clusters UI. If the checkbox to select the library is disabled, then it’s not possible to uninstall the library from the UI. Cause If you create a library using REST API version 1.2 and if auto-attach is enabled, the library is installed on all clusters. In this scenario, the Clusters UI checkbox ...
Error when installing Cartopy on a cluster
Problem You are trying to install Cartopy on a cluster and you receive a ManagedLibraryInstallFailed error message. java.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, cartopy==0.17.0, --disable-pip-version-check) exited with code 1. ERROR: Command errored out ...
Error when installing pyodbc on a cluster
Problem One of the following errors occurs when you use pip to install the pyodbc library. java.lang.RuntimeException: Installation failed with message: Collecting pyodbc "Library installation is failing due to missing dependencies. sasl and thrift_sasl are optional dependencies for SASL or Kerberos support" Cause Although sasl and thrift_sasl are o...
Libraries fail with dependency exception
Problem You have a Python function that is defined in a custom egg or wheel file and also has dependencies that are satisfied by another custom package installed on the cluster. When you call this function, it returns an error that says the requirement cannot be satisfied. org.apache.spark.SparkException: Process List(/local_disk0/pythonVirtualEnv...
Libraries failing due to transient Maven issue
Problem Job fails because libraries cannot be installed. Library resolution failed. Cause: java.lang.RuntimeException: Cannot download some libraries due to transient Maven issue. Please try again later Cause After a Databricks upgrade, your cluster attempts to download any required libraries from Maven. After downloading, the libraries are stored a...
Reading .xlsx files with xlrd fails
Problem You have xlrd installed on your cluster and are attempting to read files in the Excel .xlsx format when you get an error. XLRDError: Excel xlsx file; not supported Cause xlrd 2.0.0 and above can only read .xls files. Support for .xlsx files was removed from xlrd due to a potential security vulnerability. Solution Use openpyxl to open .xl...
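As an illustration of the split described above, the following sketch chooses a reader by file extension. It assumes the file is read with pandas, whose read_excel function accepts an engine argument; the helper name excel_engine_for is hypothetical, and the pandas call is shown only as a comment so the block runs without extra libraries.

```python
from pathlib import Path


def excel_engine_for(path):
    """Pick a reader engine name from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xls":
        # Legacy binary format, still readable by xlrd 2.0.0 and above.
        return "xlrd"
    if suffix == ".xlsx":
        # Office Open XML format, no longer supported by xlrd.
        return "openpyxl"
    raise ValueError(f"Unsupported Excel extension: {suffix!r}")


# Possible usage, assuming pandas and openpyxl are installed on the cluster:
# import pandas as pd
# df = pd.read_excel(path, engine=excel_engine_for(path))
```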
Remove Log4j 1.x JMSAppender and SocketServer classes from classpath
Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. Databricks also does not use the affected classes from Log4j 1.x with known vul...
Replace a default library jar
Databricks includes a number of default Java and Scala libraries. You can replace any of these libraries with another version by using a cluster-scoped init script to remove the default library jar and then install the version you require. Warning Removing default libraries and installing new versions may cause instability or completely break your D...
Python command fails with AssertionError: wrong color format
Problem You run a Python notebook and it fails with an AssertionError: wrong color format message. An example stack trace: File "/local_disk0/tmp/1599775649524-0/PythonShell.py", line 39, in <module> from IPython.nbconvert.filters.ansi import ansi2html File "<frozen importlib._bootstrap>", line 983, in _find_and_load File "<...
PyPMML fails with Could not find py4j jar error
Problem PyPMML is a Python PMML scoring library. After installing PyPMML in a Databricks cluster, it fails with a Py4JError: Could not find py4j jar error. %python from pypmml import Model modelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml') Error: Py4JError: Could not find py4j jar at Cause This error occurs due to a dependency on the defa...
TensorFlow fails to import
Problem You have TensorFlow installed on your cluster. When you try to import TensorFlow, it fails with an Invalid Syntax or import error. Cause The version of protobuf installed on your cluster is not compatible with your version of TensorFlow. Solution Use a cluster-scoped init script to install TensorFlow with matching versions of NumPy and proto...
Verify the version of Log4j on your cluster
Databricks recently published a blog on Log4j 2 Vulnerability (CVE-2021-44228) Research and Assessment. Databricks does not directly use a version of Log4j known to be affected by this vulnerability within the Databricks platform in a way we understand may be vulnerable. If you are using Log4j within your cluster (for example, if you are processing ...
Apache Spark jobs fail with Environment directory not found error
Problem After you install a Python library (via the cluster UI or by using pip), your Apache Spark jobs fail with an Environment directory not found error message. org.apache.spark.SparkException: Environment directory not found at /local_disk0/.ephemeral_nfs/cluster_libraries/python Cause Libraries are installed on a Network File System (NFS) on th...