Jobs failing on Databricks Runtime 5.5 LTS with an SQLAlchemy package error

Problem

Databricks jobs that require the third-party library SQLAlchemy are failing. This issue started occurring on or about March 10, 2020.

The location of the error message differs between job clusters and all-purpose clusters, but in both cases the message is similar to the following example:

Library installation failed for library pypi {
 package: "sqlalchemy"
}
. Error messages:
java.lang.RuntimeException: ManagedLibraryInstallFailed: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, sqlalchemy, --disable-pip-version-check) exited with code 2. ERROR: Exception:

Failure on job clusters

On a job cluster, the error manifests as a failure of the cluster to start. You can confirm the issue by viewing the job run results and looking for text similar to the example error message.

Failure on all-purpose clusters

If you have all-purpose clusters using Databricks Runtime 5.5 LTS, you can view the error message within the workspace UI.

  1. Click Clusters.
  2. Click the name of your cluster.
  3. Click Libraries.
  4. Click sqlalchemy.
  5. Read the error messages under the Messages heading.

Look for text similar to the example error message.

Version

The problem affects clusters on Databricks Runtime 5.5 LTS using SQLAlchemy 1.3.15.

Cause

On March 10, 2020, the SQLAlchemy project released version 1.3.15 to PyPI. The new release causes pip to build the package with the pep517 build system, which fails with the pip version included in Databricks Runtime 5.5 LTS. If your clusters are configured to automatically download the most current version of SQLAlchemy from PyPI, the update can result in job failures.

Solution

There are two workarounds available:

  • Restrict the version of SQLAlchemy to 1.3.13.
  • Prevent pip from using the pep517 build system.

Restrict SQLAlchemy to version 1.3.13

Install SQLAlchemy using the PyPI package installation instructions and pin the version to 1.3.13 (for example, by specifying the package as sqlalchemy==1.3.13) so that pip does not resolve to the latest release.
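
If you install libraries from a notebook instead of the cluster Libraries UI, a minimal sketch using the Databricks library utilities available on Databricks Runtime 5.5 LTS looks like this:

# Pin SQLAlchemy to the last release known to install cleanly on this runtime.
dbutils.library.installPyPI("sqlalchemy", version="1.3.13")

# Restart the Python process so the pinned version is used for the rest of the notebook.
dbutils.library.restartPython()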

Prevent pip from using the pep517 build system

You can use an init script to prevent pip from using the pep517 build system.

Run the following code block in a notebook to generate the init script install-sqlalchemy.sh in DBFS:

# Write the init script to DBFS; the script installs SQLAlchemy with the
# pep517 build system disabled. The final True argument overwrites any
# existing file at this path.
dbutils.fs.put("/databricks/init-scripts/install-sqlalchemy.sh", """
#!/bin/bash
/databricks/python/bin/pip install sqlalchemy --disable-pip-version-check --no-use-pep517
""", True)

Follow the cluster-scoped init script documentation to configure the script on your cluster, then restart the cluster.
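
After the cluster restarts, you can verify from a notebook that the library installed successfully, for example:

# Import the library and print the installed version to confirm the workaround.
import sqlalchemy
print(sqlalchemy.__version__)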

Best practice recommendation

Whenever you use third-party libraries, configure your clusters to pin each library to a specific version that is known to work. New versions of libraries can offer new features, but they can also introduce problems if they are deployed without testing and validation.