PyPMML fails with Could not find py4j jar error

Written by arjun.kaimaparambilrajan

Last published at: May 16th, 2022

Problem

PyPMML is a Python PMML scoring library.

After you install PyPMML on a Databricks cluster, loading a model fails with the error Py4JError: Could not find py4j jar.

%python

from pypmml import Model
modelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml')

Error : Py4JError: Could not find py4j jar at

Cause

This error occurs because PyPMML depends on Py4J, and Databricks Runtime ships its own default Py4J library.

  • Databricks Runtime 5.0-6.6 uses Py4J 0.10.7.
  • Databricks Runtime 7.0 and above uses Py4J 0.10.9.

The default Py4J library is installed to a different location than a standard Py4J package. As a result, when PyPMML attempts to invoke Py4J from the default path, it fails.
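To pick the right Py4J version for the steps that follow, the version table above can be expressed as a small helper. This is a minimal sketch, not part of PyPMML or Databricks: the function names `expected_py4j_version` and `installed_py4j_version` are hypothetical, and the mapping covers only the two runtime ranges listed in this article.

```python
from importlib import metadata
from typing import Optional


def expected_py4j_version(runtime_version: str) -> str:
    """Map a Databricks Runtime version such as "6.5" to its Py4J version.

    Hypothetical helper: it encodes only the two ranges listed in this article.
    """
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    if (5, 0) <= (major, minor) <= (6, 6):
        return "0.10.7"
    if major >= 7:
        return "0.10.9"
    raise ValueError(f"No Py4J version listed for runtime {runtime_version}")


def installed_py4j_version() -> Optional[str]:
    """Return the pip-installed Py4J version, or None if pip has no record of it."""
    try:
        return metadata.version("py4j")
    except metadata.PackageNotFoundError:
        return None


print(expected_py4j_version("6.5"))  # prints 0.10.7
print(installed_py4j_version())
```

If `installed_py4j_version()` returns None or a version that differs from the expected one, the pip-installed package that PyPMML looks for is missing or mismatched.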

Solution

Set up a cluster-scoped init script that copies the required Py4J jar file into the expected location.

  1. Use pip to install the version of Py4J that corresponds to your Databricks Runtime version.
    For example, in Databricks Runtime 6.5 run pip install py4j==0.10.7 in a notebook to install Py4J 0.10.7 on the cluster.
  2. Run find /databricks/ -name "py4j*jar" in a notebook to confirm the full path to the Py4J jar file. It is usually located in a path similar to /databricks/python3/share/py4j/.
  3. Manually copy the Py4J jar file from the install path to the DBFS path /dbfs/py4j/.
  4. Run the following code snippet in a Python notebook to create the install-py4j-jar.sh init script. Make sure the version number of Py4J listed in the snippet corresponds to your Databricks Runtime version.
    %python
    
    dbutils.fs.put("/databricks/init-scripts/install-py4j-jar.sh", """#!/bin/bash
    echo "Copying at `date`"
    mkdir -p /share/py4j/ /current-release/
    cp /dbfs/py4j/py4j<version number>.jar /share/py4j/
    cp /dbfs/py4j/py4j<version number>.jar /current-release/
    echo "Copying completed at `date`"
    
    """, True)
  5. Attach the install-py4j-jar.sh init script to your cluster, following the instructions in configure a cluster-scoped init script (AWS | Azure | GCP).
  6. Restart the cluster.
  7. Verify that PyPMML works as expected.
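
Steps 2 and 3 above can also be done from a Python notebook instead of by hand. The following is a minimal sketch under stated assumptions: `find_py4j_jars` and `stage_py4j_jar` are hypothetical helper names, the search mirrors the find pattern "py4j*jar" from step 2, and the default paths are the ones given in this article.

```python
import fnmatch
import os
import shutil
from typing import List


def find_py4j_jars(root: str) -> List[str]:
    """Return every file under root whose name matches py4j*jar, sorted."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, "py4j*jar"):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def stage_py4j_jar(search_root: str = "/databricks",
                   dest_dir: str = "/dbfs/py4j") -> str:
    """Copy the first matching jar into dest_dir and return the new path.

    Assumption: if several jars match, the first in sorted order is used,
    so review the output of find_py4j_jars() when more than one is present.
    """
    jars = find_py4j_jars(search_root)
    if not jars:
        raise FileNotFoundError(f"No py4j jar found under {search_root}")
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy(jars[0], dest_dir)
```

On a cluster, calling `stage_py4j_jar()` with no arguments would search /databricks/ and copy the jar to /dbfs/py4j/, which is where the init script expects to find it. The file name it returns shows the exact version string to use in the init script.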