Problem
PyPMML is a Python PMML scoring library.
After you install PyPMML on a Databricks cluster, loading a model fails with the error Py4JError: Could not find py4j jar.
%python
from pypmml import Model
modelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml')

Error: Py4JError: Could not find py4j jar at
Cause
This error occurs because PyPMML depends on Py4J, and Databricks ships its own default Py4J library.
- Databricks Runtime 5.0-6.6 uses Py4J 0.10.7.
- Databricks Runtime 7.0 and above uses Py4J 0.10.9.
The default Py4J library is installed to a different location than a standard Py4J package. As a result, when PyPMML attempts to invoke Py4J from the default path, it fails.
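As a minimal sketch, the version mapping listed above can be written as a small helper. The function name py4j_version_for_runtime is hypothetical and is not part of PyPMML or Databricks. It encodes only the two ranges stated in this article, so treat the result as a starting point and confirm it against the jar actually present on your cluster.

```python
def py4j_version_for_runtime(runtime_version: str) -> str:
    """Return the Py4J version this article lists for a Databricks Runtime.

    Hypothetical helper: it only encodes the two ranges stated above.
    Accepts strings such as "6.5" or "7.3.x-scala2.12".
    """
    # Only the leading "major.minor" part is needed for the comparison.
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    version = (major, minor)

    if (5, 0) <= version <= (6, 6):
        return "0.10.7"
    if version >= (7, 0):
        return "0.10.9"
    # Runtimes outside the listed ranges are not covered by this article.
    raise ValueError(f"No Py4J version listed for runtime {runtime_version!r}")
```

For example, py4j_version_for_runtime("6.5") selects the version to pass to pip for Databricks Runtime 6.5.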
Solution
Set up a cluster-scoped init script that copies the required Py4J jar file into the expected location.
- Use pip to install the version of Py4J that corresponds to your Databricks Runtime version. For example, in Databricks Runtime 6.5 run pip install py4j==0.10.7 in a notebook to install Py4J 0.10.7 on the cluster.
- Run find /databricks/ -name "py4j*jar" in a notebook to confirm the full path to the Py4J jar file. It is usually located in a path similar to /databricks/python3/share/py4j/.
- Manually copy the Py4J jar file from the install path to the DBFS path /dbfs/py4j/.
- Run the following code snippet in a Python notebook to create the install-py4j-jar.sh init script. Make sure the version number of Py4J listed in the snippet corresponds to your Databricks Runtime version.
%python
dbutils.fs.put("/databricks/init-scripts/install-py4j-jar.sh", """#!/bin/bash
echo "Copying at `date`"
mkdir -p /share/py4j/ /current-release/
cp /dbfs/py4j/py4j<version number>.jar /share/py4j/
cp /dbfs/py4j/py4j<version number>.jar /current-release/
echo "Copying completed at `date`"
""", True)
- Attach the install-py4j-jar.sh init script to your cluster, following the instructions in configure a cluster-scoped init script (AWS | Azure | GCP).
- Restart the cluster.
- Verify that PyPMML works as expected.
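The find and copy steps above can also be sketched in Python instead of being run by hand. This is an illustrative sketch, not part of the official procedure. The helper names find_py4j_jars and copy_jar are hypothetical, and the sketch assumes the notebook can read /databricks/ and write to /dbfs/py4j/ through the local file system.

```python
import shutil
from pathlib import Path


def find_py4j_jars(root):
    """Return files under root whose names match py4j*jar, sorted by path.

    Mirrors: find <root> -name "py4j*jar"
    """
    return sorted(p for p in Path(root).rglob("py4j*jar") if p.is_file())


def copy_jar(jar_path, dest_dir):
    """Copy one jar into dest_dir, creating the directory if it is missing."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(jar_path, dest))
```

On a cluster, a call such as find_py4j_jars("/databricks") lists the candidate jars, and copy_jar(jars[0], "/dbfs/py4j") copies the first one. Check that the file name matches the version number used in the init script before restarting the cluster.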