KNN model using pyfunc returns ModuleNotFoundError or FileNotFoundError

Problem

You have created a scikit-learn model using KNeighborsClassifier and are using pyfunc to run predictions.

For example:

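import mlflow.pyfunc

# model_uri is the URI of the logged KNN model; merge is the DataFrame to score.
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type='string')
# Pass every column except the first to the model as input.
predicted_df = merge.withColumn("prediction", pyfunc_udf(*merge.columns[1:]))
predicted_df.collect()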

The prediction returns a ModuleNotFoundError: No module named 'sklearn.neighbors._classification' error message.

The prediction may also return a FileNotFoundError: [Errno 2] No usable temporary directory found error message.

Cause

When a KNN model is logged, all of the data points used for training are saved as part of the pickle file.

If the model is trained on millions of records, all of that data is stored in the model, which can dramatically increase its size. Such a model can easily reach several gigabytes.

pyfunc attempts to load the entire model into the executor’s cache when running a prediction.

If the model is too large to fit into memory, the prediction fails with one of the above error messages.
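
To see how the training set drives model size, you can measure the pickled size of a KNN model before logging it. The following is a minimal sketch that assumes scikit-learn and NumPy are installed. The helper name pickled_knn_size and the random data are illustrative only and are not part of the original example.

```python
import pickle

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def pickled_knn_size(n_rows, n_features=10, seed=0):
    """Fit a KNN classifier on random data and return its pickled size in bytes."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    y = rng.integers(0, 2, size=n_rows)
    model = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    # The pickle contains the fitted training data, so it grows with n_rows.
    return len(pickle.dumps(model))


small_bytes = pickled_knn_size(1_000)
large_bytes = pickled_knn_size(10_000)
print(f"1,000 rows:  {small_bytes:,} bytes")
print(f"10,000 rows: {large_bytes:,} bytes")
```

The serialized size should grow roughly in proportion to the number of training rows. Running the same check on your real model gives an early indication of whether it is likely to fit in executor memory.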

Solution

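Use a tree-based algorithm, such as Random Forest or XGBoost, instead of KNN, or downsample the data used to train the KNN model. Tree-based models store learned split rules rather than the training records, so they are typically much smaller than a KNN model trained on the same data.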
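
The two options can be compared with a short sketch like the one below, which assumes scikit-learn and NumPy are installed. The synthetic data, the 10% sample fraction, and the forest settings (n_estimators=10, max_depth=6) are arbitrary choices for illustration, not recommended values.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a large training set (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Option 1: keep KNN, but fit it on a stratified 10% sample of the data.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

models = {
    "knn_full": KNeighborsClassifier().fit(X, y),
    "knn_downsampled": KNeighborsClassifier().fit(X_small, y_small),
    # Option 2: a depth-limited random forest trained on all of the data.
    "forest": RandomForestClassifier(
        n_estimators=10, max_depth=6, random_state=0
    ).fit(X, y),
}

# Compare the pickled size of each model.
sizes = {name: len(pickle.dumps(model)) for name, model in models.items()}
for name, size in sizes.items():
    print(f"{name}: {size:,} bytes")
```

Either option changes the model, so validate accuracy on held-out data before replacing the original KNN model.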

If your data is imbalanced, try a sampling method such as SMOTE when training the tree-based algorithm.
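
The core idea of SMOTE is to create synthetic minority-class points by interpolating between a minority point and one of its nearest minority neighbors. The following is a simplified sketch of that idea, assuming scikit-learn and NumPy are installed. The function smote_sketch and the synthetic data are hypothetical. In practice you would more likely use a maintained implementation, such as the SMOTE class in the imbalanced-learn library.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: for distinct points, column 0 is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    # Pick a random base point and one of its k neighbors for each new sample.
    base = rng.integers(0, len(X_minority), size=n_synthetic)
    picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_synthetic)]

    # Place each synthetic point at a random position on the connecting segment.
    gap = rng.random((n_synthetic, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Illustrative imbalanced data: 950 majority rows and 50 minority rows.
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(950, 4))
X_minor = rng.normal(2.0, 1.0, size=(50, 4))

X_new = smote_sketch(X_minor, n_synthetic=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, X_new])
y_balanced = np.concatenate(
    [np.zeros(len(X_major), dtype=int), np.ones(len(X_minor) + len(X_new), dtype=int)]
)
print(np.bincount(y_balanced))
```

Apply oversampling to the training split only, so that the evaluation data still reflects the real class distribution.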