Problem
You have an ML model that takes documents as input, specifically an array of strings.
You use a feature extractor like TfidfVectorizer to convert the documents into a numerical feature matrix, which is then ingested into the model.
The model trains and predicts correctly in the notebook, but model serving does not return the expected results for JSON inputs.
Cause
TfidfVectorizer expects an array of documents as an input.
Databricks model serving converts JSON inputs to pandas DataFrames, which TfidfVectorizer does not process correctly.
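A minimal sketch (using hypothetical sample documents) of why the DataFrame input fails: iterating over a pandas DataFrame yields its column labels, not the row values, so a vectorizer that loops over its input sees column names instead of documents.

```python
import pandas as pd

# Hypothetical example: two documents wrapped in a DataFrame,
# the way model serving would deliver a JSON array of strings.
df = pd.DataFrame(["first document", "second document"])

print(list(df))            # iterating yields column labels: [0]
print(list(df[0].values))  # the actual documents live in the first column
```

This is why the fix below extracts the first column as an array before the data reaches TfidfVectorizer.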
Solution
You must create a custom transformer and add it to the head of the pipeline.
For example, the following sample code checks whether the input is a DataFrame. If it is, the first column is converted to an array of documents, which is then passed to TfidfVectorizer before being ingested into the model.
%python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class DataFrameToDocs(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        # Stateless transformer; nothing to learn.
        return self

    def transform(self, input_df):
        # Model serving passes inputs as a pandas DataFrame;
        # extract the first column as an array of documents.
        if isinstance(input_df, pd.DataFrame):
            return input_df.iloc[:, 0].values
        return input_df

steps = [
    ('dftodocs', DataFrameToDocs()),
    ('tfidf', TfidfVectorizer()),
    ('nb_clf', MultinomialNB()),
]
pipeline = Pipeline(steps)
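A quick check that the pipeline now accepts both input shapes. This sketch uses hypothetical training documents and labels, and repeats the transformer definition so it runs on its own; the point is that a plain list (notebook) and a single-column DataFrame (serving) produce the same predictions.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class DataFrameToDocs(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, input_df):
        if isinstance(input_df, pd.DataFrame):
            return input_df.iloc[:, 0].values
        return input_df

# Hypothetical toy training data
docs = ["good great fine", "bad awful poor", "great nice good", "poor bad sad"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ('dftodocs', DataFrameToDocs()),
    ('tfidf', TfidfVectorizer()),
    ('nb_clf', MultinomialNB()),
])
pipeline.fit(docs, labels)

# Both input shapes now yield identical predictions.
preds_list = pipeline.predict(["good nice"])
preds_df = pipeline.predict(pd.DataFrame(["good nice"]))
print(preds_list, preds_df)
```

Because the transformer sits at the head of the pipeline, no other step needs to change.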