Incorrect results when using documents as inputs
Problem
You have an ML model that takes documents as input, specifically an array of strings.
You use a feature extractor like TfidfVectorizer to convert the documents into a numeric feature matrix that is fed into the model.
The model trains and predicts correctly in the notebook, but model serving does not return the expected results for JSON inputs.
Cause
TfidfVectorizer expects an array of documents as input.
Databricks model serving converts JSON inputs to Pandas DataFrames, which TfidfVectorizer does not process correctly.
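You can see the mismatch directly in pandas (a minimal sketch; the column label 0 comes from pandas' default integer column naming for a single-column payload):

```python
import pandas as pd

# Model serving wraps the JSON payload in a DataFrame.
df = pd.DataFrame(["Hello", "World"])

# Iterating a DataFrame yields its column labels, not its rows, so
# TfidfVectorizer would see the integer label 0 instead of the documents.
print(list(df))            # [0]
print(list(df[0].values))  # ['Hello', 'World'] -- what the vectorizer needs
```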
Solution
You must create a custom transformer and add it at the head of the pipeline.
For example, the following sample code checks whether the input is a DataFrame. If it is, the first column is converted to an array of documents, which is then passed to TfidfVectorizer before reaching the model.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class DataFrameToDocs():
    def transform(self, input_df):
        # Model serving wraps JSON inputs in a DataFrame; extract the
        # first column as an array of documents.
        if isinstance(input_df, pd.DataFrame):
            return input_df[0].values
        else:
            return input_df

    def fit(self, X, y=None, **fit_params):
        return self

steps = [('dftodocs', DataFrameToDocs()), ('tfidf', TfidfVectorizer()), ('nb_clf', MultinomialNB())]
pipeline = Pipeline(steps)
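A quick end-to-end check of the approach (a self-contained sketch; the toy corpus and labels are illustrative, not from the original article):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class DataFrameToDocs():
    def transform(self, input_df):
        # Unwrap a served DataFrame into an array of documents.
        if isinstance(input_df, pd.DataFrame):
            return input_df[0].values
        return input_df

    def fit(self, X, y=None, **fit_params):
        return self

# Illustrative toy corpus and labels.
docs = ["good movie", "bad movie", "great film", "awful film"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([('dftodocs', DataFrameToDocs()),
                     ('tfidf', TfidfVectorizer()),
                     ('nb_clf', MultinomialNB())])
pipeline.fit(docs, labels)

# A plain list and a single-column DataFrame (the shape model serving
# produces) now yield identical predictions.
preds_list = pipeline.predict(["good film", "bad film"])
preds_df = pipeline.predict(pd.DataFrame(["good film", "bad film"]))
assert (preds_list == preds_df).all()
```

Because the transformer passes non-DataFrame inputs through unchanged, the same pipeline works both in the notebook (lists of strings) and behind the serving endpoint (DataFrames).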
Note
When the input is sent as JSON, both ["Hello", "World"] and [["Hello"], ["World"]] return the same output, because both shapes deserialize to the same single-column DataFrame.
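This equivalence can be checked directly with pandas (a minimal sketch of how the two JSON payload shapes deserialize):

```python
import pandas as pd

# ["Hello", "World"] -> one column of two rows
flat = pd.DataFrame(["Hello", "World"])
# [["Hello"], ["World"]] -> also one column of two rows
nested = pd.DataFrame([["Hello"], ["World"]])

# Identical frames, so the custom transformer extracts the same documents.
assert flat.equals(nested)
assert list(flat[0].values) == ["Hello", "World"]
```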