You have an ML model that takes documents as inputs, specifically, an array of strings.
You use a feature extractor like TfidfVectorizer to convert the documents to a matrix of TF-IDF features and ingest that matrix into the model.
The model trains and predicts correctly in the notebook, but model serving doesn’t return the expected results for JSON inputs.
TfidfVectorizer expects an array of documents as an input.
Databricks converts inputs to pandas DataFrames, which TfidfVectorizer does not process correctly: iterating over a DataFrame yields its column labels, not the documents stored in its rows.
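The mismatch can be reproduced outside of model serving. The following is a minimal sketch that assumes the serving layer wraps the parsed JSON in something equivalent to pd.DataFrame(parsed_json); that construction and the variable names are assumptions made for illustration, not a description of Databricks internals.

```python
import json

import pandas as pd

# Assumption: serving wraps the parsed JSON payload in a DataFrame
# roughly like this. The real construction may differ.
payload = json.loads('["Hello", "World"]')
served_input = pd.DataFrame(payload)

# TfidfVectorizer iterates over its input to get documents.
# Iterating over a DataFrame yields column labels instead of row values.
seen_as_documents = list(served_input)

# The documents themselves are stored in the first column.
actual_documents = served_input.iloc[:, 0].tolist()

print(seen_as_documents)  # column labels, not text
print(actual_documents)   # the original documents
```

In this sketch the vectorizer would be handed the column labels rather than the two strings, which is why the input needs to be unwrapped before it reaches TfidfVectorizer.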
You must create a custom transformer and add it to the head of the pipeline.
For example, the following sample code checks whether the input is a DataFrame. If it is, the first column is converted to an array of documents. The array of documents is then passed to TfidfVectorizer, and the resulting features are ingested into the model.
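from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class DataFrameToDocs():
    def transform(self, input_df):
        import pandas as pd
        # Serving passes a DataFrame: use its first column as the documents.
        if isinstance(input_df, pd.DataFrame):
            return input_df.iloc[:, 0].values
        else:
            return input_df

    def fit(self, X, y=None, **fit_params):
        return self

steps = [('dftodocs', DataFrameToDocs()), ('tfidf', TfidfVectorizer()), ('nb_clf', MultinomialNB())]
pipeline = Pipeline(steps)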
When input as JSON, both ["Hello", "World"] and [["Hello"],["World"]] return the same output.
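One way to see why the two shapes are equivalent is to look at the DataFrame each one produces. The sketch below again assumes the payload is wrapped with pd.DataFrame, and first_column_docs is a hypothetical helper that takes the first column, as the transformer described above is meant to do.

```python
import json

import pandas as pd


def first_column_docs(frame):
    # Hypothetical helper: return the first column as a list of documents.
    return frame.iloc[:, 0].tolist()


# Assumption: both payloads are wrapped in a DataFrame the same way.
flat = pd.DataFrame(json.loads('["Hello", "World"]'))
nested = pd.DataFrame(json.loads('[["Hello"], ["World"]]'))

docs_flat = first_column_docs(flat)
docs_nested = first_column_docs(nested)

# Both payloads become a single-column DataFrame with two rows,
# so the extracted documents are identical.
print(docs_flat == docs_nested)
```

Because the documents extracted from the first column are identical, everything downstream of the custom transformer receives the same input for both payloads.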