Incorrect results when using documents as inputs

Problem

You have a machine learning model that takes documents as input, specifically, an array of strings.

You use a feature extractor like TfidfVectorizer to convert the array of documents into a numeric feature matrix and feed that matrix into the model.

The model trains and predicts correctly in a notebook, but model serving does not return the expected results for JSON inputs.

Cause

TfidfVectorizer expects an array of documents (strings) as input.

Databricks model serving converts JSON inputs to pandas DataFrames, which TfidfVectorizer does not process correctly.
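The mismatch is easy to reproduce outside of model serving. The following sketch (using hypothetical two-document data) shows that TfidfVectorizer handles a list of strings correctly, but fails on the same data wrapped in a DataFrame, because iterating a DataFrame yields its column labels rather than its rows.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Hello world", "Goodbye world"]

# Works: iterating a list of strings yields one document per row.
shape = TfidfVectorizer().fit_transform(docs).shape
print(shape)  # (2, 3): two documents, three vocabulary terms

# Fails: iterating a DataFrame yields its column labels, not its rows,
# so the vectorizer never sees the documents.
df = pd.DataFrame(docs)
try:
    TfidfVectorizer().fit_transform(df)
    error = None
except Exception as exc:
    error = type(exc).__name__
print(error)
```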

Solution

You must create a custom transformer and add it to the head of the pipeline.

For example, the following sample code checks whether the input is a DataFrame. If it is, the first column is converted to an array of documents. That array is then passed to TfidfVectorizer before reaching the classifier.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class DataFrameToDocs():
    def fit(self, X, y=None, **fit_params):
        # No fitting needed; return self so the pipeline can chain calls.
        return self

    def transform(self, input_df):
        # Model serving delivers JSON inputs as a pandas DataFrame;
        # extract the first column as a 1-D array of documents.
        if isinstance(input_df, pd.DataFrame):
            return input_df[0].values
        return input_df

steps = [('dftodocs', DataFrameToDocs()), ('tfidf', TfidfVectorizer()), ('nb_clf', MultinomialNB())]
pipeline = Pipeline(steps)
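With the custom transformer at the head of the pipeline, notebook-style inputs (a list of strings) and serving-style inputs (a DataFrame) produce the same predictions. A minimal end-to-end sketch, using hypothetical toy documents and labels:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class DataFrameToDocs():
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, input_df):
        # Unwrap the DataFrame that model serving builds from JSON input.
        if isinstance(input_df, pd.DataFrame):
            return input_df[0].values
        return input_df

# Hypothetical toy training data.
docs = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([('dftodocs', DataFrameToDocs()),
                     ('tfidf', TfidfVectorizer()),
                     ('nb_clf', MultinomialNB())])
pipeline.fit(docs, labels)

# Notebook-style input and serving-style input now agree.
list_preds = pipeline.predict(docs)
df_preds = pipeline.predict(pd.DataFrame(docs))
print((list_preds == df_preds).all())  # True
```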

Note

When the input is sent as JSON, both ["Hello", "World"] and [["Hello"], ["World"]] return the same output, because both are converted to the same single-column DataFrame.
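You can verify this equivalence directly: pandas builds an identical one-column DataFrame from a flat list of strings and from a list of single-element lists.

```python
import pandas as pd

# Both JSON shapes become the same single-column DataFrame.
a = pd.DataFrame(["Hello", "World"])
b = pd.DataFrame([["Hello"], ["World"]])
print(a.equals(b))  # True
```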