Problem
When you attempt to register a machine learning model from Hugging Face that was trained outside of Databricks, registration fails with a Failed to infer Schema error message:
```
MLFlowException: Failed to infer Schema. Expected one of the following types:
-pandas.DataFrame
-pandas.Series…
File /databricks/python/lib/python3.11/site-packages/mlflow/types/utils.py:374 in infer_schema(data)...
```
Cause
Databricks expects model artifacts to follow a specific structure. When you call `mlflow.<flavor>.log_model`, MLflow arranges the model's artifacts so the model loads correctly. If you register a model trained outside of Databricks, or fine-tune it with additional data in a Databricks notebook, the artifact structure may not align with Databricks' expectations for Hugging Face models, resulting in the Failed to infer Schema error.
Solution
This issue arises in Databricks environments when working with machine learning models, particularly those trained outside of Databricks. To ensure that the model artifacts are in the structure Databricks expects, use the complete training code from the external environment to retrain the model within Databricks before fine-tuning.
To resolve this issue, follow these steps:
- Configure the MLflow tracking server. In the code used to train the model outside of Databricks, set up the MLflow tracking server so the model can be registered.
- Modify and reorder the artifact folder so that it matches the structure of a Hugging Face model.
- Use MLflow logging. Use `mlflow.<flavor>.log_model` to log the model, which automatically handles the artifact structure (see the sketch after this list).
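As a minimal sketch, assuming MLflow 2.3+ with the transformers flavor installed; the checkpoint path and registered model name are hypothetical placeholders:

```python
import mlflow
from transformers import pipeline

# Hypothetical path -- substitute the directory holding your externally
# trained artifacts (config.json, weights, tokenizer files).
classifier = pipeline("text-classification", model="/dbfs/tmp/my_model")

with mlflow.start_run():
    # log_model packages the pipeline's artifacts in the layout MLflow
    # and Databricks expect, and registers the model in one step.
    mlflow.transformers.log_model(
        transformers_model=classifier,
        artifact_path="model",
        input_example="An example input used to infer the signature.",
        registered_model_name="my_text_classifier",  # hypothetical name
    )
```

Passing an `input_example` lets MLflow infer and store the model signature at logging time, which avoids the schema inference failure described above.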
The typical structure of a Hugging Face model includes (see the sketch after this list for one way to produce this layout):
- `config.json`: Contains the model configuration
- `pytorch_model.bin`: The model weights
- `tokenizer.json` or other tokenizer files: For text processing
- `README.md`: A model card describing the model's purpose and usage
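If you need to rebuild this layout from an externally trained checkpoint, the following sketch shows one way, assuming the transformers library; the checkpoint name and output path are hypothetical:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint -- substitute your own model and tokenizer.
checkpoint = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# save_pretrained writes config.json, the weights file (pytorch_model.bin
# or model.safetensors, depending on the transformers version), and the
# tokenizer files into one directory matching the layout above.
model.save_pretrained("/dbfs/tmp/my_model")      # hypothetical path
tokenizer.save_pretrained("/dbfs/tmp/my_model")
```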
For more details on model structure and creating custom models compatible with the Hugging Face ecosystem, review the Create a custom architecture documentation.
Best practices
Databricks recommends the following best practices when creating models:
- Ensure proper artifact structure. Before registering the model within a Databricks notebook, verify that the artifact structure aligns with Databricks' expectations.
- Understand model signatures. Familiarize yourself with the `infer_signature` and `ModelSignature` APIs to properly define input and output schemas for your models (a sketch follows this list).
- Ensure you are familiar with the MLflow Python API documentation.
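A minimal sketch of `infer_signature`, using hypothetical pandas inputs and outputs for a text-classification model:

```python
import pandas as pd
from mlflow.models import infer_signature

# Hypothetical example input and output for a text-classification model.
model_input = pd.DataFrame({"text": ["I love this product", "Terrible experience"]})
model_output = pd.DataFrame({"label": ["POSITIVE", "NEGATIVE"], "score": [0.98, 0.95]})

# infer_signature derives the input/output schema MLflow needs; pass the
# result to log_model(signature=...) so the schema is stored explicitly.
signature = infer_signature(model_input, model_output)
print(signature)
```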
For more information, review the Track model development using MLflow (AWS | Azure | GCP) documentation.