Tackling schema issues that arise for ML models trained outside of Databricks

Use the code from the external environment to retrain the model within Databricks before fine-tuning.

Written by Tarun Sanjeev

Last published at: April 29th, 2025

Problem

When you attempt to register a Hugging Face machine learning model that was trained outside of Databricks, registration fails with a Failed to infer Schema error:

`MLFlowException: Failed to infer Schema. Expected one of the following types:
-pandas.DataFrame
-pandas.Series…
File /databricks/python/lib/python3.11/site-packages/mlflow/types/utils.py:374 in infer_schema(data)...`

 

Cause

Databricks expects model artifacts to follow a specific structure. When you call mlflow.<flavor>.log_model, MLflow arranges the model's artifacts so that they load correctly. If you register a model trained outside of Databricks, or fine-tune such a model with additional data in a Databricks notebook, the artifact structure may not match what Databricks expects for Hugging Face models, which produces the Failed to infer Schema error.

 

Solution

This issue arises in Databricks environments when working with machine learning models, particularly those trained outside of Databricks. To ensure that the model artifacts end up in the structure Databricks expects, use the full training code from the external environment to retrain the model within Databricks before fine-tuning.

 

To resolve this issue, follow these steps:

  1. Configure the MLflow tracking server. In the code used to train the model outside of Databricks, set up the MLflow tracking server so the model is registered in Databricks.
  2. Reorganize the artifact folder so that it matches the structure of a Hugging Face model.
  3. Use MLflow logging. Use mlflow.<flavor>.log_model to log the model, which automatically handles the artifact structure.

 

The typical structure of a Hugging Face model includes:

  • config.json: Contains the model configuration
  • pytorch_model.bin: The model weights
  • tokenizer.json or other tokenizer files: For text processing
  • README.md: A model card describing the model's purpose and usage
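Before registering, you can sanity-check a local artifact folder against the structure listed above. The following helper is a simple sketch (the file names are the conventional ones; newer checkpoints may ship model.safetensors instead of pytorch_model.bin, which the check allows for):

```python
import os

# Files a Hugging Face-style model directory conventionally contains.
REQUIRED = ["config.json"]
WEIGHT_FILES = ["pytorch_model.bin", "model.safetensors"]


def check_hf_layout(model_dir: str) -> list:
    """Return a list of problems found in the artifact layout (empty = OK)."""
    problems = []
    files = set(os.listdir(model_dir))
    for name in REQUIRED:
        if name not in files:
            problems.append(f"missing {name}")
    if not files.intersection(WEIGHT_FILES):
        problems.append(
            "missing model weights (pytorch_model.bin or model.safetensors)"
        )
    if not any(f.startswith("tokenizer") or f == "vocab.txt" for f in files):
        problems.append("missing tokenizer files")
    return problems
```

Run the helper on the folder you plan to log; an empty list means the directory matches the expected layout.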

 

For more details on model structure and creating custom models compatible with the Hugging Face ecosystem, review the Create a custom architecture documentation.

 

Best practices

Databricks recommends the following best practices while creating the models:

  • Ensure proper artifact structure. Before registering the model within a Databricks notebook, verify that the artifact structure aligns with Databricks' expectations.
  • Understand model signatures. Familiarize yourself with the infer_signature helper and the ModelSignature class to properly define input and output schemas for your models.
  • Review the MLflow Python API documentation.

 

For more information, review the Track model development using MLflow (AWS | Azure | GCP) documentation.