Problem
When you train and register a model using Delta tables in Unity Catalog (UC), you can see the lineage graph, but the source Delta tables used to train the model do not appear in it.
Cause
Table-to-model lineage requires MLflow 2.11.0 or above, which is included in Databricks Runtime 15.3 and above.
Solution
If you’re using Databricks Runtime 15.3 or above, first load the UC Delta tables with mlflow.data.load_delta so that they can appear in the lineage graph, using the following code.
import mlflow

train_spark = mlflow.data.load_delta(table_name="<catalog.schema.training-table-name>")
test_spark = mlflow.data.load_delta(table_name="<catalog.schema.test-data-table>")
Then, convert the Spark DataFrames to pandas so the core model can take them as inputs. Create X_train, X_test, y_train, and y_test using the following code.
X_train = train_spark.df.toPandas().drop(["<column-to-be-predicted>"], axis=1)
X_test = test_spark.df.toPandas().drop(["<column-to-be-predicted>"], axis=1)
y_train = train_spark.df.select("<column-to-be-predicted>").toPandas()
y_test = test_spark.df.select("<column-to-be-predicted>").toPandas()
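The assignments above collect each table to the driver twice, once for the features and once for the label. As an optional variation, the split can be done on a single pandas DataFrame. The following is a minimal sketch, not part of the original solution: the helper name split_features_label and the sample column names are hypothetical, and the small in-memory DataFrame only stands in for the result of train_spark.df.toPandas(). Note that it returns the label as a pandas Series, whereas the code above produces a one-column DataFrame.

```python
import pandas as pd


def split_features_label(pdf: pd.DataFrame, label_col: str):
    """Split one pandas DataFrame into (features, label).

    Hypothetical helper for illustration; label_col is the column to be predicted.
    """
    if label_col not in pdf.columns:
        raise KeyError(f"Column {label_col!r} not found in DataFrame")
    features = pdf.drop(columns=[label_col])  # every column except the label
    label = pdf[label_col]                    # the label as a pandas Series
    return features, label


# Tiny in-memory stand-in for train_spark.df.toPandas() (made-up values).
sample = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0],
        "feature_b": [0.5, 0.1, 0.9],
        "label": [0, 1, 0],
    }
)
X_sample, y_sample = split_features_label(sample, "label")
```

In a notebook this could be called as X_train, y_train = split_features_label(train_spark.df.toPandas(), "<column-to-be-predicted>"), and likewise for the test table.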
Finally, inside the MLflow run, log both datasets as inputs.
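with mlflow.start_run(run_name='untuned_random_forest'):
    ...
    # Fit on the pandas data created in the previous step.
    model.fit(X_train, y_train)
    # Log the datasets loaded with mlflow.data.load_delta so the tables are recorded as run inputs.
    mlflow.log_input(train_spark, "training")
    mlflow.log_input(test_spark, "test")
    ...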
If you do not want to use Databricks Runtime 15.3 or above, first install MLflow 2.11.0 or above manually, then follow the steps in the previous part of the solution.
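Before relying on a manual install, it can help to confirm which MLflow version the notebook actually sees. The following is a minimal sketch under stated assumptions: the function names version_tuple and supports_table_lineage are hypothetical, the comparison only looks at the first three numeric parts of the version string (so a pre-release such as 2.11.0rc1 is treated the same as 2.11.0), and the 2.11.0 threshold is taken from the Cause section above.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum MLflow version for table-to-model lineage, per the Cause section.
MIN_MLFLOW = (2, 11, 0)


def version_tuple(version_str: str) -> tuple:
    """Return the first three numeric parts of a version string, padded with zeros."""
    parts = [int(p) for p in re.findall(r"\d+", version_str)[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_table_lineage(version_str: str) -> bool:
    """True if the given MLflow version is at least MIN_MLFLOW (hypothetical helper)."""
    return version_tuple(version_str) >= MIN_MLFLOW


try:
    installed = version("mlflow")
    print(f"MLflow {installed}: lineage supported = {supports_table_lineage(installed)}")
except PackageNotFoundError:
    # MLflow is not installed in this environment.
    print("MLflow is not installed")
```

If the check reports an older version, upgrading in the notebook (for example with %pip install "mlflow>=2.11.0") and restarting the Python process should be enough before repeating the steps above.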