Conda fails to download packages from Anaconda
Problem: You are attempting to download packages from the Anaconda repository and receive a PackagesNotFoundError error message. This error can occur when using %conda or %sh conda in notebooks, and when using Conda in an init script. Cause: Anaconda Inc. updated the terms of service for repo.anaconda.com and anaconda.org/anaconda. Based on the Anaconda ...
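One common workaround, depending on your licensing situation, is to install from a community channel such as conda-forge instead of the default Anaconda channels; the channel and package name below are illustrative assumptions, not the article's prescribed fix:

%conda install -c conda-forge <package-name>

In an init script, the channel can be switched globally before any installs:

%sh
conda config --add channels conda-forge
conda config --remove channels defaults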
Download artifacts from MLflow
By default, the MLflow client saves artifacts to an artifact store URI during an experiment. The artifact store URI is similar to /dbfs/databricks/mlflow-tracking/<experiment-id>/<run-id>/artifacts/. This artifact store is an MLflow-managed location, so you cannot download artifacts directly. You must use client.download_artifacts in the ...
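A minimal sketch of downloading artifacts through the client; the run ID, artifact path, and destination directory are placeholders:

%python
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Copies the run's artifacts out of the managed store to a local path you can read directly.
local_path = client.download_artifacts("<run-id>", "model", "/tmp/artifacts")
print(local_path)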
How to extract feature information for tree-based Apache SparkML pipeline models
When you are fitting a tree-based model, such as a decision tree, random forest, or gradient boosted tree, it is helpful to review the feature importances along with the feature names. Typically, models in SparkML are fit as the last stage of the pipeline. To extract the relevant feature information from the pipeline with the tree mo...
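A sketch of one common approach, assuming the last pipeline stage is the tree model and the features column carries the metadata written by the feature transformers; pipeline_model and transformed_df are illustrative names:

%python
# pipeline_model is a fitted PipelineModel whose last stage is a tree-based model.
tree_model = pipeline_model.stages[-1]
importances = tree_model.featureImportances

# Recover feature names from the ml_attr metadata attached to the assembled features column.
attrs = transformed_df.schema["features"].metadata["ml_attr"]["attrs"]
pairs = sorted((attr["idx"], attr["name"]) for group in attrs.values() for attr in group)
for idx, name in pairs:
    print(name, importances[idx])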
Fitting an Apache SparkML model throws error
Problem: Databricks throws an error when fitting a SparkML model or Pipeline: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 4 times, most recent failure: Lost task 0.3 in stage 162.0 (TID 168, 10.205.250.130, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfu...
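The cause is truncated above, but one frequent source of a failure inside an anonymous UDF during fitting is a StringIndexer encountering labels it cannot handle. A hedged sketch of that particular mitigation, with illustrative column names; it may not be the cause in your case:

%python
from pyspark.ml.feature import StringIndexer

# "keep" assigns unseen or invalid labels their own index instead of throwing inside the UDF;
# "skip" drops those rows instead.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex", handleInvalid="keep")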
H2O.ai Sparkling Water cluster not reachable
Problem: You are trying to initialize H2O.ai’s Sparkling Water on Databricks Runtime 7.0 and above and get an H2OClusterNotReachableException error message. %scala import ai.h2o.sparkling._ val h2oContext = H2OContext.getOrCreate() ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster X.X.X.X:54321 - sparkling-water-ro...
How to perform group K-fold cross validation with Apache Spark
Cross validation randomly splits the training data into a specified number of folds. To prevent data leakage, where the same data shows up in multiple folds, you can use groups. scikit-learn supports group K-fold cross validation to ensure that the folds are distinct and non-overlapping. On Spark you can use the spark-sklearn library, which distribute...
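For reference, a minimal scikit-learn sketch of group K-fold splitting; the arrays are toy placeholders:

%python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(10, 2)      # features
y = np.arange(10)                     # labels
groups = np.repeat(np.arange(5), 2)   # rows sharing a group id stay in one fold

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # No group appears in both the training and test indices.
    print(train_idx, test_idx)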
MLflow project fails to access an Apache Hive table
Problem: You have an MLflow project that fails to access a Hive table and returns a Table or view not found error. pyspark.sql.utils.AnalysisException: "Table or view not found: `default`.`tab1`; line 1 pos 21;\n'Aggregate [unresolvedalias(count(1), None)]\n+- 'UnresolvedRelation `default`.`tab1`\n" xxxxx ERROR mlflow.cli: === Run (ID 'xxxxx') failed...
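The cause is truncated above; one common culprit is the MLflow project creating its own Spark session without Hive support, so managed tables fail to resolve. A hedged sketch of enabling it explicitly, not necessarily the article's exact fix:

%python
from pyspark.sql import SparkSession

# Build (or reuse) a session with Hive support so catalog tables resolve.
spark = (SparkSession.builder
         .appName("mlflow-project")
         .enableHiveSupport()
         .getOrCreate())
spark.sql("SELECT COUNT(*) FROM default.tab1").show()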
How to speed up cross-validation
Hyperparameter tuning of Apache SparkML models can take a very long time, depending on the size of the parameter grid. You can improve the performance of the cross-validation step in SparkML to speed things up: Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multi...
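A sketch combining the caching suggestion with parallel fold evaluation; train_df, the estimator, and the parameter values are illustrative assumptions:

%python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train_df = train_df.cache()  # cache before any fitting so every fold reuses the data
train_df.count()             # materialize the cache

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=4)  # evaluate candidate models concurrently
cv_model = cv.fit(train_df)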
Hyperopt fails with maxNumConcurrentTasks error
Problem: You are tuning machine learning parameters using Hyperopt when your job fails with a py4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist error. You are using a Databricks Runtime for Machine Learning (Databricks Runtime ML) cluster. Cause: Databricks Runtime ML has a compatible version of Hyperopt pre-installed (AWS | Azure | ...
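For context, a minimal use of the bundled Hyperopt with SparkTrials; the objective, search space, and parallelism are illustrative:

%python
from hyperopt import SparkTrials, fmin, hp, tpe

def objective(x):
    # Toy loss with a minimum at x = 1.
    return (x - 1) ** 2

best = fmin(fn=objective,
            space=hp.uniform("x", -5, 5),
            algo=tpe.suggest,
            max_evals=20,
            trials=SparkTrials(parallelism=2))
print(best)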
Incorrect results when using documents as inputs
Problem: You have an ML model that takes documents as inputs, specifically, an array of strings. You use a feature extractor like TfidfVectorizer to convert the documents into feature vectors, and the vectors are ingested by the model. The model is trained, and predictions happen in the notebook, but model serving doesn’t return the expected results for JS...
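One hedged way to keep training and serving consistent is to log the vectorizer and the model together as a single pipeline, so serving applies the same feature extraction to the raw strings; the data and estimator below are toy assumptions:

%python
import mlflow
import mlflow.sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

docs = ["good product", "bad service"]  # toy documents
labels = [1, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression())])
pipe.fit(docs, labels)

# Logging the whole pipeline means the served model receives raw strings,
# not pre-vectorized features.
with mlflow.start_run():
    mlflow.sklearn.log_model(pipe, "model")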
Experiment warning when custom artifact storage location is used
Problem: When you create an MLflow experiment with a custom artifact location, you get the following warning: Cause: MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. MLflow cannot guarantee the enforcement of access controls on ar...
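For reference, the warning is triggered by experiments created with a custom artifact location, as in the example below; the experiment name and bucket path are placeholders:

%python
import mlflow

# Artifacts written outside MLflow-managed storage are not covered by
# MLflow experiment permissions.
mlflow.create_experiment("/Users/<user>/my-experiment",
                         artifact_location="s3://my-bucket/mlflow-artifacts")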
Experiment warning when legacy artifact storage location is used
Problem: A new icon appears on the MLflow Experiments page with the following open access warning: Cause: MLflow experiment permissions (AWS | Azure | GCP) are enforced on artifacts in MLflow Tracking, enabling you to easily control access to datasets, models, and other files. In MLflow 1.11 and above, new experiments store artifacts in an MLflow-mana...
KNN model using pyfunc returns ModuleNotFoundError or FileNotFoundError
Problem: You have created a scikit-learn model using KNeighborsClassifier and are using pyfunc to run a prediction. For example: %python import mlflow.pyfunc pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type='string') predicted_df = merge.withColumn("prediction", pyfunc_udf(*merge.columns[1:])) predicted_df.collect() The predict...
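The cause is truncated above; errors like this often trace back to the model's environment or supporting files not being packaged when the model was logged. A hedged sketch of pinning the dependencies explicitly at logging time; knn_model and the version pin are illustrative, and pip_requirements requires a recent MLflow version:

%python
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    mlflow.sklearn.log_model(
        knn_model,
        "model",
        pip_requirements=["scikit-learn==1.0.2"],  # pin the training environment
    )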
OSError when accessing MLflow experiment artifacts
Problem: You get an OSError: No such file or directory error message when trying to download or log artifacts using one of the following: MlflowClient.download_artifacts(), mlflow.[flavor].log_model(), mlflow.[flavor].load_model(), or mlflow.log_artifacts(). OSError: No such file or directory: '/dbfs/databricks/mlflow-tracking/<experiment-id>/<run-...
PERMISSION_DENIED error when accessing MLflow experiment artifact
Problem: You get a PERMISSION_DENIED error when trying to access an MLflow artifact using the MLflow client. RestException: PERMISSION_DENIED: User <user> does not have permission to 'View' experiment with id <experiment-id> or RestException: PERMISSION_DENIED: User <user> does not have permission to 'Edit' experiment with id <ex...
Runs are not nested when SparkTrials is enabled in Hyperopt
Problem: SparkTrials is an extension of Hyperopt that allows runs to be distributed to Spark workers. When you start an MLflow run with nested=True in the worker function, the results are supposed to be nested under the parent run. Sometimes the results are not correctly nested under the parent run, even though you started the worker runs with nested=True ...
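For reference, a minimal sketch of the intended pattern, a parent run with nested runs started inside the worker function; the objective and parallelism are illustrative:

%python
import mlflow
from hyperopt import SparkTrials, fmin, hp, tpe

def objective(x):
    # Each evaluation should land under the active parent run.
    with mlflow.start_run(nested=True):
        loss = (x - 1) ** 2
        mlflow.log_metric("loss", loss)
    return loss

with mlflow.start_run():  # parent run
    best = fmin(fn=objective, space=hp.uniform("x", -5, 5),
                algo=tpe.suggest, max_evals=10,
                trials=SparkTrials(parallelism=2))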