Apache Spark job fails with Parquet column cannot be converted error
Problem You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted error message. The cluster is running Databricks Runtime 7.3 LTS or above. org.apache.spark.SparkException: Task failed while writing rows. Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://buc...
Download artifacts from MLflow
By default, the MLflow client saves artifacts to an artifact store URI during an experiment. The artifact store URI is similar to /dbfs/databricks/mlflow-tracking/<experiment-id>/<run-id>/artifacts/. This artifact store is an MLflow-managed location, so you cannot download artifacts directly. You must use client.download_artifacts in the ...
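A minimal sketch of the client.download_artifacts call mentioned above, assuming the client is an mlflow.tracking.MlflowClient whose download_artifacts method takes a run ID, an artifact path relative to the run's artifact root, and a local destination directory. The helper name download_run_artifacts and its parameters are hypothetical and chosen only for illustration; the full article may take a different approach.

```python
import os
import tempfile


def download_run_artifacts(client, run_id, artifact_path, dst_dir=None):
    """Copy a run's artifacts to a local directory and return the local path.

    `client` is assumed to be an mlflow.tracking.MlflowClient, or any object
    exposing the same download_artifacts(run_id, path, dst_path) method.
    """
    # Fall back to a fresh temporary directory when no destination is given.
    if dst_dir is None:
        dst_dir = tempfile.mkdtemp(prefix="mlflow-artifacts-")
    # The destination must exist before the client writes into it.
    os.makedirs(dst_dir, exist_ok=True)
    # Delegate to the client; it returns the local path of the downloaded artifact.
    return client.download_artifacts(run_id, artifact_path, dst_dir)
```

In a notebook this might be called as download_run_artifacts(MlflowClient(), "<run-id>", "model", "/tmp/artifacts"), with MlflowClient imported from mlflow.tracking. The artifact path and destination shown here are placeholders.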
Job fails with Spark Shuffle FetchFailedException error
Problem If your application contains any aggregation or join stages, the execution will require a Spark Shuffle stage. Depending on the specific configuration used, if you are running multiple streaming queries on an interactive cluster you may get a shuffle FetchFailedException error. ShuffleMapStage has failed the maximum allowable number of times...
H2O.ai Sparkling Water cluster not reachable
Problem You are trying to initialize H2O.ai’s Sparkling Water on Databricks Runtime 7.0 and above when you get an H2OClusterNotReachableException error message. %scala import ai.h2o.sparkling._ val h2oContext = H2OContext.getOrCreate() ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster X.X.X.X:54321 - sparkling-water-ro...
from_json returns null in Apache Spark 3.0
Problem The from_json function is used to parse a JSON string and return a struct of values. For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return. %python from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(c...
MLflow 'invalid access token' error
Problem You have long-running MLflow tasks in your notebook or job and the tasks are not completed. Instead, they return a (403) Invalid access token error message. Error stack trace: MlflowException: API request to endpoint /api/2.0/mlflow/runs/create failed with error code 403 != 200. Response body: '<html> <head> <meta data-fr-htt...
Allow spaces and special characters in nested column names with Delta tables
Problem It is common for JSON files to contain nested struct columns. Nested column names in a JSON file can contain spaces. When you use Apache Spark to read or write JSON files with spaces in the nested column names, you get an AnalysisException error message. For example, if you try to read a JSON file, evaluate the DataFrame, and ...