Updated May 16th, 2022 by shanmugavel.chandrakasu

H2O.ai Sparkling Water cluster not reachable

Problem: You are trying to initialize H2O.ai’s Sparkling Water on Databricks Runtime 7.0 and above when you get an H2OClusterNotReachableException error message. %scala import ai.h2o.sparkling._ val h2oContext = H2OContext.getOrCreate() ai.h2o.sparkling.backend.exceptions.H2OClusterNotReachableException: H2O cluster X.X.X.X:54321 - sparkling-water-ro...

0 min reading time
Updated October 26th, 2022 by shanmugavel.chandrakasu

Allow spaces and special characters in nested column names with Delta tables

Problem: JSON files commonly contain nested struct columns, and the nested column names can contain spaces. When you use Apache Spark to read or write JSON files with spaces in the nested column names, you get an AnalysisException error message. For example, if you try to read a JSON file, evaluate the DataFrame, and ...

1 min reading time
Updated May 20th, 2022 by shanmugavel.chandrakasu

Apache Spark job fails with Parquet column cannot be converted error

Problem: You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted error message. The cluster is running Databricks Runtime 7.3 LTS or above. org.apache.spark.SparkException: Task failed while writing rows. Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://buc...

0 min reading time
Updated July 22nd, 2022 by shanmugavel.chandrakasu

MLflow 'invalid access token' error

Problem: You have long-running MLflow tasks in your notebook or job, and the tasks do not complete. Instead, they return a (403) Invalid access token error message. Error stack trace: MlflowException: API request to endpoint /api/2.0/mlflow/runs/create failed with error code 403 != 200. Response body: '<html> <head> <meta data-fr-htt...

1 min reading time
Updated December 5th, 2022 by shanmugavel.chandrakasu

Job fails with Spark Shuffle FetchFailedException error

Problem: If your application contains any aggregation or join stages, execution requires a Spark shuffle stage. Depending on the specific configuration used, if you are running multiple streaming queries on an interactive cluster, you may get a shuffle FetchFailedException error. ShuffleMapStage has failed the maximum allowable number of times...

1 min reading time
Updated May 16th, 2022 by shanmugavel.chandrakasu

Download artifacts from MLflow

By default, the MLflow client saves artifacts to an artifact store URI during an experiment. The artifact store URI is similar to /dbfs/databricks/mlflow-tracking/<experiment-id>/<run-id>/artifacts/. This artifact store is an MLflow-managed location, so you cannot download artifacts directly. You must use client.download_artifacts in the ...
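As a rough illustration of the teaser above, the following minimal sketch builds the artifact store path in the format quoted there and then downloads artifacts through the MLflow client. The helper names `artifact_store_path` and `download_run_artifacts`, and the use of a temporary directory as the destination, are assumptions for illustration and do not come from the article; the `download_artifacts` signature may also differ between MLflow versions.

```python
import tempfile
from typing import Optional


def artifact_store_path(experiment_id: str, run_id: str) -> str:
    """Build the managed artifact location in the format quoted in the text.

    Hypothetical helper: it only formats a string and does not touch DBFS.
    """
    return f"/dbfs/databricks/mlflow-tracking/{experiment_id}/{run_id}/artifacts/"


def download_run_artifacts(run_id: str, artifact_path: str,
                           dst_dir: Optional[str] = None) -> str:
    """Download a run's artifacts to a local directory and return the local path.

    Hypothetical helper. Assumes MLflow is installed and a tracking server is
    configured (as it is inside a Databricks notebook).
    """
    # Imported lazily so the path helper above works without MLflow installed.
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    # The destination must be an existing directory; default to a fresh temp dir.
    target = dst_dir or tempfile.mkdtemp()
    return client.download_artifacts(run_id, artifact_path, target)
```

For example, `artifact_store_path("123", "abc")` gives `/dbfs/databricks/mlflow-tracking/123/abc/artifacts/`, which is the location that cannot be read directly and is instead reached through the client call.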

0 min reading time
Updated May 23rd, 2022 by shanmugavel.chandrakasu

from_json returns null in Apache Spark 3.0

Problem: The from_json function is used to parse a JSON string and return a struct of values. For example, if you have the JSON string [{"id":"001","name":"peter"}], you can pass it to from_json with a schema and get parsed struct values in return. %python from pyspark.sql.functions import col, from_json display( df.select(col('value'), from_json(c...

0 min reading time