Problem
Databricks throws an error when fitting a SparkML model or Pipeline:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 4 times, most recent failure: Lost task 0.3 in stage 162.0 (TID 168, 10.205.250.130, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)
Cause
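Errors raised while fitting a SparkML model or Pipeline are often caused by problems in the training data. In the example above, the failing function has the signature (string) => double, which suggests that a value in a string column could not be converted to a number.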
Solution
Check for the following issues:
- Identify and address NULL values in the dataset. Spark needs to be told how to handle missing values. Use one of the following approaches:
  - Discard rows with missing values with dropna().
  - Impute a value such as zero or the column average. Which value is appropriate depends on what is meaningful for the dataset.
- Ensure that all training data is appropriately transformed to a numeric format. Spark needs to know how to handle categorical and string variables. A variety of feature transformers are available to address data-specific cases.
- Check for collinearity. Highly correlated or duplicate features may cause issues with model fitting. This is rare, but you should rule it out.