How to Speed Up Cross-Validation

Hyperparameter tuning of Apache SparkML models can take a very long time, and the time grows with the size of the parameter grid. The following changes improve the performance of the cross-validation step in SparkML:

  • Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multiple times benefit from a cache. Caching is lazy, so call an action (for example, count()) on the DataFrame for the cache to take effect.
  • Increase the parallelism parameter of the CrossValidator, which sets the number of threads to use when running parallel algorithms. The default setting is 1, which evaluates the parameter combinations one at a time. See the CrossValidator documentation for more information.
  • Avoid using the whole pipeline as the estimator inside the CrossValidator unless the featurizers are being tuned along with the model. Running the pipeline inside the CrossValidator executes every pipeline stage again for each parameter combination and fold. If only the model is being tuned, set the model as the estimator inside the CrossValidator and run the featurizers once beforehand.

Note

CrossValidator can be set as the final stage of the pipeline, after the featurizers. The fitted pipeline then contains the best model identified by the CrossValidator.