How to speed up cross-validation
Hyperparameter tuning of Apache SparkML models can take a very long time, depending on the size of the parameter grid. You can speed up the cross-validation step in SparkML in the following ways:
- Cache the data before running any feature transformations or modeling steps, including cross-validation. Processes that refer to the data multiple times benefit from a cache. Remember to call an action on the DataFrame for the cache to take effect.
- Increase the parallelism parameter inside the CrossValidator, which sets the number of threads to use when running parallel algorithms. The default setting is 1. See the CrossValidator documentation for more information.
- Don’t use the pipeline as the estimator inside the CrossValidator specification. In some cases where the featurizers are being tuned along with the model, running the whole pipeline inside the CrossValidator makes sense. However, this executes the entire pipeline for every parameter combination and fold. Therefore, if only the model is being tuned, set the model specification as the estimator inside the CrossValidator.
Note
CrossValidator can be set as the final stage inside the pipeline, after the featurizers. That stage then outputs the best model identified by the CrossValidator.