Parallelizing R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often there is existing R code that runs locally and needs to be converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning may not support distributed computing. In such cases, the SparkR UDF API can be used to distribute the desired workload across a cluster.
Example use case: You want to train multiple machine learning models on the same data, for example for hyperparameter tuning. If the data set fits on each worker, it can be more efficient to use the SparkR UDF API to train several versions of the model at once.
The spark.lapply function enables you to perform the same task on multiple workers by running a function over a list of elements. For each element in the list, spark.lapply does the following (a minimal sketch appears after the list):
- Send the function to a worker.
- Execute the function.
- Return the results from all workers to the driver as a list.
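For illustration, here is a minimal sketch of the pattern, assuming SparkR is installed and a Spark cluster is available:

```r
library(SparkR)
sparkR.session()

# Apply the function to every element of the input; each call may run on a
# different worker, and the return values are collected on the driver as a list.
squares <- spark.lapply(1:4, function(x) {
  x^2
})

str(squares)  # an ordinary R list: list(1, 4, 9, 16)
```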
In the following example, a support vector machine model is fit on the iris dataset with 3-fold cross-validation while the cost is varied from 0.5 to 1 in increments of 0.1. The output is a list with the summaries of the models for the various cost parameters.
```r
library(SparkR)

# Train an SVM for each cost value; each call can run on a different worker.
# The model summaries are returned to the driver as a list.
spark.lapply(seq(0.5, 1, by = 0.1), function(x) {
  library(e1071)
  model <- svm(Species ~ ., iris, cost = x, cross = 3)
  summary(model)
})
```
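The call returns an ordinary R list of model summaries on the driver. Note that any package used inside the function (here, e1071) must be installed on every worker, and, as mentioned above, the data set must fit in the memory of a single worker, because each invocation runs as regular local R code.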