Problem
When you use the collect() function from the sparklyr package to collect large results from Apache Spark into an R session, you get a java.lang.OutOfMemoryError error message.
Cause
Spark has a 2GB limit on the amount of data that can be collected into an R session. When a dataset exceeds this limit, Spark generates an out of memory error message. This limit is not configurable on the Spark side, so you must use an alternative approach to collect larger datasets.
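For reference, the error typically appears when a plain collect() call pulls a large Spark result into R. The following is a minimal sketch of the failing pattern, assuming an existing Spark connection sc and a large table registered as "df" (both placeholder names):
%r
library(sparklyr)
library(dplyr)

# Assumed connection; adjust the method for your environment
sc <- spark_connect(method = "databricks")

# Collecting a result larger than the 2GB limit into R with collect()
# fails with java.lang.OutOfMemoryError
results <- tbl(sc, "df") %>%
  collect()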
Solution
One way to avoid the out of memory error is to use the arrow_collect() function from sparklyr, which lets you collect data incrementally through a callback function.
You can write a custom function, collect_result(), that leverages this functionality to collect large datasets. The custom function supplies a callback that appends each batch of data to a list, and the batches are then combined with rbindlist() from the data.table package.
To use this custom function, replace collect() with collect_result() in your code:
results <- tbl(sc, "df") %>%
collect_result()
Example collect_result() code
%r
collect_result <- function(tbl, ...) {
  # Accumulate the collected batches in a list
  collected <- list()

  # arrow_collect() streams the result in batches; the callback
  # appends each batch to the list in the enclosing environment
  sparklyr:::arrow_collect(tbl, ..., callback = function(batch_df) {
    collected <<- c(collected, list(batch_df))
  })

  # Combine all batches into a single data.table
  data.table::rbindlist(collected)
}
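As with collect(), any dplyr transformations you chain before collect_result() still run in Spark; only the final result is streamed into the R session in batches. The following usage sketch assumes the same connection sc, a table registered as "df", and a hypothetical column named id:
%r
library(sparklyr)
library(dplyr)

# Assumed connection; adjust the method for your environment
sc <- spark_connect(method = "databricks")

# The filter and count run in Spark; collect_result() then streams
# the aggregated result into R batch by batch
summary_df <- tbl(sc, "df") %>%
  filter(!is.na(id)) %>%   # id is a placeholder column name
  count(id) %>%
  collect_result()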