java.lang.OutOfMemoryError when using collect() from sparklyr

Use arrow_collect() in a custom function to avoid Spark’s 2GB limit when collecting large datasets in R.

Written by Shyamprasad Miryala

Last published at: March 3rd, 2025

Problem

When you use collect() from the sparklyr package to collect large results from Apache Spark into an R session, you get a java.lang.OutOfMemoryError message.
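
For illustration, the following is a minimal sketch of the pattern that can trigger the error, assuming a Databricks Spark connection and a large table registered as "df" (both names are illustrative):

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# Collecting a large result set into the R session can exceed the 2GB limit
results <- tbl(sc, "df") %>%
  collect()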

 

Cause

Spark has a 2GB limit on the amount of data that can be collected into an R session. When the dataset exceeds this limit, Spark generates an out-of-memory error. This limit is not configurable on the Spark side, so you must use an alternative approach to collect larger datasets.

 

Solution

One way to avoid the out-of-memory error is to use the arrow_collect() function from sparklyr, which lets you collect data incrementally through a callback function.

You can write a custom function, collect_result(), that leverages this functionality to collect large datasets.

This custom function works by specifying a callback function that appends each batch of data to a list, which is then combined using rbindlist() from the data.table package.

To use this custom function, replace collect() with collect_result() in your code:

results <- tbl(sc, "df") %>%
  collect_result()

Example collect_result() code

%r

collect_result <- function(tbl, ...) {
  # Accumulate each batch of rows as it arrives
  collected <- list()

  # arrow_collect() streams the result in batches; the callback appends
  # each batch (a data frame) to the list instead of collecting everything at once
  sparklyr:::arrow_collect(tbl, ..., callback = function(batch_df) {
    collected <<- c(collected, list(batch_df))
  })

  # Combine all batches into a single data.table
  data.table::rbindlist(collected)
}
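
The result returned by collect_result() is a data.table; if your downstream code expects a standard data frame or tibble, you can convert it with as.data.frame() or tibble::as_tibble(). Note that arrow_collect() is an internal sparklyr function (accessed with :::), so its behavior may change between sparklyr releases, and this approach typically requires the arrow and data.table packages to be installed in your R environment.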