Job fails with Java IndexOutOfBoundsException error

When groupby() is used together with applyInPandas(), it can generate an exception due to an Arrow buffer limitation.

Written by rakesh.parija

Last published at: December 21st, 2022

Problem

Your job fails with a Java IndexOutOfBoundsException error message:

java.lang.IndexOutOfBoundsException: index: 0, length: <number> (expected: range(0, 0))

When you review the stack trace you see something similar to this:

Py4JJavaError: An error occurred while calling o617.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 2195, 10.207.235.228, executor 0): java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)


Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Cause

This error is caused by an Arrow buffer limitation. When groupby() is used together with applyInPandas(), each group is passed to the pandas function as Arrow data; if that data exceeds the Arrow buffer limit, the job fails with this exception.
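
For reference, this is a minimal sketch of the pattern that can trigger the error. The DataFrame, column names, and pandas function are hypothetical examples; in practice the failure appears when a group's Arrow-serialized data is very large.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per group. The whole group is materialized as pandas data,
    # so very large groups can exceed the Arrow buffer limit.
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)],
    ["key", "value"],
)

result = df.groupby("key").applyInPandas(normalize, schema=df.schema)
result.count()  # the exception surfaces when the job actually runs, e.g. on count()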

Solution

You can work around the issue by setting the following value in your cluster's Spark config (AWS | Azure | GCP):

spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled=true

With this setting enabled, groupby() works correctly with applyInPandas().
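
If you want to confirm the setting from a notebook, the following sketch shows one way to do it. It assumes the flag can be read (and, where permitted, set) at the session level with spark.conf; adding it to the cluster's Spark config, as shown above, remains the documented approach.

# Verify (and optionally set) the workaround flag from a notebook session.
spark.conf.set(
    "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
    "true",
)
print(
    spark.conf.get(
        "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled"
    )
)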