Problem
Your job fails with a Java IndexOutOfBoundsException error message:
java.lang.IndexOutOfBoundsException: index: 0, length: <number> (expected: range(0, 0))
When you review the stack trace, you see something similar to this:
Py4JJavaError: An error occurred while calling o617.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 2195, 10.207.235.228, executor 0): java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
  at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
  at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
  at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
  at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
  at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
  at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
  at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
  at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
  at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
  at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
  at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
  at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
  at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
  at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
  at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
  at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
  at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
  at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
  at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
  at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)
Cause
This error is caused by an Apache Arrow buffer limitation. It occurs when groupby() is used together with applyInPandas(). A sketch of the pattern that triggers it is shown below.
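The following is a minimal sketch of that pattern, not code from the failing job. The DataFrame, column names, and the per-group function are hypothetical; in practice the error typically surfaces with much larger groups of variable-width (string) data.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; the real failure appears with large groups.
df = spark.createDataFrame(
    [(1, "a"), (1, "b"), (2, "c")],
    ["id", "value"],
)

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Arbitrary per-group pandas logic; the specific transformation
    # is not what matters here.
    pdf["value"] = pdf["value"].str.upper()
    return pdf

# groupby() combined with applyInPandas() is the pattern that can hit
# the Arrow buffer limitation described above.
result = df.groupby("id").applyInPandas(normalize, schema=df.schema)
result.count()  # the exception is raised when the job actually runs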
Solution
You can work around the issue by setting the following value in your cluster's Spark config (AWS | Azure | GCP):
spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled=true
This setting allows groupby() used with applyInPandas() to run without hitting the Arrow buffer limitation.
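If you prefer to try the flag from a notebook session rather than the cluster's Spark config, a sketch is shown below. This assumes the flag is honored when set at runtime; if it is not, set it in the cluster's Spark config as shown above and restart the cluster.

# Attempt to enable the workaround at the session level (assumption:
# the flag is respected when set at runtime).
spark.conf.set(
    "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
    "true",
)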