Problem
You use the following code to parse data with Auto Loader in a notebook.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("useStrictGlobber", "true")
  .option("header", "true")
  .option("sep", ";")
  .option("cloudFiles.schemaLocation", schema_location)
  .load(source_path))
You then receive the following error.
Py4JJavaError: An error occurred while calling o693.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.139.64.10 executor driver): com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 20480
Hint: Number of columns processed may have exceeded limit of 20480 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Cause
The univocity parser, a Java library that Apache Spark uses internally to parse CSV and text files, throws a TextParsingException runtime error when it cannot properly parse text data. This happens when a row is malformed relative to the read configuration. For example, when the configured delimiter character appears many times inside a row that is actually delimited by something else, the parser splits that row into more columns than its 20480-column limit allows.
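For example, the following sketch shows how the configured delimiter changes the parsed column count. The scratch path is an assumption, and in a Databricks notebook spark is already defined.

# Write a hypothetical semicolon-delimited row to a scratch location.
sample_path = "/tmp/delimiter-check"  # assumption: a writable scratch path
spark.createDataFrame([("a;b;c",)], ["value"]).write.mode("overwrite").text(sample_path)

# Read with the wrong delimiter (comma): the whole row collapses into one column.
wrong = spark.read.option("sep", ",").csv(sample_path)
print(len(wrong.columns))  # 1

# Read with the matching delimiter (semicolon): columns parse as expected.
right = spark.read.option("sep", ";").csv(sample_path)
print(len(right.columns))  # 3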
Solution
First, verify that the delimiter used in your read operation matches the delimiter in your input files.
In a notebook, run the following code to confirm your read configuration is accurate. This code uses a semicolon delimiter. If your files use a comma instead, set "," as the second argument to option().
df = spark.read.option("delimiter", ";").csv("</path/to/file.csv>")
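As a quick sanity check, confirm the parsed column count with len(df.columns) or inspect the result with df.printSchema(); a single-column result usually means the configured delimiter does not match the file.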
If you’re unsure which delimiter is in use, open the file directly using the Databricks File System (DBFS) or preview it in the Data tab of the Databricks UI.
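For example, if the file is on DBFS, you can print its first bytes from a notebook with dbutils.fs.head (a sketch; the path is a placeholder):

# Preview the first 1 KB of the file so you can see the delimiter directly.
# dbutils is available in Databricks notebooks; replace the path with your own.
print(dbutils.fs.head("dbfs:/path/to/file.csv", 1024))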
Then make the necessary corrections to the data.
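If only some rows are malformed, one way to locate them is Spark's PERMISSIVE parse mode, which stores unparseable rows in a corrupt-record column instead of failing the job. The sketch below assumes a hypothetical two-column schema and a placeholder path; adapt both to your data.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema: replace col1/col2 with your actual columns. The extra
# _corrupt_record field is where PERMISSIVE mode stores rows it cannot parse;
# for CSV, the field must be declared in the schema to be populated.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
  .schema(schema)
  .option("sep", ";")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("</path/to/file.csv>"))

# Spark disallows queries that reference only the corrupt-record column on raw
# files, so cache the parsed result before filtering on it.
df.cache()
df.filter(F.col("_corrupt_record").isNotNull()).show(truncate=False)

If your rows legitimately contain more than 20480 fields, the CSV reader's maxColumns option can raise the limit mentioned in the error message, but in most cases the root cause is a delimiter mismatch rather than genuinely wide data.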