Reading Avro files with Structured Streaming using wildcards in the path fails with error ArrayIndexOutOfBoundsException

Add an option to enable recursively reading bulk Avro files using a wildcard path.

Written by mounika.tarigopula

Last published at: October 23rd, 2024

Problem

When you try to read Avro files with Structured Streaming using wildcards in the path, the read fails with an `ArrayIndexOutOfBoundsException` error.

```
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:178)
```


Cause

By default, Databricks does not perform a recursive file lookup, which means it does not read files in subdirectories of the specified path.


Solution

Add `.option("recursiveFileLookup", "true")` to Apache Spark read commands. This option enables recursive file lookup, so Databricks reads files in subdirectories of the specified path.


Example with Avro files

```scala
val df = spark
  .readStream
  .schema(sourceSchema)
  .option("recursiveFileLookup", "true")
  .format("avro")
  .load(basePath)

display(df)
```


Example with Parquet files

```scala
val df = spark
  .readStream
  .schema(sourceSchema)
  .option("recursiveFileLookup", "true")
  .parquet(basePath)

display(df)
```
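
 

Example with batch reads

The option is not specific to Structured Streaming. A minimal sketch of the batch equivalent, assuming the same `basePath` variable as the examples above; unlike `readStream`, `spark.read` can infer the schema, so no explicit schema is required:

```scala
// Batch read of Avro files, including files nested in subdirectories.
// recursiveFileLookup works the same way for spark.read as for readStream.
val df = spark
  .read
  .option("recursiveFileLookup", "true")
  .format("avro")
  .load(basePath)

display(df)
```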