Reading Avro files with Structured Streaming using wildcards in the path fails with error ArrayIndexOutOfBoundsException

Add an option to enable recursively reading bulk Avro files using a wildcard path.

Written by mounika.tarigopula

Last published at: October 23rd, 2024

Problem

When you try to read Avro files with Structured Streaming using wildcards in the path, the read fails with an `ArrayIndexOutOfBoundsException` error.

```
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:178)
```


Cause

By default, Databricks does not perform a recursive file lookup, which means it does not read files in subdirectories of the specified path.


Solution

Add `.option("recursiveFileLookup", "true")` to Apache Spark read commands. This option enables recursive file lookup, so Databricks reads files in subdirectories of the specified path.


Example with Avro files

```scala
val df = spark
  .readStream
  .schema(sourceSchema)
  .option("recursiveFileLookup", "true")
  .format("avro")
  .load(basePath)

display(df)
```


Example with Parquet files

```scala
val df = spark
  .readStream
  .schema(sourceSchema)
  .option("recursiveFileLookup", "true")
  .parquet(basePath)

display(df)
```
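
 

Example with batch reads

The option is not specific to Structured Streaming. A minimal sketch of the batch equivalent, assuming the same `basePath` variable as the examples above; unlike `readStream`, `spark.read` can infer the schema, so no explicit schema is required:

```scala
// Batch read of Avro files, including files nested in subdirectories.
// recursiveFileLookup works the same way for spark.read as for readStream.
val df = spark
  .read
  .option("recursiveFileLookup", "true")
  .format("avro")
  .load(basePath)

display(df)
```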