Explicit path to data or a defined schema required for Auto Loader

If you do not specify an explicit path to your data or define your data schema, you get an IllegalArgumentException error when you start an Auto Loader job.

Written by Jose Gonzalez

Last published at: October 12th, 2022

Info

This article applies to Databricks Runtime 9.1 LTS and above.

Problem

You are using Auto Loader to ingest data for your ELT pipeline when you get an IllegalArgumentException: Please provide the source directory path with option `path` error message.

You get this error when you start an Auto Loader job if neither the path to the data nor the data schema is defined.

Error:
IllegalArgumentException                 Traceback (most recent call last)
<command-1874749868040573> in <module>
     1 df = (
----> 2    spark
     3    .readStream.format("cloudFiles")
     4    .options(**{
     5        "cloudFiles.format": "csv",
/databricks/spark/python/pyspark/sql/streaming.py in load(self, path, format, schema, **options)
   480            return self._df(self._jreader.load(path))
   481        else:
--> 482            return self._df(self._jreader.load())
   483
   484    def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
  1302
  1303        answer = self.gateway_client.send_command(command)
-> 1304        return_value = get_return_value(
  1305            answer, self.gateway_client, self.target_id, self.name)
  1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
   121                # Hide where the exception came from that shows a non-Pythonic
   122                # JVM exception message.
--> 123                raise converted from None
   124            else:
   125                raise
IllegalArgumentException: Please provide the source directory path with option `path`

Cause

Auto Loader requires you to provide either the path to your data location or a defined data schema. If you provide a path to the data, Auto Loader attempts to infer the data schema. If you do not provide the path, Auto Loader cannot infer the schema, so you must explicitly define the data schema.

For example, if a value for <input-path> is not included in this sample code, the error is generated when you start your Auto Loader job.

%python

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .load()

If a value for <input-path> is included in this sample code, the Auto Loader job can infer the schema when it starts and will not generate the error.

%python

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .load(<input-path>)
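
For illustration only, a filled-in version of this pattern might look like the following sketch. The source directory, schema location, and file format shown here are hypothetical values; replace them with locations and options that match your environment.

%python

# Illustrative sketch: the paths and format below are hypothetical examples.
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .option("cloudFiles.schemaLocation", "/tmp/auto-loader/schema") \
  .load("/tmp/auto-loader/input")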

Solution

You must provide either the path to your data or the data schema when you use Auto Loader.

If you do not specify the path, then you must define the data schema.

For example, this sample code defines the data schema but does not specify a path. Because the schema is defined, the path is optional, and no error is generated when the Auto Loader job starts.

%python

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .schema(<schema>) \
  .load()
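
As a sketch of what defining the schema can look like, the example below builds a simple StructType for a hypothetical two-column CSV source. The column names, types, and file format are assumptions for illustration only.

%python

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for a two-column CSV source (illustration only).
schema = StructType([
  StructField("id", IntegerType(), True),
  StructField("name", StringType(), True)
])

# No path is passed to load(); the explicitly defined schema satisfies
# Auto Loader's requirement, so the IllegalArgumentException is not raised.
df = spark.readStream.format("cloudFiles") \
  .option("cloudFiles.format", "csv") \
  .schema(schema) \
  .load()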