Problem
You are using Auto Loader to ingest data for your ELT pipeline when you get an IllegalArgumentException: Please provide the source directory path with option `path` error message.
You get this error when you start an Auto Loader job if neither the path to the data nor the data schema is defined.
Error:

IllegalArgumentException                  Traceback (most recent call last)
<command-1874749868040573> in <module>
      1 df = (
----> 2   spark
      3   .readStream.format("cloudFiles")
      4   .options(**{
      5     "cloudFiles.format": "csv",

/databricks/spark/python/pyspark/sql/streaming.py in load(self, path, format, schema, **options)
    480             return self._df(self._jreader.load(path))
    481         else:
--> 482             return self._df(self._jreader.load())
    483
    484     def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    121                 # Hide where the exception came from that shows a non-Pythonic
    122                 # JVM exception message.
--> 123                 raise converted from None
    124             else:
    125                 raise

IllegalArgumentException: Please provide the source directory path with option `path`
Cause
Auto Loader requires you to provide the path to your data location, or for you to define the schema. If you provide a path to the data, Auto Loader attempts to infer the data schema. If you do not provide the path, Auto Loader cannot infer the schema and requires you to explicitly define the data schema.
For example, if a value for <input-path> is not included in this sample code, the error is generated when you start your Auto Loader job.
%python

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .load()
If a value for <input-path> is included in this sample code, the Auto Loader job can infer the schema when it starts and will not generate the error.
%python

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .load(<input-path>)
Solution
You must provide either the path to your data or the data schema when using Auto Loader.

If you do not specify the path, you must explicitly define the data schema.
For example, this sample code defines the data schema but does not specify a path. Because the schema is defined, the path is optional, and starting the Auto Loader job does not generate an error.
%python

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .schema(<schema>) \
  .load()
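As a concrete sketch of this pattern, the call above can be wrapped in a small helper that always supplies a schema, so it works whether or not a path is given. The schema string, column names, and the `read_orders` helper below are hypothetical examples, not part of this article; adjust them to your own data.

```python
# Hypothetical example: the column names and the read_orders helper are
# illustrative only. A DDL-style schema string is one valid way to pass
# a schema; with it defined, .load() does not require a path.
order_schema = "order_id INT, customer STRING, amount DOUBLE, order_ts TIMESTAMP"

def read_orders(spark, input_path=None):
    """Build an Auto Loader streaming DataFrame.

    Works with or without a path because the schema is always supplied;
    if input_path is given, Auto Loader also discovers files under it.
    """
    reader = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .schema(order_schema)
    )
    # Passing a path is still recommended so Auto Loader knows where to
    # find new files; the schema only removes the inference requirement.
    return reader.load(input_path) if input_path is not None else reader.load()
```

On a Databricks cluster you would call this with the notebook's built-in `spark` session, for example `df = read_orders(spark, <input-path>)`.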