Problem
You have an Apache Spark job that fails with the Java assertion error java.lang.AssertionError: assertion failed: Conflicting directory structures detected.
Example stack trace
Caused by: org.apache.spark.sql.streaming.StreamingQueryException: There was an error when trying to infer the partition schema of the current batch of files. Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')
=== Streaming Query ===
Identifier: [id = aabc5549-cb4b-4e4e-9403-4e793f4824a0, runId = 4e743dda-909f-4932-9489-3dd0b364d811]
Current Committed Offsets: {}
Current Available Offsets: {CloudFilesSource[<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]: {'seqNum':423,'sourceVersion':1}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
CloudFilesSource[<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]
    at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:385)
    at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:268)
Caused by: java.lang.RuntimeException: There was an error when trying to infer the partition schema of the current batch of files. Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')
    at com.databricks.sql.fileNotification.autoIngest.CloudFilesErrors$.partitionInferenceError(CloudFilesErrors.scala:115)
    at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.liftedTree1$1(CloudFilesSourceFileIndex.scala:65)
    at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.partitionSpec(CloudFilesSourceFileIndex.scala:63)
    at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
    at com.databricks.sql.fileNotification.autoIngest.CloudFilesSource.getBatch(CloudFilesSource.scala:361)
    ... 1 more
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
    <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt
    <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt
If provided paths are partition directories, please set 'basePath' in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
    at scala.Predef$.assert(Predef.scala:223)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:204)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parseP
Cause
You have conflicting directory paths in the storage location.
In the example stack trace, the "Suspicious paths" message lists two conflicting directory paths:
- <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt
- <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt
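Because the second directory is nested inside the first, an update at the root level or at the branch level can leave data files at more than one depth of the same hierarchy. Partition inference then cannot settle on a single root directory, which can result in a conflict.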
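The idea of a conflicting hierarchy can be illustrated with a small, self-contained Python sketch. It is a deliberately simplified approximation and not Spark's actual partition-parsing code: it strips the file name and any trailing key=value directories from each file path, then reports the distinct roots that remain. The function names and the part-*.parquet file names are hypothetical and used only for illustration.

```python
def base_path(file_path):
    """Return the directory left after dropping the file name and any
    trailing key=value (partition-style) directory segments."""
    prefix, sep, rest = file_path.rpartition("://")
    segments = rest.rstrip("/").split("/")
    segments.pop()  # drop the file name
    while segments and "=" in segments[-1]:
        segments.pop()  # drop partition-style directories such as evt=...
    return prefix + sep + "/".join(segments)


def find_roots(file_paths):
    """Return the sorted distinct roots; more than one suggests a conflict."""
    return sorted({base_path(p) for p in file_paths})


root = "<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt"

# Files at the root level and in a nested, non-partition directory.
nested = [
    root + "/part-0000.parquet",
    root + "/clfy_x_clfy_evt/part-0001.parquet",
]

# Files only inside key=value partition directories under one root.
partitioned = [
    root + "/evt=clfy_x_clfy_evt1/part-0000.parquet",
    root + "/evt=clfy_x_clfy_evt2/part-0001.parquet",
]

print(find_roots(nested))
print(find_roots(partitioned))
```

Under these assumptions, the nested layout yields two roots (the same two paths reported in the stack trace), while the partitioned layout yields one.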
Solution
Avoid concurrent updates at different levels of the same directory hierarchy, and avoid concurrent updates to the same partition.
Once a conflict is detected, write updates to multiple distinct paths. Alternatively, you can add more partitions.
These example directories do not conflict, because each one is a key=value partition directory under the same root:
- <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt1
- <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt2
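The error message in the stack trace also suggests naming the partition columns explicitly instead of relying on inference. The following PySpark function is a minimal sketch of that suggestion, not a definitive implementation. It assumes a Databricks runtime where the cloudFiles (Auto Loader) source is available, Parquet input files, a schema location for Auto Loader, and a partition column named evt as in the example directories above. The function name and its parameters are hypothetical.

```python
def read_with_explicit_partitions(spark, source_path, schema_location, partition_columns):
    """Build an Auto Loader stream with explicitly named partition columns.

    spark: an active SparkSession (assumed to run on Databricks).
    source_path: root directory of the table.
    schema_location: directory where Auto Loader stores the inferred schema.
    partition_columns: list of partition column names, for example ["evt"].
    """
    return (
        spark.readStream.format("cloudFiles")
        # Assumption: the source files are Parquet; adjust to your format.
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", schema_location)
        # The option named in the error message, as a comma-separated list.
        .option("cloudFiles.partitionColumns", ",".join(partition_columns))
        .load(source_path)
    )
```

A call might look like read_with_explicit_partitions(spark, "<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt", "<schema-location>", ["evt"]), where the schema location is a placeholder you would replace with your own path. Setting the option does not remove files written at the wrong directory level, so the layout guidance above still applies.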