Conflicting directory structures error

Use distinct paths in the storage location; otherwise, conflicting directory structures can cause an error.

Written by ashish

Last published at: May 19th, 2022

Problem

Your Apache Spark job fails with the Java assertion error java.lang.AssertionError: assertion failed: Conflicting directory structures detected.

Example stack trace

Caused by: org.apache.spark.sql.streaming.StreamingQueryException: There was an error when trying to infer the partition schema of the current batch of files. Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')
=== Streaming Query ===
Identifier: [id = aabc5549-cb4b-4e4e-9403-4e793f4824a0, runId = 4e743dda-909f-4932-9489-3dd0b364d811]
Current Committed Offsets: {}
Current Available Offsets: {CloudFilesSource[<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]: {'seqNum':423,'sourceVersion':1}}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
CloudFilesSource[<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:385)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:268)
Caused by: java.lang.RuntimeException: There was an error when trying to infer the partition schema of the current batch of files. Please provide your partition columns explicitly by using: .option('cloudFiles.partitionColumns', 'comma-separated-list')
at com.databricks.sql.fileNotification.autoIngest.CloudFilesErrors$.partitionInferenceError(CloudFilesErrors.scala:115)
at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.liftedTree1$1(CloudFilesSourceFileIndex.scala:65)
at com.databricks.sql.fileNotification.autoIngest.CloudFilesSourceFileIndex.partitionSpec(CloudFilesSourceFileIndex.scala:63)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
at com.databricks.sql.fileNotification.autoIngest.CloudFilesSource.getBatch(CloudFilesSource.scala:361)
... 1 more
Caused by: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt
<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt

If provided paths are partition directories, please set 'basePath' in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:204)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parseP

Cause

You have conflicting directory paths in the storage location.

The example stack trace lists two conflicting directory paths under "Suspicious paths":

  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt
  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/clfy_x_clfy_evt

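The second directory is nested inside the first, and its name is not in the column=value form of a partition directory. Because both directories sit in the same hierarchy, an update at the root level or at the branch level can leave Spark unable to resolve a single root directory for the table during partition inference, which triggers the assertion.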
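The check behind this error can be approximated in a few lines of plain Python. The sketch below is a simplified illustration, not Spark's actual implementation: it assumes that a directory counts as a partition only when its name has the form column=value, strips such trailing segments to find each directory's base path, and reports a conflict when more than one distinct base path remains. The names base_path and conflicting_base_paths are hypothetical helpers introduced for this example.

```python
import re

# Assumption: a directory segment is a partition only if it looks like "column=value".
PARTITION_SEGMENT = re.compile(r"^[^=/]+=[^/]*$")


def base_path(directory):
    """Strip trailing "column=value" segments to approximate the table root."""
    path = directory.rstrip("/")
    while "/" in path:
        parent, _, leaf = path.rpartition("/")
        if not PARTITION_SEGMENT.match(leaf):
            break  # first non-partition segment: this is the base path
        path = parent
    return path


def conflicting_base_paths(directories):
    """Return the sorted distinct base paths if there is more than one, else []."""
    bases = sorted({base_path(d) for d in directories})
    return bases if len(bases) > 1 else []


# Paths taken from the example stack trace and from the non-conflicting example.
root = "<file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt"
print(conflicting_base_paths([root, root + "/clfy_x_clfy_evt"]))
print(conflicting_base_paths([root + "/evt=clfy_x_clfy_evt1",
                              root + "/evt=clfy_x_clfy_evt2"]))
```

The first call reports two base paths, matching the two suspicious paths in the stack trace. The second call returns an empty list, because both evt=... directories resolve to the same root.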

Solution

Avoid concurrent updates at multiple levels of a hierarchical directory structure, and avoid concurrent updates to the same partition.

Once a conflict is detected, write updates to multiple distinct paths. Alternatively, add more partitions.

These example directories do not conflict, because both are partition directories under the same root:

  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt1
  • <file-system>://domain.com/km/gold/cfy_gold/clfy_x_clfy_evt/evt=clfy_x_clfy_evt2
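One way to produce a layout like this is to let Spark create the partition directories when writing. The following is a minimal sketch under stated assumptions: the column name evt comes from the example paths above, the Parquet format is assumed because the article does not state one, and write_partitioned is a hypothetical helper name. It expects df to be a pyspark.sql.DataFrame that contains the partition column.

```python
def write_partitioned(df, table_root, partition_column="evt"):
    """Append df under table_root, one "column=value" subdirectory per value.

    df is expected to be a pyspark.sql.DataFrame containing partition_column.
    """
    (
        df.write
        .format("parquet")              # assumption: format is not stated in the article
        .mode("append")                 # add new files without replacing existing ones
        .partitionBy(partition_column)  # creates evt=<value> directories under table_root
        .save(table_root)
    )
```

With this approach every writer targets the same root, and Spark places the files in evt=<value> subdirectories, so no data files land directly in the root or in a non-partition subdirectory.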