Cannot read audit logs due to duplicate columns

Case-sensitive parameter names cause a duplicate column error when reading audit logs.

Written by Adam Pavlacka

Last published at: July 22nd, 2022

Problem

You are trying to read audit logs and get an AnalysisException: Found duplicate column(s) error.

spark.read.format("json").load("dbfs://mnt/logs/<path-to-logs>/date=2021-12-07")
//  AnalysisException: Found duplicate column(s) in the data schema: `<some_column>`

Cause

From November 2021 to December 2021, a limited number of Databricks SQL audit logs were published with duplicate case-sensitive parameter names. This can break schema inference and cause an error when you read audit logs generated during this window.

The following parameter names were duplicated:

| Correct name | Duplicate name |
| ------------ | -------------- |
| dataSourceId | DataSourceId   |
| alertId      | AlertId        |
| dashboardId  | DashboardId    |
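With case sensitivity off (Spark's default), the analyzer resolves both casings in each pair to the same column name, so schema inference finds two fields for one column and fails. Here is a minimal reproduction of the collision, using a hypothetical inline record in place of a real audit log entry:

%scala

import spark.implicits._

// A record containing both casings of the same parameter name
// (hypothetical data, standing in for an affected audit log entry).
val sample = Seq("""{"dataSourceId": "123", "DataSourceId": "123"}""").toDS

// With the default case-insensitive analysis, both keys resolve to the
// same column name and schema inference fails:
spark.conf.set("spark.sql.caseSensitive", false)
spark.read.json(sample)
// AnalysisException: Found duplicate column(s) in the data schema

// With case sensitivity on, the keys are inferred as two distinct columns:
spark.conf.set("spark.sql.caseSensitive", true)
spark.read.json(sample).printSchema
// root
//  |-- DataSourceId: string (nullable = true)
//  |-- dataSourceId: string (nullable = true)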

Solution

Create a new cluster to read and repair the audit log files.

Turn case sensitivity on so that schema inference treats the two casings as distinct columns and can load the data.

%scala

// Enable case sensitivity so schema inference keeps both casings of each
// duplicated column instead of failing.
spark.conf.set("spark.sql.caseSensitive", true)
spark.read.format("json").load("dbfs:/mnt/logs/pathToMyLogs/date=2021-12-07").write...
// Restore the default once the logs have been repaired.
spark.conf.set("spark.sql.caseSensitive", false)

After the logs have been processed, turn case sensitivity off.
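The .write... call above is left incomplete; where and how you write the repaired logs depends on your setup. As one sketch of a full repair (paths are illustrative, and it assumes the duplicated parameters surface as top-level columns; if they sit inside a nested struct such as requestParams, the same coalesce-and-drop idea applies to the struct fields), you could merge each duplicate into its correctly cased column and write the result to a new location:

%scala

import org.apache.spark.sql.functions.coalesce

// Case sensitivity must stay on while the duplicate columns exist.
spark.conf.set("spark.sql.caseSensitive", true)

val logs = spark.read.format("json").load("dbfs:/mnt/logs/pathToMyLogs/date=2021-12-07")

// The affected parameter pairs from the table above.
val pairs = Seq(
  "dataSourceId" -> "DataSourceId",
  "alertId"      -> "AlertId",
  "dashboardId"  -> "DashboardId"
)

val repaired = pairs.foldLeft(logs) { case (df, (correct, dup)) =>
  if (df.columns.contains(dup) && df.columns.contains(correct))
    // Both casings present: keep the first non-null value, drop the duplicate.
    df.withColumn(correct, coalesce(df(correct), df(dup))).drop(dup)
  else if (df.columns.contains(dup))
    // Only the duplicate casing present: rename it to the correct name.
    df.withColumnRenamed(dup, correct)
  else df // This batch of logs is unaffected.
}

// Write the repaired logs to a new location (path is illustrative).
repaired.write.format("json").save("dbfs:/mnt/logs/pathToMyLogsRepaired/date=2021-12-07")

// Restore the default once the repair is done.
spark.conf.set("spark.sql.caseSensitive", false)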

Warning

Enabling case sensitivity on a shared cluster, or on a cluster that performs other workloads, could cause issues with those workloads. This is why you should create a new cluster for the repair.

