Problem
You are trying to read audit logs and get an `AnalysisException: Found duplicate column(s)` error.
```
spark.read.format("json").load("dbfs:/mnt/logs/<path-to-logs>/date=2021-12-07")

// AnalysisException: Found duplicate column(s) in the data schema: `<some_column>`
```
Cause
Between November 2021 and December 2021, a limited number of Databricks SQL audit logs were published with duplicate parameter names that differed only in letter case. Because Spark is case-insensitive by default, this breaks schema inference and raises an error when you try to read audit logs generated during this period.
The following parameter names were duplicated:
| Correct name | Duplicate name |
| ------------ | -------------- |
| dataSourceId | DataSourceId |
| alertId      | AlertId |
| dashboardId  | DashboardId |
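A quick way to see the failure mode: the records below are fabricated for illustration (they are not real audit log entries) and use both spellings of `dataSourceId`. With Spark's default case-insensitive setting, schema inference collapses the two spellings into one column name and the read fails with the same exception.

```
%scala
import spark.implicits._

// Two fabricated records that use both spellings of the same parameter.
val sample = Seq(
  """{"requestParams":{"dataSourceId":"abc"}}""",
  """{"requestParams":{"DataSourceId":"def"}}"""
).toDS()

// With the default spark.sql.caseSensitive=false, schema inference treats
// dataSourceId and DataSourceId as the same column and the read fails with
// AnalysisException: Found duplicate column(s) in the data schema.
spark.read.json(sample)
```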
Solution
Create a new cluster to read and repair the audit log files.
Turn case sensitivity on so that schema inference treats the duplicated names as distinct columns and can load the data.
```
%scala
spark.conf.set("spark.sql.caseSensitive", true)

spark.read.format("json").load("dbfs:/mnt/logs/pathToMyLogs/date=2021-12-07").write...

spark.conf.set("spark.sql.caseSensitive", false)
```
After the logs have been processed, turn case sensitivity off.
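If you also want to collapse each duplicated parameter back into its correctly cased name while rewriting, the sketch below shows one possible approach. It is a sketch, not a verbatim repair: the `dbfs:/mnt/logs/...` paths are placeholders, it assumes the duplicates sit under the standard `requestParams` struct, and `withField`/`dropFields` require Spark 3.1 or above.

```
%scala
import org.apache.spark.sql.functions.{coalesce, col}
import org.apache.spark.sql.types.StructType

spark.conf.set("spark.sql.caseSensitive", true)

// Placeholder paths; point these at your own log and output locations.
val logs = spark.read.format("json").load("dbfs:/mnt/logs/pathToMyLogs/date=2021-12-07")

val pairs = Seq(
  "dataSourceId" -> "DataSourceId",
  "alertId"      -> "AlertId",
  "dashboardId"  -> "DashboardId"
)

// Only repair pairs that actually appear in this file's inferred schema.
val present = logs.schema("requestParams").dataType match {
  case s: StructType => pairs.filter { case (correct, dup) =>
    s.fieldNames.contains(correct) && s.fieldNames.contains(dup)
  }
  case _ => Seq.empty
}

// Keep the correctly cased value, fall back to the duplicate if it is null,
// then drop the duplicate field from the struct.
val repaired = present.foldLeft(logs) { case (df, (correct, dup)) =>
  df.withColumn("requestParams",
    col("requestParams")
      .withField(correct, coalesce(col(s"requestParams.$correct"), col(s"requestParams.$dup")))
      .dropFields(dup))
}

repaired.write.format("json").save("dbfs:/mnt/logs/repaired/date=2021-12-07")

spark.conf.set("spark.sql.caseSensitive", false)
```

Filtering on the inferred schema first keeps the fold from referencing struct fields that are absent from a given day's logs.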