About the script
- The AIM enablement prep script assists account admins in identifying and resolving External ID and group membership divergences between Databricks and your IdP.
- The script also identifies which workspaces are identity-federated. Since AIM only functions within identity-federated workspaces, it is optional but recommended for customers to enable identity federation across all workspaces in the account.
What are divergences?
Automatic Identity Management relies on provisioned identities having an externalId that matches the corresponding principal’s Object ID (unique ID for principal identification) in the Identity Provider (IdP). The externalId tells Databricks which principal in your Identity Provider a given Databricks identity corresponds to. Missing or incorrectly populated externalId values can prevent identity metadata from syncing between your IdP and Databricks and cause duplicate identities to appear in certain parts of the product.
Additionally, group membership in your IdP and in Databricks may have diverged, because group members in Databricks are mutable. This divergence in group memberships can create complications when turning off SCIM.
AIM is also only available for identity federation-enabled workspaces. Workspaces without identity federation enabled will continue to work when AIM is enabled, but without the benefits provided by AIM.
To help you identify these issues, our team has built a Python script that runs as a notebook, which audits the identities in your Databricks account and flags potential divergences.
Note
Account admin privileges are required to run this script.
Why does this matter?
The script helps admins discover identities provisioned in Databricks whose external IDs do not have a corresponding match in Entra ID. It also helps detect divergences between Databricks and Entra ID group memberships. Some illustrative issues that the tool helps detect and resolve are:
Issue A: Duplicate identities appearing in the product
- If two identities with the same name but different sources appear in the Databricks admin UIs, it often means that externalIds are misconfigured.
- This will cause one account identity and one IdP identity to appear in admin UIs and sharing modals.
- Which error categories should I fix to resolve this issue?
EXTERNAL_ID_NOT_IN_IDP, EXTERNAL_ID_MATCH_NAME_MISMATCH, NAME_MATCH_EXTERNAL_ID_MISMATCH
Issue B: Group members count in the IdP doesn’t match the count displayed in the Databricks UIs
- The Databricks UIs show the member count from the IdP.
- This means members that exist in the Databricks group but not in the IdP group will not appear in the count (even though the membership still works for permissions).
- Which error categories should I fix to resolve this issue?
GROUP_HAS_LOCAL_MEMBERS_WITHOUT_EXTERNAL_ID, GROUP_HAS_LOCAL_MEMBERS_WITH_EXTERNAL_ID
Issue C: Provisioning an IdP group is failing
- When attempting to import an IdP group, you may face an error that says the group already exists in Databricks.
- This is likely caused by an existing account group that is reserving the name, because Databricks enforces unique group names.
- Which error categories should I fix to resolve this issue?
NAME_MATCH_EXTERNAL_ID_MISMATCH
Tool overview
This tool scans all users, groups, and service principals provisioned in a Databricks account and logs inconsistencies with their data from the IdP. The logs are written as CSV reports in the divergence/results folder.
The program runs in three phases:
- Phase 1 (Workspaces compatibility check):
  - Lists all workspaces.
  - Analyzes workspaces to find those that are incompatible with AIM.
- Phase 2 (Gather identities):
  - Fetches the IDs of all the provisioned identities in Databricks. If TARGET_IDENTITIES is configured, it only fetches the identities specified by TARGET_IDENTITIES.
  - Writes the IDs to intermediate CSVs.
  - If interrupted, this phase restarts from scratch on the next run.
- Phase 3 (Identities divergence check):
  - Reads the gathered IDs.
  - Compares each identity with its IdP counterpart in concurrent batches.
  - Analyzes the responses of the endpoint and writes Databricks provisioned identities with divergences from the IdP to the output CSVs.
  - Progress is saved after every batch, so if the program crashes, it resumes from the last completed batch.
  - For Automatic Identity Management (AIM) enabled accounts, this performs a sync with the Identity Provider to preemptively fix issues.
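The batch-and-resume behavior described for Phase 3 can be illustrated with a minimal sketch. This is not the script's actual code: the function name `run_in_batches`, the `check_one` callback, the batch size, and the progress file layout (`{"completed_batches": <int>}`) are all assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_in_batches(ids, check_one, progress_path, batch_size=50, max_workers=8):
    """Check ids in concurrent batches, skipping batches finished in an earlier run."""
    progress_file = Path(progress_path)

    # Hypothetical progress format: {"completed_batches": <int>}.
    completed = 0
    if progress_file.exists():
        completed = json.loads(progress_file.read_text())["completed_batches"]

    batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for index, batch in enumerate(batches):
            if index < completed:
                continue  # already processed before the interruption
            # pool.map runs the checks concurrently and keeps input order.
            results.extend(pool.map(check_one, batch))
            # Save progress only once the whole batch has succeeded.
            progress_file.write_text(json.dumps({"completed_batches": index + 1}))
    return results
```

Because progress is written only after a batch completes, a crash in the middle of a batch causes that batch to be retried in full on the next run, which matches the "resumes from the last completed batch" behavior described above.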
Setup and running the tool
- Download the divergence ZIP file. In any workspace, click New > Notebook > File > Import. Select the ZIP file and click Import. Open the divergence folder.
- Provide the script credentials to authenticate API calls. To do so, create an account-level service principal, give it an account admin role, and generate an OAuth secret. This is all available from the account admin UI. Then run the following commands via the CLI:
    databricks auth login --host <workspace_url>
    databricks secrets create-scope divergence
    databricks secrets put-secret divergence client_id --string-value <CLIENT_ID>
    databricks secrets put-secret divergence client_secret --string-value <CLIENT_SECRET>
- Open python/config.py and fill in the following:
  - ACCOUNTS_HOST: Your Databricks accounts console URL without any additional URL parameters (for example, https://accounts.azuredatabricks.net).
  - ACCOUNT_ID: Your Databricks account ID.
  - SECRET_SCOPE: The name of the secret scope you created above.
- Optionally adjust:
  - INCLUDE_USERS, INCLUDE_GROUPS, INCLUDE_SERVICE_PRINCIPALS: Default True. Set to False to skip that identity type.
  - TARGET_IDENTITIES: Run only for specific identities instead of a full scan.
- Go to the run_divergence notebook and click Run all.
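As an illustration of the configuration step, the sketch below shows what a filled-in set of values might look like. The setting names come from the list above. The helper `normalize_accounts_host`, the https-only check, the sample query string, and the placeholder values are assumptions and are not part of the shipped config.py.

```python
from urllib.parse import urlsplit


def normalize_accounts_host(url: str) -> str:
    """Return scheme://host only, dropping any path, query string, or fragment."""
    parts = urlsplit(url.strip())
    # Assumption: the accounts console is always reached over https.
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https URL, got {url!r}")
    return f"{parts.scheme}://{parts.netloc}"


# Placeholder values for illustration; replace them with your own.
ACCOUNTS_HOST = normalize_accounts_host("https://accounts.azuredatabricks.net/?example=1")
ACCOUNT_ID = "<ACCOUNT_ID>"
SECRET_SCOPE = "divergence"  # the scope created with `databricks secrets create-scope`

INCLUDE_USERS = True
INCLUDE_GROUPS = True
INCLUDE_SERVICE_PRINCIPALS = False  # example: skip service principals in this run
```

Stripping the path and query string up front is one way to satisfy the "without any additional URL parameters" requirement when the URL is copied from a browser address bar.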
The script refreshes its access token automatically in the background every 30 minutes, so long runs do not require manual intervention. If it stops during phase 3, click Run all again to resume from the last completed batch. To restart the run from the beginning, delete the results folder.
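To make the token refresh idea concrete, here is a minimal sketch. It swaps the background refresh described above for a simpler lazy refresh that happens when the token is requested, and it is not the script's implementation. `TokenCache`, `fetch_token`, and the injectable `clock` are hypothetical names.

```python
import time

REFRESH_INTERVAL_SECONDS = 30 * 60  # the 30-minute interval mentioned above


class TokenCache:
    """Hands out a cached token and fetches a new one once it is 30 minutes old."""

    def __init__(self, fetch_token, clock=time.monotonic):
        self._fetch_token = fetch_token  # hypothetical callable returning a fresh token
        self._clock = clock              # injectable so the behavior can be tested
        self._token = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._token is None
            or now - self._fetched_at >= REFRESH_INTERVAL_SECONDS
        )
        if expired:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

Each API call would ask the cache for a token instead of holding one for the whole run, so a scan that takes several hours never sends a token older than the refresh interval.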
Interpreting tool output
The script writes the following files to the divergence/results folder.
- divergence_workspaces.csv
  - The workspaces that have incompatibilities with AIM.
- identities_to_process_<principal_type>.csv
  - IDs of the identities that went through the divergence check.
  - One file per principal type.
- idp_divergence_<principal_type>.csv
  - The identities that had divergences with their IdP counterpart; see exact output columns below.
  - Only identities with divergences are written to this file. Identities with no divergences are omitted.
  - One file per principal type.
- idp_divergence_failures.csv
  - The identities that failed the divergence check (and all retries).
  - Principals of all types are aggregated in this file.
- idp_divergence_progress.json
  - Temporary progress file for progress tracking and crash recovery.
  - Can typically be ignored.
Workspaces (divergence_workspaces.csv) output columns
- workspaceId: The Databricks workspace ID.
- errorCategories: Semicolon-separated error category names (see below).
Users (idp_divergence_users.csv) output columns
- id: The Databricks internal ID.
- username: The username of the provisioned Databricks user.
- externalId: The external ID stored in Databricks on the provisioned Databricks user.
- externalIdWithUsernameMatch: Semicolon-separated external IDs of IdP users that are matched by username.
- errorCategories: Semicolon-separated error category names (see below).
Groups (idp_divergence_groups.csv) output columns
- id: The Databricks internal ID.
- groupName: The name of the provisioned Databricks group.
- externalId: The external ID stored in Databricks on the provisioned Databricks group.
- externalIdsWithGroupnameMatch: Semicolon-separated external IDs of IdP groups that are matched by group name.
- localMembersNotInIdpInternalIds: Semicolon-separated internal IDs of group members that exist only in Databricks and have no external ID.
- externalMembersNotInIdpInternalIds: Semicolon-separated internal IDs of group members that have an external ID but are not members in the IdP group.
- errorCategories: Semicolon-separated error category names (see below).
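The two member columns can be understood as a set difference between the Databricks group and the IdP group. The sketch below shows one way such values could be derived. It is an illustration under stated assumptions, not the script's logic: the `Member` type and `split_local_members` are hypothetical, and matching members purely by external ID is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Member:
    internal_id: str            # Databricks internal ID
    external_id: Optional[str]  # None or "" when the member exists only in Databricks


def split_local_members(databricks_members, idp_member_external_ids):
    """Return (localMembersNotInIdpInternalIds, externalMembersNotInIdpInternalIds)."""
    idp_ids = set(idp_member_external_ids)
    # Members with no external ID cannot correspond to any IdP membership.
    without_external_id = [
        m.internal_id for m in databricks_members if not m.external_id
    ]
    # Members with an external ID that the IdP group does not list.
    with_external_id = [
        m.internal_id
        for m in databricks_members
        if m.external_id and m.external_id not in idp_ids
    ]
    return ";".join(without_external_id), ";".join(with_external_id)
```

A non-empty first value would correspond to GROUP_HAS_LOCAL_MEMBERS_WITHOUT_EXTERNAL_ID and a non-empty second value to GROUP_HAS_LOCAL_MEMBERS_WITH_EXTERNAL_ID.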
Service principals (idp_divergence_service_principals.csv) output columns
- id: The Databricks internal ID.
- applicationId: The application ID of the provisioned Databricks service principal.
- externalId: The external ID stored in Databricks on the provisioned Databricks service principal.
- externalIdWithAppIdMatch: External ID of the IdP service principal that is matched by application ID.
- errorCategories: Semicolon-separated error category names (see below).
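Since errorCategories is semicolon-separated in every report, a quick way to triage a run is to count how often each category occurs. The sketch below assumes the CSV files start with a header row that uses the column names listed above, which this page does not confirm. The sample rows are made up for illustration, and `count_error_categories` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_error_categories(csv_text):
    """Count occurrences of each error category across all rows of one report."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for category in row["errorCategories"].split(";"):
            category = category.strip()
            if category:
                counts[category] += 1
    return counts


# Made-up sample in the shape of idp_divergence_users.csv (header row assumed).
sample = (
    "id,username,externalId,externalIdWithUsernameMatch,errorCategories\n"
    "101,a@example.com,x1,,EXTERNAL_ID_NOT_IN_IDP\n"
    "102,b@example.com,x2,y2,EXTERNAL_ID_NOT_IN_IDP;NAME_MATCH_EXTERNAL_ID_MISMATCH\n"
)
for name, count in count_error_categories(sample).most_common():
    print(name, count)
```

To run it against a real report, read the file contents first (for example with `Path(...).read_text()`) and pass the text to the function.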
| Error category | Description | Action to take | Potential issues if unresolved |
|---|---|---|---|
| | The workspace does not have identity federation enabled. | Enable identity federation for the workspace from the account console. | Account-level identities and IdP identities will not be available in workspaces with identity federation disabled. These workspaces will still work as before, but without the capabilities of AIM. |
| EXTERNAL_ID_NOT_IN_IDP | The provisioned identity has an external ID set, but it does not match any identity of the same type in the IdP. | The externalId on the identity is misconfigured and should be updated to a valid externalId or removed altogether. If you update it to a new externalId, make sure there are no other identities that use it. To determine which externalId to update to, see the | If the externalId is supposed to be linked to an IdP identity, you may see duplicate identities (one with an incorrect externalId and one from the IdP). |
| EXTERNAL_ID_MATCH_NAME_MISMATCH | The Databricks identity has an external ID that maps to an identity with a different unique name in the IdP. | For users and service principals, the username on Databricks needs to be updated. File a support ticket to do so. For groups, check your account to see if any account groups are reserving the group name (Databricks enforces unique group names). If so, consider renaming the account group to a different group name so the external group can claim the name. | For users, logging in frequently creates a second user with the same externalId but a different username. For groups, the external group often cannot sync its name with its IdP counterpart. |
| NAME_MATCH_EXTERNAL_ID_MISMATCH | The Databricks identity has a unique name match with an IdP identity that does not match its externalId. | In most cases, the solution is to update the Databricks externalId to match the IdP identity. Double-check whether this is the correct solution, because it can vary based on your IdP and local data. See the | If the externalId is supposed to be linked to an IdP identity, you may see duplicate identities (one with an incorrect or no externalId and one from the IdP). Provisioning an IdP group with the same name may fail because an account group is already using the name (Databricks enforces unique group names). |
| GROUP_HAS_LOCAL_MEMBERS_WITHOUT_EXTERNAL_ID | The Databricks group has members without an externalId. These members do not have a corresponding membership in the IdP. | To truly let the IdP be the source of truth, it is recommended to remove any locally added members from the group via SCIM. If the member should be a part of the group, it is recommended to create the member in the IdP and add it to the IdP group. See the | Members will inherit permissions from the IdP group. That said, these members will not appear in the IdP, which can make auditing permissions difficult. Member counts in the UI only reflect IdP member counts, which won’t reflect these members. |
| GROUP_HAS_LOCAL_MEMBERS_WITH_EXTERNAL_ID | The Databricks group has members with an externalId. These members do not have a corresponding membership in the IdP. | To truly let the IdP be the source of truth, it is recommended to remove any locally added members from the group via SCIM. If the member should be a part of the group, it is recommended to add it to the IdP group. See the | Members will inherit permissions from the IdP group. That said, these members will not appear in the IdP, which can make auditing permissions difficult. Member counts in the UI only reflect IdP member counts, which won’t reflect these members. |
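The three identity-level categories can be read as simple comparisons between a Databricks identity and the IdP. The sketch below encodes the descriptions from the table for illustration only. The real script compares identities through an endpoint, and the function name, the dictionary shapes, and the exact, case-sensitive name comparison are all assumptions.

```python
def classify(identity, idp_by_external_id, idp_by_name):
    """Return the error categories that apply to one Databricks identity.

    identity: dict with "name" and an optional "externalId".
    idp_by_external_id / idp_by_name: hypothetical lookups of IdP identities of the
    same type, each a dict with "name" and "externalId".
    """
    categories = []
    external_id = identity.get("externalId")
    name = identity["name"]

    if external_id:
        idp_match = idp_by_external_id.get(external_id)
        if idp_match is None:
            # External ID is set but no IdP identity of this type has it.
            categories.append("EXTERNAL_ID_NOT_IN_IDP")
        elif idp_match["name"] != name:
            # External ID resolves to an IdP identity with a different unique name.
            categories.append("EXTERNAL_ID_MATCH_NAME_MISMATCH")

    name_match = idp_by_name.get(name)
    if name_match is not None and name_match["externalId"] != external_id:
        # Same unique name in the IdP, but the external IDs disagree (or are missing).
        categories.append("NAME_MATCH_EXTERNAL_ID_MISMATCH")

    return categories
```

An identity can fall into more than one category at once, which is why errorCategories in the reports is a semicolon-separated list.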
Update externalId for a principal
To update the externalId, use Account SCIM to perform the operation. It is recommended to log every API call so that changes are easier to roll back if any issues come up during the process.
PATCH https://<accountUrl>/api/2.1/accounts/<accountId>/scim/v2/<Users|Groups|ServicePrincipals>/<databricksId>
{
  "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
  "Operations": [
    {
      "op": "replace",
      "path": "externalId",
      "value": "<newExternalId>"
    }
  ]
}
Remove externalId for a group
To remove the externalId for a group, use Account SCIM to perform the operation. It is recommended to log every API call so that changes are easier to roll back if any issues come up during the process.
Note
At the moment, Databricks only supports this operation for groups:
PATCH https://<accountUrl>/api/2.1/accounts/<accountId>/scim/v2/Groups/<databricksId>
{
  "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
  "Operations": [
    {
      "op": "replace",
      "path": "externalId",
      "value": ""
    }
  ]
}
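Both PATCH requests above can be prepared from Python with the standard library. The sketch below only builds and logs the request, following the recommendation to log API calls, and leaves sending it to the caller. The function name, the bearer-token header, and the `application/scim+json` content type are assumptions to verify against your environment.

```python
import json
import logging
import urllib.request

logger = logging.getLogger("divergence.scim")


def build_external_id_patch(account_url, account_id, resource, databricks_id,
                            new_external_id, token):
    """Build a SCIM PATCH request that replaces (or, for groups, clears) externalId."""
    if resource not in ("Users", "Groups", "ServicePrincipals"):
        raise ValueError(f"unsupported resource type: {resource!r}")
    # Per the note above, clearing externalId is only supported for groups.
    if new_external_id == "" and resource != "Groups":
        raise ValueError("externalId can only be removed for Groups")

    url = (f"https://{account_url}/api/2.1/accounts/{account_id}"
           f"/scim/v2/{resource}/{databricks_id}")
    body = {
        "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
        "Operations": [
            {"op": "replace", "path": "externalId", "value": new_external_id}
        ],
    }
    # Log the call before it is sent so it can be reviewed or rolled back later.
    logger.info("PATCH %s %s", url, json.dumps(body))
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",          # assumed auth scheme
            "Content-Type": "application/scim+json",     # assumed content type
        },
    )
```

The returned object can be passed to `urllib.request.urlopen` once the logged URL and body have been reviewed. Recording the previous externalId alongside each call makes a rollback a matter of replaying the same request with the old value.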