Introduction
You want to know which users are assigned the account admin role in your Databricks environment.
At smaller scale, the easiest way to check is through the account admin portal: navigate to User Management and select the Only account admins checkbox.
As your user base grows, Databricks offers a REST API to list users programmatically instead. For details, refer to the List users (AWS | Azure) API documentation.
By design, this API call has a soft limit of 10,000 results per page. To retrieve the full user list when there are more than 10,000 results, use the API’s `startIndex` and `count` parameters to paginate through all users, then filter for users with the account admin role.
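The pagination arithmetic can be sketched in isolation before touching the API. The totals below are hypothetical; the snippet only shows how `startIndex` (1-based) and `count` values cover the full list page by page.

```python
# Hypothetical totals to illustrate how startIndex/count pages cover the list.
total_user_count = 4500   # e.g. the API's totalResults value
batch_cnt = 2000          # page size

pages = []
for i in range(0, total_user_count, batch_cnt):
    start_index = i + 1   # SCIM startIndex is 1-based
    count = min(batch_cnt, total_user_count - i)
    pages.append((start_index, count))

print(pages)  # [(1, 2000), (2001, 2000), (4001, 500)]
```

Each tuple is one request; together the pages cover all 4,500 users exactly once.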
Instructions
To begin, Databricks recommends OAuth machine-to-machine (M2M) authentication when using the REST API. For more information, review the Authorize unattended access to Databricks resources with a service principal using OAuth (AWS | Azure) documentation, which explains how to use the `client_id` and `client_secret` to generate an access token.
Next, in the following example code, store `client_id` and `client_secret` as secrets. Also set your workspace URL as the value of the `workspace_url` variable. (If you work in a notebook, the code retrieves the URL dynamically.)
Replace the following variables before running the example code.

- `<oidc-scope>` is the secret scope where you store the client ID and client secret.
- `<oidc-client-id>` is the key of the secret holding your client ID.
- `<oidc-client-secret>` is the key of the secret holding your client secret.
```python
import requests

client_id = dbutils.secrets.get(scope="<oidc-scope>", key="<oidc-client-id>")
client_secret = dbutils.secrets.get(scope="<oidc-scope>", key="<oidc-client-secret>")

# In a notebook, the workspace URL can be read from the Spark configuration.
workspace_url = spark.conf.get("spark.databricks.workspaceUrl")
token_endpoint_url = f"https://{workspace_url}/oidc/v1/token"

if not client_id or not client_secret:
    raise ValueError("client_id or client_secret secret values are not set.")

try:
    response = requests.post(
        token_endpoint_url,
        auth=(client_id, client_secret),
        data={"grant_type": "client_credentials", "scope": "all-apis"},
    )
    response.raise_for_status()
    token_data = response.json()
except requests.exceptions.RequestException as e:
    raise SystemExit(f"Error requesting token: {e}")
except ValueError as e:
    raise SystemExit(f"Error parsing token response: {e}")
```
Once you have authenticated and defined your `workspace_url` variable, get the total user count, set a batch size for pagination, and make paginated API calls to fetch the user results. Store the results in a list.
```python
import requests

# Step 1: Get the total user count
api_url = f"https://{workspace_url}/api/2.0/account/scim/v2/Users?attributes=id"
total_count_response = requests.get(
    api_url,
    headers={"Authorization": "Bearer " + token_data["access_token"]},
    timeout=900,
)

# Print the time it took to retrieve the data
print(f"Took: {total_count_response.elapsed.total_seconds()} seconds")

# Get the total number of users
total_user_count = total_count_response.json()["totalResults"]

# Set the batch size for pagination, default to 2000
batch_cnt = 2000

# Adjust batch size if the total user count is smaller
if total_user_count <= batch_cnt:
    batch_cnt = max(total_user_count, 1)

# Step 2: Initialize an empty list for user details
user_tuple_list = []

# Step 3: Loop through users using pagination logic
for i in range(0, total_user_count, batch_cnt):
    start_index = i + 1  # SCIM startIndex is 1-based
    count = batch_cnt
    print("Processing users from index:", start_index, "count:", count)

    # API call to fetch users with pagination
    url = (
        f"https://{workspace_url}/api/2.0/account/scim/v2/Users"
        f"?attributes=userName,id,active,roles,email"
        f"&startIndex={start_index}&count={count}"
    )
    user_response = requests.get(
        url,
        headers={"Authorization": "Bearer " + token_data["access_token"]},
        timeout=900,
    )

    # Parse response and append user details to the list
    for resource in user_response.json().get("Resources", []):
        roles = [role["value"] for role in resource.get("roles", [])]
        user_tuple_list.append((resource["id"], resource["userName"], roles))
```
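The parsing step in the loop above can be exercised against a sample SCIM-style payload without calling the API. The field names below match the attributes requested; the values are illustrative.

```python
# Illustrative SCIM-style response; the values are made up.
sample_response = {
    "Resources": [
        {"id": "1", "userName": "admin@example.com",
         "roles": [{"value": "account_admin"}]},
        {"id": "2", "userName": "user@example.com", "roles": []},
    ]
}

user_tuple_list = []
for resource in sample_response.get("Resources", []):
    # Users with no roles return no "roles" key, so default to an empty list.
    roles = [role["value"] for role in resource.get("roles", [])]
    user_tuple_list.append((resource["id"], resource["userName"], roles))

print(user_tuple_list)
# [('1', 'admin@example.com', ['account_admin']), ('2', 'user@example.com', [])]
```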
Use Apache Spark to define a schema for your user data and convert the list you created previously into a DataFrame. Then you can display a table of email addresses filtered on the account admin role.
```python
from pyspark.sql.types import StructField, StringType, StructType, ArrayType

# Define schema for user data
user_schema = StructType([
    StructField("id", StringType(), True),
    StructField("userName", StringType(), True),
    StructField("roles", ArrayType(StringType()), True),
])

# Convert the user list into a DataFrame
user_df = spark.createDataFrame(user_tuple_list, user_schema)
user_df.createOrReplaceTempView("account_user_id_mapping")

display(spark.sql(
    "SELECT userName AS email, roles "
    "FROM account_user_id_mapping "
    "WHERE array_contains(roles, 'account_admin')"
))
```
After running the example code, your notebook displays a list of email addresses with the `account_admin` role.
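If you run the script outside a notebook, where `spark` and `display` are unavailable, the same filter can be applied in plain Python. The sample tuples below are hypothetical but use the same `(id, userName, roles)` shape as `user_tuple_list`.

```python
# Hypothetical user tuples in the same shape as user_tuple_list.
user_tuple_list = [
    ("1", "admin@example.com", ["account_admin"]),
    ("2", "user@example.com", []),
]

# Keep only users whose roles include account_admin.
admins = [email for _, email, roles in user_tuple_list if "account_admin" in roles]
print(admins)  # ['admin@example.com']
```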