Problem
You want to query the system.compute.clusters
and system.billing.usage
tables to gain insights into DBU consumption per cluster and identify all-purpose and job clusters within your workspace.
Cause
You may want these insights for multiple reasons, including old deleted clusters, potential field variations, and SQL query adjustments.
Deleted clusters
Clusters deleted before specific dates may not be included in the query results.
Field variations
Differences in the change_time
and delete_time
fields could sometimes cause cluster information to appear differently than expected.
SQL query adjustments
Queries may need to incorporate recent changes or deletions in cluster data to provide more accurate results.
Solution
These steps can be implemented to achieve more refined results by utilizing multiple system tables.
Enable required system schemas
- While the
system.billing
schema is enabled by default, thesystem.compute
schema needs to be enabled manually. - For more information, please review the Enable system table schemas (AWS | Azure | GCP) documentation.
Refine SQL queries
- Adjust queries to include recent cluster changes and take deleted clusters into account for more accurate results.
- Applying filters and joins using fields like
change_time
anddelete_time
can help provide a comprehensive view of cluster states.
Ensure accurate results from the system.compute.clusters
table
- Ensure your SQL query correctly accounts for the latest changes and deletions in cluster data.
Example code
In this example code block, we are fetching the clusters present in the workspace created via UI and API.
SELECT DISTINCT
c.workspace_id,
c.cluster_id,
c.cluster_name,
MAX(c.create_time) AS first_created,
MAX(c.change_time) AS last_changed,
c.driver_node_type,
c.worker_node_type,
c.min_autoscale_workers,
c.max_autoscale_workers,
c.auto_termination_minutes,
c.dbr_version,
c.cluster_source
FROM
system.compute.clusters c
INNER JOIN (
SELECT
cluster_id,
MAX(change_time) AS max_change_time
FROM
system.compute.clusters
GROUP BY
cluster_id
) m
ON c.cluster_id = m.cluster_id
AND c.change_time = m.max_change_time
WHERE
c.delete_time IS NULL
AND c.cluster_source IN ('UI', 'API')
AND c.workspace_id = <enter-your-workspace-id>
GROUP BY
c.workspace_id,
c.cluster_id,
c.cluster_name,
c.driver_node_type,
c.worker_node_type,
c.min_autoscale_workers,
c.max_autoscale_workers,
c.auto_termination_minutes,
c.dbr_version,
c.cluster_source
ORDER BY
c.workspace_id,
c.cluster_id;
- Retrieve DBU consumption per cluster.
Example code
In this example code block, we are fetching the DBUs consumed per cluster on a monthly basis.
SELECT
date_format(u.usage_date, 'yyyy-MM') AS `Month`,
c.cluster_name AS `Cluster Name`,
c.cluster_id AS `Cluster ID`,
SUM(u.usage_quantity) AS `Total DBUs Consumed`
FROM
system.billing.usage u
INNER JOIN
system.compute.clusters c
ON u.usage_metadata.cluster_id = c.cluster_id
WHERE
c.cluster_name LIKE <cluster-name>
AND c.cluster_id LIKE <cluster-id>
GROUP BY
date_format(u.usage_date, 'yyyy-MM'),
c.cluster_name,
c.cluster_id
ORDER BY
`Month` ASC,
`Cluster Name` ASC;
- Verify that the
delete_time
field is correctly handled in your queries to exclude deleted clusters.
SELECT * FROM system.compute.clusters WHERE delete_time IS NULL;
- Verify the sample queries with known data to ensure they return the expected results. If needed, customize the examples based on your specific workspace and cluster configurations.
For more information regarding the system.compute.clusters
table, please review the Compute system tables reference (AWS | Azure | GCP) documentation.