How to calculate the Databricks file system (DBFS) S3 API call cost

Learn how to calculate the Databricks file system (DBFS) S3 API call cost.

Written by Adam Pavlacka

Last published at: March 8th, 2022

The cost of a DBFS S3 bucket is driven primarily by the number of API calls and secondarily by the cost of storage. You can use AWS CloudTrail logs to create a table, count the number of API calls, and calculate the cost of those API requests.
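
For example, at the rates used below, 1,000,000 Put calls cost 1,000,000 × 0.005/1000 = $5.00, while the same number of Get calls cost only 1,000,000 × 0.0004/1000 = $0.40.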

  1. Obtain the following information. You may need to contact your AWS Administrator to get it.
    • API call cost for List, Put, Copy, and Post requests (the example query uses $0.005 per 1,000 calls, written as 0.005/1000)
    • API call cost for Head, Get, and Select requests (the example uses $0.0004 per 1,000 calls, written as 0.0004/1000)
    • Account ID for the Databricks control plane account (in the example, 414351767826)
  2. Copy the CloudTrail logs to an S3 bucket and use the following Apache Spark code to read the logs and register them as a temporary view:
    %python
    
    # Read the copied CloudTrail JSON logs (the wildcards match CloudTrail's
    # default AWSLogs/<account-id>/CloudTrail/<region>/<year>/<month>/<day>/
    # prefix structure) and register them as a temporary view.
    spark.read.json("s3://dbc-root-cloudwatch/*/*/*/*/*/*/*").createOrReplaceTempView("f_cloudwatch")
  3. Substitute the account ID and the API call costs into the following query. The query takes the CloudTrail results collected during a specific time interval, counts the number of API calls made from the Databricks control plane account, and calculates the cost. Uncomment the where clause to restrict the results to the control plane account. A PySpark version of the same query appears after these steps.
    %sql
    
    select
      Records.userIdentity.accountId,
      Records.eventName,
      count(*) as api_calls,
      (case
         when Records.eventName like 'List%'
           or Records.eventName like 'Put%'
           or Records.eventName like 'Copy%'
           or Records.eventName like 'Post%' then 0.005 / 1000
         when Records.eventName like 'Head%'
           or Records.eventName like 'Get%'
           or Records.eventName like 'Select%' then 0.0004 / 1000
         else 0
       end) * count(*) as api_cost
    from
      (select explode(Records) as Records
       from f_cloudwatch
       where Records is not null)
    -- Uncomment to restrict the results to the Databricks control plane account.
    -- where Records.userIdentity.accountId = '414351767826'
    group by 1, 2
    order by 4 desc
    limit 10;
  4. Run the query. The resulting table shows the number of API calls and the cost of those calls for each account ID and event name.
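
If you prefer the DataFrame API, the following is a minimal PySpark sketch of the same aggregation. It assumes the f_cloudwatch temporary view from step 2 and reuses the example prices from step 1; substitute your own AWS rates.

%python

from pyspark.sql import functions as F

# Example per-call prices from step 1; substitute your AWS rates.
write_price = 0.005 / 1000  # List, Put, Copy, Post
read_price = 0.0004 / 1000  # Head, Get, Select

# Explode the CloudTrail records into one row per API call.
events = (
    spark.table("f_cloudwatch")
    .where(F.col("Records").isNotNull())
    .select(F.explode("Records").alias("r"))
)

# Count calls per account and event name, then price each group.
costs = (
    events
    .groupBy(
        F.col("r.userIdentity.accountId").alias("accountId"),
        F.col("r.eventName").alias("eventName"),
    )
    .agg(F.count("*").alias("api_calls"))
    .withColumn(
        "api_cost",
        F.when(F.col("eventName").rlike("^(List|Put|Copy|Post)"), write_price)
         .when(F.col("eventName").rlike("^(Head|Get|Select)"), read_price)
         .otherwise(0.0) * F.col("api_calls"),
    )
    .orderBy(F.col("api_cost").desc())
    .limit(10)
)

display(costs)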

Additional API costs are often due to checkpoint directories for streaming jobs. Databricks recommends deleting old checkpoint directories if they are no longer referenced.
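
If you find an old checkpoint directory that is no longer referenced by any running stream, you can delete it with the dbutils file system utilities. A minimal sketch, assuming a hypothetical checkpoint path; confirm that no active stream uses the directory before deleting it.

%python

# Hypothetical path; replace it with the checkpoint location of the
# decommissioned streaming job.
old_checkpoint = "dbfs:/checkpoints/old-streaming-job"

# Recursively delete the checkpoint directory. The second argument
# enables recursive deletion.
dbutils.fs.rm(old_checkpoint, True)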