S3 bucket storage keeps growing due to logs

You should not use the DBFS root for cluster logs. If you must, apply a lifecycle policy scoped to the logs subfolder.

Written by kunal.jadhav

Last published at: April 7th, 2025

Problem

You notice a large number of log files accumulating in your Amazon S3 bucket, consuming significant storage space.


Cause

If you deliver cluster logs to a DBFS root path such as dbfs:/cluster-logs/<clusterId>, be aware that various workspace actions also write to default locations in the same bucket. Enabling cluster log delivery on that bucket path leads to increasing storage usage over time. Since Databricks does not automatically remove these files, it is your responsibility to monitor and manage storage cleanup.
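
To confirm that log files are what is consuming the space, you can total the object sizes under the log prefix. The sketch below uses boto3; the bucket name and prefix are placeholders, so substitute the bucket backing your DBFS root and the path where your cluster logs are delivered.

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_bytes = 0
object_count = 0

# Placeholder bucket and prefix: replace with the bucket backing your DBFS root
# and the path configured for cluster log delivery.
for page in paginator.paginate(Bucket="my-dbfs-root-bucket", Prefix="cluster-logs/"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        object_count += 1

print(f"{object_count} log objects, {total_bytes / 1024**3:.2f} GiB under cluster-logs/")
```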


Info

Databricks strongly advises against storing production data or sensitive information in the DBFS root.


Solution

You should not use the DBFS root for storing cluster logs. Configure a different S3 bucket or volume to store cluster logs. This allows better control over storage management and avoids potential issues with default Databricks directories. For more information, review the Compute log delivery documentation.
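
As an illustration, the following sketch points a new cluster's log delivery at a dedicated S3 bucket through the cluster_log_conf setting of the Clusters API. The workspace URL, token, cluster sizing values, runtime version, and bucket name are placeholders; adapt them to your environment and review the Compute log delivery documentation for the authoritative options.

```python
import requests

# Placeholder workspace URL and personal access token.
HOST = "https://my-workspace.cloud.databricks.com"
TOKEN = "dapiXXXXXXXX"

cluster_spec = {
    "cluster_name": "logs-to-dedicated-bucket",
    "spark_version": "15.4.x-scala2.12",   # placeholder runtime version
    "node_type_id": "i3.xlarge",            # placeholder node type
    "num_workers": 2,
    # Deliver cluster logs to a dedicated bucket instead of the DBFS root.
    "cluster_log_conf": {
        "s3": {
            "destination": "s3://my-dedicated-log-bucket/cluster-logs",
            "region": "us-west-2",
        }
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())
```

Note that delivering logs directly to S3 requires the cluster's instance profile to have write access to the destination bucket.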

Additionally, consider archiving or backing up important logs to a separate storage location for long-term retention.

If you must use the DBFS root for storing cluster logs, you can apply a lifecycle policy. Ensure it is only applied to the /cluster_logs subfolder and not at the bucket level. This prevents the accidental deletion of other essential system directories used by Databricks.
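
For reference, here is a minimal sketch of such a prefix-scoped lifecycle rule using boto3. The bucket name, prefix, and 30-day expiration are illustrative values only, not recommendations.

```python
import boto3

s3 = boto3.client("s3")

# Scope the rule to the cluster logs prefix only, never to the whole bucket,
# so other Databricks system directories are not expired by mistake.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-dbfs-root-bucket",  # placeholder: the bucket backing the DBFS root
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-cluster-logs",
                "Filter": {"Prefix": "cluster_logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # illustrative retention period
            }
        ]
    },
)
```

Keep in mind that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so include any existing rules in the request.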

For more information on the DBFS root, review the Recommendations for working with DBFS root documentation.