Problem
While using Delta Lake on AWS S3 buckets with versioning enabled, you notice slower S3 API responses and increased storage costs.
Cause
When Delta Lake runs VACUUM to remove obsolete files, S3 does not permanently delete them if versioning is enabled. Instead, S3 adds a delete marker and retains each removed file as a noncurrent version. Over time these noncurrent versions accumulate, which increases storage costs and can slow S3 API responses.
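To gauge how much noncurrent data has built up, the JSON returned by `aws s3api list-object-versions --bucket <your-bucket> --output json` can be tallied. The following is a minimal sketch rather than a Databricks tool: `summarize_versions` is a hypothetical helper, and the sample data is invented for illustration. It assumes the documented response shape, with a `Versions` list (each entry carrying `IsLatest` and `Size`) and a `DeleteMarkers` list.

```python
def summarize_versions(response):
    """Tally current vs. noncurrent object versions in a list-object-versions response."""
    summary = {
        "current_count": 0,
        "current_bytes": 0,
        "noncurrent_count": 0,
        "noncurrent_bytes": 0,
        # Delete markers have no size; S3 adds one each time an object is deleted.
        "delete_markers": len(response.get("DeleteMarkers", [])),
    }
    for version in response.get("Versions", []):
        kind = "current" if version["IsLatest"] else "noncurrent"
        summary[f"{kind}_count"] += 1
        summary[f"{kind}_bytes"] += version.get("Size", 0)
    return summary


# Invented sample: two files removed by VACUUM, one file still live.
sample = {
    "Versions": [
        {"Key": "table/part-0000.parquet", "VersionId": "v2", "IsLatest": False, "Size": 1048576},
        {"Key": "table/part-0001.parquet", "VersionId": "v1", "IsLatest": True, "Size": 2097152},
        {"Key": "table/part-0002.parquet", "VersionId": "v3", "IsLatest": False, "Size": 524288},
    ],
    "DeleteMarkers": [
        {"Key": "table/part-0000.parquet", "VersionId": "d1", "IsLatest": True},
        {"Key": "table/part-0002.parquet", "VersionId": "d2", "IsLatest": True},
    ],
}

print(summarize_versions(sample))
```

A noncurrent byte count that is large compared with the current byte count suggests that most of the bucket's storage is held by files VACUUM has already removed from the table.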
Solution
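Databricks recommends disabling S3 bucket versioning. Once versioning has been enabled on a bucket it cannot be switched off entirely, only suspended, so use the put-bucket-versioning command with Status=Suspended. Suspending versioning stops S3 from creating new versions but leaves existing noncurrent versions in place.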
aws s3api put-bucket-versioning \
    --profile <your-profile> \
    --bucket <your-bucket> \
    --versioning-configuration Status=Suspended \
    --endpoint-url https://<your-endpointurl>.com
If you need to keep versioning, implement a lifecycle management policy specifying a short period, such as seven days or less, to retain noncurrent object versions. Databricks recommends retaining no more than three versions of an object.
Example command and JSON to implement a lifecycle management policy
aws s3api put-bucket-lifecycle-configuration \
    --bucket <your-bucket-name> \
    --lifecycle-configuration '{
        "Rules": [
            {
                "ID": "LimitNumberOfVersions",
                "Status": "Enabled",
                "Filter": {
                    "Prefix": ""
                },
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 7,
                    "NewerNoncurrentVersions": 3
                }
            }
        ]
    }'
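The two settings under NoncurrentVersionExpiration work together: S3 keeps the newest NewerNoncurrentVersions noncurrent versions of each key and expires any others once they have been noncurrent for at least NoncurrentDays. The sketch below models that rule for a single key. It is a simplified illustration, not S3's implementation: `versions_to_expire` and the sample history are hypothetical, and the rounding and scheduling S3 applies when it runs lifecycle actions are ignored.

```python
from datetime import datetime, timedelta, timezone


def versions_to_expire(noncurrent_versions, now, noncurrent_days=7, newer_noncurrent_versions=3):
    """Return the version IDs the rule would expire for one object key.

    noncurrent_versions: list of (version_id, became_noncurrent_at) tuples.
    """
    # Order from most recently noncurrent to oldest.
    ordered = sorted(noncurrent_versions, key=lambda v: v[1], reverse=True)
    cutoff = now - timedelta(days=noncurrent_days)
    expired = []
    for index, (version_id, became_noncurrent_at) in enumerate(ordered):
        beyond_retained = index >= newer_noncurrent_versions  # not among the newest N
        old_enough = became_noncurrent_at <= cutoff            # noncurrent for at least N days
        if beyond_retained and old_enough:
            expired.append(version_id)
    return expired


# Invented history: five noncurrent versions of one key.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
history = [
    ("v1", now - timedelta(days=30)),
    ("v2", now - timedelta(days=20)),
    ("v3", now - timedelta(days=10)),
    ("v4", now - timedelta(days=3)),
    ("v5", now - timedelta(days=1)),
]

print(versions_to_expire(history, now))  # → ['v2', 'v1']
```

In this model a key with three or fewer noncurrent versions keeps all of them regardless of age. If that is not the intent, the retained count can be lowered, or NewerNoncurrentVersions can be left out so that only NoncurrentDays applies.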