Idle clusters causing inefficient resource use and increased costs

Set file arrival triggers.

Written by lucas.rocha

Last published at: September 12th, 2024

Problem

You have a streaming job that runs continuously, processing batches of data as they arrive. However, the time interval between batches can vary significantly, ranging from one minute to two hours, leaving the cluster idle for extended periods throughout the day. You experience inefficient resource usage and increased costs.

Cause

You have not configured your streaming job to run only when new batches of data are available.

Solution

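Set the job's trigger type to File arrival, and use .trigger(availableNow=True) on the streaming query itself. The job then starts only when new files land, and each run processes all available data and stops instead of keeping the cluster running between batches.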
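The streaming side of this setup could look like the following minimal sketch. It assumes a Databricks notebook or job where a `spark` session already exists, PySpark 3.3 or later (where `availableNow` is supported), and Auto Loader (`cloudFiles`) reading JSON files. The function name, file format, paths, and table name are hypothetical placeholders, not values from this article.

```python
def run_available_now(spark, source_path, checkpoint_path, target_table):
    """Process every file currently available in source_path, then stop."""
    stream = (
        spark.readStream
        .format("cloudFiles")                        # Auto Loader (assumed source)
        .option("cloudFiles.format", "json")         # assumed file format
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
    )
    query = (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                  # drain the backlog, then terminate
        .toTable(target_table)
    )
    query.awaitTermination()                         # returns once all available data is processed
    return query
```

Because the checkpoint records which files were already processed, each triggered run picks up only the files that arrived since the previous run.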

File arrival triggers make a best effort to check for new files every minute, although this can be affected by the performance of the underlying cloud storage. File arrival triggers do not incur additional costs other than cloud provider costs associated with listing files in the storage location.

To use file arrival triggers you must:

  • Ensure your workspace has Unity Catalog enabled
  • Use a storage location that’s either a Unity Catalog volume or an external location added to the Unity Catalog metastore
  • Have READ permissions to the storage location and CAN MANAGE permissions on the job
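With those prerequisites met, the trigger can be set in the job UI or in the job settings sent to the Jobs API. The sketch below only builds the `trigger` portion of such a settings payload. The helper name and the volume path are hypothetical, and the optional fields are left out unless you pass them.

```python
import json


def file_arrival_trigger(url, min_time_between_triggers_seconds=None,
                         wait_after_last_change_seconds=None):
    """Build the `trigger` block of a job settings payload (illustrative helper)."""
    file_arrival = {"url": url}
    # Optional tuning fields are included only when explicitly provided.
    if min_time_between_triggers_seconds is not None:
        file_arrival["min_time_between_triggers_seconds"] = min_time_between_triggers_seconds
    if wait_after_last_change_seconds is not None:
        file_arrival["wait_after_last_change_seconds"] = wait_after_last_change_seconds
    return {"trigger": {"pause_status": "UNPAUSED", "file_arrival": file_arrival}}


# Hypothetical Unity Catalog volume path used only for illustration.
settings = file_arrival_trigger("/Volumes/main/landing/events/",
                                min_time_between_triggers_seconds=60)
print(json.dumps(settings, indent=2))
```

The monitored URL must point to a Unity Catalog volume or an external location, as listed in the requirements above.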

If your workspace is not onboarded into Unity Catalog, you can still avoid running in continuous mode by scheduling your jobs to run every 15-20 minutes and reading all available files.
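For this scheduled fallback, the job's schedule can be expressed as a Quartz cron expression. The sketch below builds the `schedule` portion of a job settings payload. The helper name and the UTC default are assumptions for illustration.

```python
def every_n_minutes_schedule(minutes=15, timezone_id="UTC"):
    """Build a `schedule` block that fires every `minutes` minutes (illustrative helper)."""
    if not 1 <= minutes <= 59:
        raise ValueError("minutes must be between 1 and 59")
    return {
        "schedule": {
            # Quartz fields: seconds, minutes, hours, day-of-month, month, day-of-week
            "quartz_cron_expression": f"0 0/{minutes} * * * ?",
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        }
    }


schedule = every_n_minutes_schedule(15)
```

Intervals that divide 60 evenly, such as 15 or 20, produce evenly spaced runs. Other values restart at the top of each hour. Pair the schedule with .trigger(availableNow=True) in the streaming query so that each run reads all available files and then stops.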

For more information, refer to the Trigger jobs when new files arrive (AWS | Azure | GCP) documentation.