Problem
You have a streaming job that runs continuously, processing batches of data as they arrive. However, the interval between batches varies widely, from one minute to two hours, so the cluster sits idle for extended periods throughout the day. The result is inefficient resource usage and increased costs.
Cause
You have not configured your streaming job to run only when new batches of data are available.
Solution
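Configure your streaming job with the File arrival trigger type, and set .trigger(availableNow=True) on the streaming query itself. With this setting, each run processes all data available at start time and then stops, so the cluster does not stay up waiting for the next batch.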
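The streaming-level setting can be sketched as follows. This is a minimal illustration, not the definitive implementation: the function name, its parameters, and the choice of Auto Loader (the cloudFiles source) as the reader are assumptions made for the example. The caller supplies the Spark session, which on Databricks is the notebook's spark object.

```python
def run_available_now(spark, source_path, checkpoint_path, target_table,
                      file_format="json"):
    """Process every file that arrived since the last run, then stop.

    All names here are hypothetical; `spark` is an existing SparkSession.
    """
    stream = (
        spark.readStream
        .format("cloudFiles")                       # Auto Loader (assumed source)
        .option("cloudFiles.format", file_format)
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
    )
    query = (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)  # tracks processed files
        .trigger(availableNow=True)                 # drain the backlog, then stop
        .toTable(target_table)
    )
    # Block until the backlog is processed so the job run can finish
    # and the cluster can be released.
    query.awaitTermination()
    return query
```

Because the checkpoint records which files were already processed, each triggered run picks up only what arrived since the previous one.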
File arrival triggers make a best effort to check for new files every minute, although this can be affected by the performance of the underlying cloud storage. File arrival triggers do not incur additional costs other than cloud provider costs associated with listing files in the storage location.
To use file arrival triggers you must:
- Ensure your workspace has Unity Catalog enabled
- Use a storage location that’s either a Unity Catalog volume or an external location added to the Unity Catalog metastore
- Have READ permissions to the storage location and CAN MANAGE permissions on the job
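If you define jobs through the Jobs API rather than the UI, the file arrival trigger can be expressed as part of the job settings. The sketch below builds that fragment as a Python dictionary. The helper name and the example volume path are hypothetical, and the field names reflect the Jobs API trigger settings as best understood here, so verify them against the API reference for your workspace before relying on them.

```python
import json


def file_arrival_trigger(url, min_time_between_triggers_seconds=None,
                         wait_after_last_change_seconds=None):
    """Build a job-settings fragment for a file arrival trigger.

    `url` should point at a Unity Catalog volume or external location.
    The two optional settings are only included when provided.
    """
    file_arrival = {"url": url}
    if min_time_between_triggers_seconds is not None:
        file_arrival["min_time_between_triggers_seconds"] = (
            min_time_between_triggers_seconds
        )
    if wait_after_last_change_seconds is not None:
        file_arrival["wait_after_last_change_seconds"] = (
            wait_after_last_change_seconds
        )
    return {"trigger": {"pause_status": "UNPAUSED", "file_arrival": file_arrival}}


if __name__ == "__main__":
    # Hypothetical volume path, for illustration only.
    print(json.dumps(file_arrival_trigger("/Volumes/main/landing/events/"), indent=2))
```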
If your workspace is not onboarded into Unity Catalog, you can still avoid running in continuous mode by scheduling your jobs to run every 15-20 minutes and reading all available files.
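For the scheduled alternative, a periodic schedule can be written as a Quartz cron expression in the job settings. The following is a rough sketch under the assumption that the job is defined through the Jobs API. The helper name and its defaults are hypothetical, and the streaming query inside the job would still use .trigger(availableNow=True) so that each run reads all available files and exits.

```python
def periodic_schedule(every_minutes=15, timezone_id="UTC"):
    """Build a job-settings fragment that runs the job every N minutes.

    Only intervals that divide 60 evenly are accepted, so runs are evenly
    spaced within each hour (for example 15 or 20).
    """
    if not 1 <= every_minutes <= 59 or 60 % every_minutes != 0:
        raise ValueError("every_minutes must divide 60 evenly and be below 60")
    return {
        "schedule": {
            # Quartz fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": f"0 0/{every_minutes} * * * ?",
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        }
    }
```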
For more information, please refer to the Trigger jobs when new files arrive (AWS | Azure | GCP) documentation.