You have an Apache Spark job that is triggered correctly but sits idle for a long time before starting, or a job that ran well for a while but goes idle for a long time before resuming.
- Cluster downscales to the minimum number of worker nodes during idle time.
- Driver logs don’t show any Spark jobs during idle time, but do contain repeated metadata-related messages.
- Ganglia shows activity only on the driver node.
- Executor logs show no activity.
- After some time passes, cluster scales up and Spark jobs start or resume.
These symptoms indicate that a large number of file scan operations are happening during this period of the job; the tables being read are then consumed by downstream operations.
You can see the file scan operation details by reviewing the SQL tab in the Spark UI. The queries appear as completed, which makes it seem as though no work is being performed during this idle time.
The driver node is busy because it is listing files and processing metadata (schema and other table information). This work happens only on the driver node, which is why Ganglia shows activity on the driver node alone during this time.
This issue becomes more pronounced if you have a large number of small files.
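If you are on a recent Spark version, one relevant knob is Spark's parallel partition discovery, which distributes file listing to executors once the number of paths crosses a threshold. The values below are a hedged sketch of a `spark-defaults.conf` fragment using the documented defaults; tune them for your workload:

```ini
# Distribute partition/file discovery across executors instead of the driver
# once more than this many paths must be listed (Spark default: 32).
spark.sql.sources.parallelPartitionDiscovery.threshold  32
# Maximum parallelism used by the distributed listing job (Spark default: 10000).
spark.sql.sources.parallelPartitionDiscovery.parallelism  10000
```

Note that distributed listing reduces wall-clock time for the scan, but it does not reduce the total number of list calls against the storage layer.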
You should control the file size and the number of files ingested at the source location by implementing a preprocessing step. You can also break the ingestion down into a number of smaller steps, so that fewer files have to be scanned at once.
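The preprocessing step above can be sketched as a simple file-compaction pass. This is a minimal illustration, not a production tool: the function name, the 128 MB target, and the `*.txt` pattern are assumptions, and raw byte concatenation is only safe for formats such as newline-delimited text or JSON; columnar formats (Parquet, ORC) must be rewritten with a proper writer instead.

```python
from pathlib import Path

def compact_small_files(src_dir, dst_dir, target_bytes=128 * 1024 * 1024):
    """Merge many small files into fewer files of roughly target_bytes each.

    Hypothetical sketch of a compaction pass run before Spark ingestion,
    so the job lists a handful of large files instead of thousands of
    small ones.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    out_index, out_size, out_handle = 0, 0, None
    for small_file in sorted(src.glob("*.txt")):
        # Roll over to a new output file once the size target is reached.
        if out_handle is None or out_size >= target_bytes:
            if out_handle is not None:
                out_handle.close()
            out_index += 1
            out_size = 0
            out_handle = open(dst / f"part-{out_index:05d}.txt", "wb")
        data = small_file.read_bytes()
        out_handle.write(data)
        out_size += len(data)
    if out_handle is not None:
        out_handle.close()
    return out_index  # number of compacted files written
```

Running this against a directory of thousands of kilobyte-sized files before the Spark job starts means the driver only has to list and open the compacted `part-*` files.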
Another option is to migrate your data store to Delta Lake, which uses its transaction log as an index of the underlying files, so queries do not need to list the storage directories.
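As a sketch of that migration (assuming Databricks or open-source Delta Lake with the SQL extensions enabled; the path and partition column below are placeholders), an existing Parquet table can be converted in place:

```sql
-- Convert an existing Parquet directory to Delta Lake in place.
-- The PARTITIONED BY clause is required only for partitioned tables.
CONVERT TO DELTA parquet.`/mnt/data/events`
  PARTITIONED BY (event_date DATE);
```

After conversion, readers resolve the file list from the Delta transaction log instead of scanning the directory tree on the driver.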