Problem
Your Apache Spark jobs are failing and you cannot identify an obvious cause for the failures.
When you enable Compute Log Delivery, or review the failed stage tasks in the Spark UI, you see a No space left on device error message in one or more of the executors.
dd/mm/yy hh:mm:ss ERROR Executor: Exception in task x.x in stage x.x (TID x)
java.lang.RuntimeException: Error writing to file "/local_disk0/…": No space left on device.
Cause
One or more of the executors does not have enough local disk space.
This can occur if you select a GCP VM instance type that does not have a local SSD attached. Without local storage, there is no disk space available for operations that spill to disk, such as shuffles.
Solution
Switch the executor instance type to a GCP instance type that includes local storage.
If changing the VM instance type is not an option, you can set gcp_attributes.boot_disk_size through the Databricks REST API for machines without a local storage disk. This increases the boot disk size, which helps alleviate the problem.
The following is an example POST request body. Fill in your GCP details before sending.
{
  "cluster_name": "<name>",
  "spark_version": "<version>",
  "node_type_id": "<instance-name>",
  "num_workers": 0,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*, 4]"
  },
  "custom_tags": {
    "ResourceClass": "SingleNode"
  },
  "gcp_attributes": {
    "boot_disk_size": <size-in-gb>
  }
}
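As a usage sketch, you can submit this body to the Databricks Clusters API (POST /api/2.0/clusters/create) with curl. This example assumes the request body is saved to a file named create-cluster.json; <databricks-instance> and <personal-access-token> are placeholders for your workspace URL and access token.

curl --request POST \
  "https://<databricks-instance>/api/2.0/clusters/create" \
  --header "Authorization: Bearer <personal-access-token>" \
  --header "Content-Type: application/json" \
  --data @create-cluster.json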
The amount of disk to allocate depends on the workload, the amount of shuffle, and the join conditions. A good starting point is double the amount of input data; for example, a job that reads 80 GB of input should start with a boot disk of roughly 160 GB. Databricks recommends a minimum of 100 GB.