Your tasks are running slower than expected.
You review the stage details in the Spark UI on your cluster and see that task deserialization time is high.
When a cluster starts, cluster-installed libraries are installed only on the driver. They are installed on the executors only when the first tasks are submitted, and the time taken to install the PyPI libraries is counted as part of the task deserialization time.
Library installation occurs only on an executor where a task is launched. When another executor receives its first task, the installation process is repeated on that executor. The more libraries you have installed, the more noticeable the delay each time a new executor is launched.
If you are using a large number of PyPI libraries, you should configure your cluster to install the libraries on all the executors when the cluster is started. This slightly increases the cluster launch time, but allows your job tasks to run faster because they no longer wait for libraries to install on the executors after the initial launch.
Add spark.databricks.libraries.enableSparkPyPI false to the cluster’s Spark Config and restart the cluster.
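If you manage cluster definitions as JSON specifications rather than through the UI, the same setting can be merged into the specification before the cluster is created or edited. The following is a minimal sketch, assuming the specification keeps its Spark settings in a spark_conf mapping of string keys to string values. The helper name with_eager_library_install and the example cluster values are hypothetical and used only for illustration.

```python
import copy

# Setting named in this article. Spark config values are passed as strings.
EAGER_INSTALL_KEY = "spark.databricks.libraries.enableSparkPyPI"


def with_eager_library_install(cluster_spec):
    """Return a copy of cluster_spec with the setting added to its spark_conf.

    The input is not modified, so the original definition can still be
    compared against the updated one before anything is applied.
    """
    updated = copy.deepcopy(cluster_spec)
    # Create spark_conf if the spec does not define one yet.
    updated.setdefault("spark_conf", {})[EAGER_INSTALL_KEY] = "false"
    return updated


# Hypothetical cluster definition, for illustration only.
example_spec = {
    "cluster_name": "example-cluster",
    "spark_conf": {"spark.speculation": "true"},
}

updated_spec = with_eager_library_install(example_spec)
print(updated_spec["spark_conf"])
```

The helper only prepares the specification. The cluster still has to be restarted with the updated configuration before the setting takes effect.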