Multi-task workflows using incorrect parameter values

If parallel tasks running on the same cluster use Scala companion objects, the wrong parameter values can be used because the tasks share a single class in the JVM.

Written by Rajeev kannan Thangaiah

Last published at: December 5th, 2022

Problem

Using key-value parameters in a multi-task workflow is a common use case. It is normal to have multiple tasks running in parallel, and each task can have a different parameter value for the same key. These key-value parameters are read within the code and used by each task.

For example, assume you have four tasks: task1, task2, task3, and task4 within a workflow job. table-name is the parameter key and the parameter values are employee, department, location, and contacts.

When you run the job, you expect each task to get its own parameters. However, if the application code uses Scala companion objects, you may notice that one task's parameters are applied to all other tasks instead of each task receiving its own. This produces inconsistent results.

Using our example, if the tasks run in parallel and use Scala companion objects, the parameter of any one task (for example, the task4 value contacts) may get passed as the table name to the other three tasks.

Cause

When application code keeps parameter values in a companion object, the companion object holds mutable state that is modified concurrently. Since all tasks run on the same cluster, they run under the same Java virtual machine (JVM), where the class is loaded only once. Every task therefore reads and writes the same single object, so a value written by one task can be picked up by the others.
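The following is a minimal sketch of how this can happen, not the code of any specific job. The names TableLoader, configure, and targetTable are hypothetical and chosen only for illustration. To keep the result deterministic, the two tasks are simulated one after the other instead of on separate threads; in a real parallel run, which value wins depends on timing.

```scala
// Hypothetical job code: all names are illustrative assumptions.
class TableLoader {
  // Reads the table name from the companion object's shared field.
  def targetTable: String = TableLoader.tableName
}

object TableLoader {
  // Mutable state: only one copy exists for the loaded class in the JVM.
  @volatile var tableName: String = ""

  def configure(args: Map[String, String]): TableLoader = {
    // Overwrites whatever value another task stored earlier.
    tableName = args("table-name")
    new TableLoader
  }
}

object SharedStateDemo {
  def run(): Seq[String] = {
    // Both tasks configure themselves before either one reads its table name.
    val task1 = TableLoader.configure(Map("table-name" -> "employee"))
    val task4 = TableLoader.configure(Map("table-name" -> "contacts"))
    Seq(task1.targetTable, task4.targetTable)
  }
}
```

In this sketch, SharedStateDemo.run() returns contacts for both tasks, because task4 overwrote the value that task1 stored.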

Solution

You can mitigate the issue by applying one of these solutions. The best choice depends on your specific use case.

Run the tasks sequentially by adding dependencies between them.
Schedule each task on a different cluster.
Rewrite the code that loads the configuration so you are explicitly creating a new object and not using the companion object's shared state.
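As a rough illustration of the third option, the sketch below passes the parameters through a new immutable object per task. TaskConfig, fromArgs, and IsolatedTableLoader are hypothetical names, and the argument map stands in for however your job receives its parameters. A companion object still appears here, but only as a stateless factory; it stores nothing between calls.

```scala
// Hypothetical names: each task builds and owns its own configuration.
final case class TaskConfig(tableName: String)

object TaskConfig {
  // Pure factory: returns a new instance and keeps no state in the object.
  def fromArgs(args: Map[String, String]): TaskConfig =
    TaskConfig(args("table-name"))
}

class IsolatedTableLoader(config: TaskConfig) {
  // Reads from the instance passed in, not from shared state.
  def targetTable: String = config.tableName
}

object IsolatedStateDemo {
  def run(): Seq[String] = {
    val task1 = new IsolatedTableLoader(TaskConfig.fromArgs(Map("table-name" -> "employee")))
    val task4 = new IsolatedTableLoader(TaskConfig.fromArgs(Map("table-name" -> "contacts")))
    Seq(task1.targetTable, task4.targetTable)
  }
}
```

Here IsolatedStateDemo.run() returns employee for task1 and contacts for task4, since neither task can overwrite the other's configuration.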
