Cluster terminates automatically even if operation or command is still executing

Consider using Databricks Jobs for long-running operations.

Written by raahat.varma

Last published at: September 12th, 2024

Problem

While executing a long-running operation or command in a notebook on an interactive cluster, you notice that the cluster terminates automatically. This disrupts your workflows, leaves processes incomplete, and forces you to restart the operation.

Cause

The Workspace File System (WSFS) token for interactive sessions has a 36-hour timeout. Long-running operations started from an interactive session are subject to this limit.

Solution

Consider using Databricks Jobs for long-running operations. 

Databricks Jobs have a 30-day timeout, which is more suitable for extensive calculations. To create and run a Databricks Job:

  1. Navigate to the Databricks workspace and select the Jobs tab.
  2. Click on 'Create Job' and configure the job settings, including the notebook to run and the cluster to use.
  3. Set the schedule and timeout settings to accommodate the long-running calculation.
  4. Save and run the job.
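The same steps can also be scripted rather than done through the UI. The following is a minimal sketch, assuming the Jobs REST API 2.1 (`/api/2.1/jobs/create`) and a personal access token. The job name, notebook path, cluster ID, and 72-hour timeout are hypothetical placeholders, and the helper functions `build_job_settings` and `create_job` are illustrative names, not part of any Databricks library.

```python
import json
import urllib.request


def build_job_settings(name, notebook_path, cluster_id, timeout_hours):
    """Build a request body for a job with a single notebook task.

    All argument values are supplied by the caller; nothing here is
    specific to a real workspace.
    """
    return {
        "name": name,
        # Job-level timeout in seconds; the run is stopped once it is exceeded.
        "timeout_seconds": int(timeout_hours * 3600),
        "tasks": [
            {
                "task_key": "long_running_notebook",  # hypothetical task name
                "notebook_task": {"notebook_path": notebook_path},
                # Assumes an existing cluster is reused for the task.
                "existing_cluster_id": cluster_id,
            }
        ],
    }


def create_job(host, token, settings):
    """POST the settings to the workspace and return the new job's ID.

    `host` is the workspace URL (for example "https://<workspace-url>")
    and `token` is a personal access token.
    """
    request = urllib.request.Request(
        f"{host}/api/2.1/jobs/create",
        data=json.dumps(settings).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["job_id"]


# Example payload for a calculation expected to run up to 72 hours.
settings = build_job_settings(
    name="long-running-calculation",
    notebook_path="/Users/someone@example.com/long_calculation",
    cluster_id="0000-000000-example",
    timeout_hours=72,
)
print(json.dumps(settings, indent=2))
```

Calling `create_job(host, token, settings)` with your own workspace URL and token would submit the job. Building the payload separately from the network call makes it easy to review the timeout and task settings before anything is created.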

For detailed instructions, refer to the Create and run Databricks Jobs (AWS | Azure | GCP) documentation.

Note

Additionally, we recommend coordinating with the engineering team to ensure the workspace token limitation is properly documented and to explore any potential configuration changes that could extend the token's validity period for interactive sessions.