Create table in overwrite mode fails when interrupted

Learn how to troubleshoot failures that occur when you cancel a running Apache Spark write operation and then rerun it.

Written by Adam Pavlacka

Last published at: May 10th, 2022

Problem

When you cancel a running Apache Spark write operation and then attempt to rerun it, the following error occurs:

Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.`testtable`').
The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable') already exists.;

Cause

This problem is due to a change in the default behavior of Spark in version 2.4.

This problem can occur if:

  • The cluster is terminated while a write operation is in progress.
  • A temporary network issue occurs.
  • The job is interrupted.

Once the metastore data for a particular table is corrupted, it is hard to recover except by dropping the files in that location manually. The root cause is that a metadata directory called _STARTED is not deleted automatically when Databricks tries to overwrite it.
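If you want to confirm this failure mode, you can list the table location from the error message and look for leftover metadata such as the _STARTED directory. A minimal sketch, using the warehouse path shown above (substitute the path from your own error message):

%scala

// List the contents of the leftover table location and print each path;
// a lingering _STARTED entry indicates the corrupted state described above.
dbutils.fs.ls("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable").foreach(f => println(f.path))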

You can reproduce the problem by following these steps:

  1. Create a DataFrame:
    val df = spark.range(1000)
  2. Write the DataFrame to a location in overwrite mode (a combined, runnable sketch follows this list):
    import org.apache.spark.sql.SaveMode
    df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")
  3. Cancel the command while it is executing.
  4. Re-run the write command.
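For convenience, steps 1 and 2 can be run from a single notebook cell. A minimal sketch; it assumes you are free to create the testdb database if it does not already exist:

%scala

import org.apache.spark.sql.SaveMode

// Assumption: create the target database if it is not already present.
spark.sql("CREATE DATABASE IF NOT EXISTS testdb")

// Step 1: create a DataFrame.
val df = spark.range(1000)

// Step 2: write it as a managed table in overwrite mode. Cancel this cell
// while it is running, then run it again to reproduce the AnalysisException
// shown above.
df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")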

Solution

Set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true. With this flag enabled, Spark deletes the leftover _STARTED directory and returns the process to its original state. For example, you can set it in the notebook:

%python

spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
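You can read the flag back to confirm the setting is active before retrying the write:

%scala

// Verify that the flag is enabled on the current cluster.
println(spark.conf.get("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation"))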

Or you can set it in the cluster-level Spark config (AWS | Azure | GCP); a cluster-level setting takes effect after the cluster is restarted:

spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true

Another option is to manually clean up the data directory specified in the error message. You can do this with dbutils.fs.rm, passing true as the second argument to delete the directory recursively.

%scala

dbutils.fs.rm("<path-to-directory>", true)
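For example, using the location from the error message at the top of this article (substitute the path from your own error message):

%scala

// Recursively delete the leftover table directory, then re-run the write.
dbutils.fs.rm("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable", true)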