When you attempt to rerun an Apache Spark write operation by cancelling the currently running job, the following error occurs:
Error: org.apache.spark.sql.AnalysisException: Cannot create the managed table('`testdb`.`testtable`'). The associated location ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable') already exists.;
This problem is due to a change in the default behavior of Spark in version 2.4. It can occur if:
- The cluster is terminated while a write operation is in progress.
- A temporary network issue occurs.
- The job is interrupted.
Once the metastore data for a particular table is corrupted, it is hard to recover except by dropping the files in that location manually. Basically, the problem is that a metadata directory called _STARTED isn't deleted automatically when Databricks tries to overwrite it.
You can reproduce the problem by following these steps:
Create a DataFrame:
val df = spark.range(1000)
Write the DataFrame to a location in overwrite mode:
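A minimal sketch of this step, assuming the `df` created above and a running Spark session (the `testdb.testtable` name is taken from the error message above):

```scala
import org.apache.spark.sql.SaveMode

// Write the DataFrame as a managed table in overwrite mode.
// Cancelling this command while it runs can leave the _STARTED
// metadata directory behind in the table location.
df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable")
```
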
Cancel the command while it is executing.
To resolve the issue, set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true. This flag deletes the _STARTED directory and returns the process to the original state.
For example, you can set it in the notebook:
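For example, in a notebook cell it can be set on the current session like this:

```scala
// Enable the legacy behavior for the current Spark session so that
// a non-empty table location does not block the overwrite.
spark.conf.set(
  "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
  "true"
)
```
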
Or you can set it as a cluster level Spark configuration:
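As a cluster-level setting, the same key goes into the cluster's Spark config (one `key value` pair per line):

```
spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation true
```
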
Another option is to manually clean up the data directory specified in the error message. You can do this with dbutils.fs.rm.
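A sketch of the manual cleanup in a Databricks notebook, assuming the path from the error message above (replace it with the location in your own error; the second argument requests a recursive delete):

```scala
// Recursively delete the leftover table directory, including the
// _STARTED metadata directory, so the next write can succeed.
dbutils.fs.rm("dbfs:/user/hive/warehouse/testdb.db/metastore_cache_testtable", true)
```
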