Problem: Unable to Read Files and List Directories in a WASB Filesystem

Problem

When you try reading a file on WASB with Spark, you get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 19, 10.139.64.5, executor 0): shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

When you try listing files in WASB using dbutils.fs.ls or the Hadoop API, you get the following exception:

java.io.FileNotFoundException: File/<some-directory> does not exist.

Cause

The WASB filesystem supports three types of blobs: block, page, and append.

  • Block blobs are optimized for upload of large blocks of data (the default in Hadoop).
  • Page blobs are optimized for random read and write operations.
  • Append blobs are optimized for append operations.

See Understanding block blobs, append blobs, and page blobs for details.

The errors described above occur if you try to read an append blob or list a directory that contains only append blobs. The Databricks and Hadoop Azure WASB implementations do not support reading append blobs. Similarly, append blobs are ignored when listing a directory, so a directory that contains only append blobs appears not to exist.

There is no workaround to enable reading append blobs or listing a directory that contains only append blobs. However, you can use either the Azure CLI or the Azure Storage SDK for Python to determine whether a directory contains append blobs or whether a file is an append blob.

You can verify whether a directory contains append blobs by running the following Azure CLI command:

az storage blob list \
  --auth-mode key \
  --account-name <account-name> \
  --container-name <container-name> \
  --prefix <path>

The result is returned as a JSON document, in which you can easily find the blob type for each file.

If the directory is large, you can limit the number of results with the flag --num-results <num>.
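Scanning the JSON by eye gets tedious for more than a handful of blobs. The following is a minimal sketch that filters the CLI output down to the names of append blobs. It assumes each entry carries its type under properties.blobType, as recent Azure CLI versions emit; the function name append_blob_names and the sample data are made up for illustration.

```python
import json


def append_blob_names(az_output):
    """Return the names of append blobs in `az storage blob list` JSON output.

    Assumption: each entry looks like
    {"name": ..., "properties": {"blobType": ...}}.
    """
    blobs = json.loads(az_output)
    return [
        blob["name"]
        for blob in blobs
        if blob.get("properties", {}).get("blobType") == "AppendBlob"
    ]


# Abbreviated, made-up sample of the CLI output (real entries have more fields).
sample = """
[
  {"name": "logs/app.log", "properties": {"blobType": "AppendBlob"}},
  {"name": "data/part-0000.parquet", "properties": {"blobType": "BlockBlob"}}
]
"""

print(append_blob_names(sample))
```

In practice you would save the CLI output to a file, or pipe it to the script, and pass its contents to the function instead of the inline sample.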

You can also use the Azure Storage SDK for Python to list and explore files in a WASB filesystem:

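from azure.storage.blob import BlockBlobService  # assumes the legacy azure-storage-blob (v2.x) SDK

service = BlockBlobService(account_name="<account-name>", account_key="<account-key>")

# Print the name and type of every append blob in the container.
for blob in service.list_blobs("<container-name>"):
  if blob.properties.blob_type == "AppendBlob":
    print("\t Blob name: %s, %s" % (blob.name, blob.properties.blob_type))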
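Because the listing error appears only when a directory holds nothing but append blobs, it can help to group a blob listing by parent directory. The sketch below works on plain (name, blob_type) pairs, such as those collected from a listing like the one above. The helper append_only_dirs is hypothetical, and it makes a simplification: it looks only at each blob's immediate parent path, not at nested subdirectories.

```python
from collections import defaultdict
import posixpath


def append_only_dirs(blobs):
    """Return the sorted parent directories whose direct children are all append blobs.

    `blobs` is an iterable of (name, blob_type) pairs.
    """
    types_by_dir = defaultdict(set)
    for name, blob_type in blobs:
        # Blob names use "/" as the separator, so posixpath is appropriate.
        types_by_dir[posixpath.dirname(name)].add(blob_type)
    return sorted(d for d, types in types_by_dir.items() if types == {"AppendBlob"})


# Made-up listing for illustration.
listing = [
    ("logs/a.log", "AppendBlob"),
    ("logs/b.log", "AppendBlob"),
    ("data/x.parquet", "BlockBlob"),
    ("data/y.log", "AppendBlob"),
]

print(append_only_dirs(listing))
```

Directories returned by this helper are the likely candidates for the FileNotFoundException shown in the Problem section.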

Databricks does support accessing append blobs using the Hadoop API, but only when appending to a file.

Solution

There is no workaround for this issue.

Use the Azure CLI or the Azure Storage SDK for Python to determine whether the directory contains append blobs or whether the object is an append blob.

To load, read, or convert append blobs, you can implement either a Spark SQL UDF or a custom function using the RDD API that calls the Azure Storage SDK for Python.
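One possible shape for the RDD-based approach is sketched below. It assumes the legacy azure-storage-blob (v2.x) package is installed on every worker and that the blobs contain UTF-8 text. The names make_azure_fetcher and read_partition are hypothetical, and the Spark usage at the end is left as a comment because it needs a live cluster and real credentials.

```python
def make_azure_fetcher(account_name, account_key, container):
    """Build a fetch(name) -> bytes function backed by the legacy SDK.

    Assumption: azure-storage-blob v2.x is available; the import is
    deferred so that it runs on the executor, not at module load time.
    """
    from azure.storage.blob import AppendBlobService

    service = AppendBlobService(account_name=account_name, account_key=account_key)
    return lambda name: service.get_blob_to_bytes(container, name).content


def read_partition(blob_names, fetcher_factory):
    """Yield (name, text) for each blob name in one partition.

    The fetcher is created once per partition rather than on the driver,
    since SDK clients are generally not safe to pickle and ship to workers.
    """
    fetch = fetcher_factory()
    for name in blob_names:
        yield name, fetch(name).decode("utf-8")


# Hypothetical usage on a cluster (not executed here):
# names = ["logs/a.log", "logs/b.log"]
# factory = lambda: make_azure_fetcher("<account-name>", "<account-key>", "<container-name>")
# rdd = sc.parallelize(names).mapPartitions(lambda it: read_partition(it, factory))
# df = rdd.toDF(["name", "content"])
```

Passing the fetcher in as a factory also keeps read_partition testable without network access, since any function that maps a blob name to bytes can stand in for the SDK call.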