Use the HDFS API to read files in Python

There may be times when you want to read files directly, without using third-party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts.

Use the following example code for S3 bucket storage.

# Get the JVM gateway classes needed to call the Hadoop FileSystem API.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

# Resolve the filesystem for the path and open an input stream to the file.
fs = Path('s3a://<bucket-name>/<file-path>').getFileSystem(conf)
istream = fs.open(Path('s3a://<bucket-name>/<file-path>'))

# Wrap the raw byte stream in a buffered reader so it can be read line by line.
reader = sc._gateway.jvm.java.io.BufferedReader(
  sc._gateway.jvm.java.io.InputStreamReader(istream))

# readLine() returns None at end of file.
while True:
  thisLine = reader.readLine()
  if thisLine is not None:
    print(thisLine)
  else:
    break

# Closing the reader also closes the underlying input stream.
reader.close()

where

  • <bucket-name> is the name of the S3 bucket.
  • <file-path> is the full path to the file.
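
The loop above leaves the reader open if an exception is thrown partway through the file. As a minimal sketch (not part of the original example), the same pattern can be wrapped in a small helper that uses try/finally to guarantee cleanup. The helper name read_hadoop_lines is hypothetical, and the code assumes sc is the SparkContext available in a Databricks notebook.

def read_hadoop_lines(path_str):
  # Hypothetical helper; assumes `sc` is the notebook's SparkContext.
  Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
  conf = sc._jsc.hadoopConfiguration()
  path = Path(path_str)
  fs = path.getFileSystem(conf)
  reader = sc._gateway.jvm.java.io.BufferedReader(
    sc._gateway.jvm.java.io.InputStreamReader(fs.open(path)))
  try:
    line = reader.readLine()
    while line is not None:
      print(line)
      line = reader.readLine()
  finally:
    # Closing the reader also closes the underlying stream, even on error.
    reader.close()

read_hadoop_lines('s3a://<bucket-name>/<file-path>')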

Use the following example code for Azure Blob storage.

# Get the JVM gateway classes needed to call the Hadoop FileSystem API.
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

# Register the storage account access key with the Hadoop configuration.
conf.set(
  "fs.azure.account.key.<account-name>.blob.core.windows.net",
  "<account-access-key>")

# Resolve the filesystem for the path and open an input stream to the file.
fs = Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>').getFileSystem(conf)
istream = fs.open(Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>'))

# Wrap the raw byte stream in a buffered reader so it can be read line by line.
reader = sc._gateway.jvm.java.io.BufferedReader(
  sc._gateway.jvm.java.io.InputStreamReader(istream))

# readLine() returns None at end of file.
while True:
  thisLine = reader.readLine()
  if thisLine is not None:
    print(thisLine)
  else:
    break

# Closing the reader also closes the underlying input stream.
reader.close()

where

  • <account-name> is your Azure account name.
  • <container-name> is the container name.
  • <file-path> is the full path to the file.
  • <account-access-key> is the account access key.
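
If you need to open several files under the same container or bucket, it may be cleaner to resolve the filesystem object once and reuse it for each path. The following sketch (an assumption, not part of the original example) uses the standard Hadoop FileSystem.get(URI, Configuration) entry point; the account access key must already be set in the configuration as shown above, and the resulting istream can be read with the same buffered-reader loop.

URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()

# Resolve the filesystem once from the account URI, then reuse it per path.
fs = FileSystem.get(
  URI('wasbs://<container-name>@<account-name>.blob.core.windows.net/'),
  conf)
istream = fs.open(Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>'))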