Use the HDFS API to read files in Python

Learn how to read files directly by using the HDFS API in Python.

Written by arjun.kaimaparambilrajan

Last published at: May 19th, 2022

There may be times when you want to read files directly without using third-party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts.

AWS

Use the following example code for S3 bucket storage.

%python

# Access the Hadoop Path class through the JVM gateway.
Path = sc._jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

# Resolve the file system for the path and open the file as an input stream.
path = Path('s3a://<bucket-name>/<file-path>')
fs = path.getFileSystem(conf)
istream = fs.open(path)

# Wrap the stream so it can be read line by line.
reader = sc._jvm.java.io.BufferedReader(sc._jvm.java.io.InputStreamReader(istream))

# readLine() returns None at the end of the file.
while True:
  thisLine = reader.readLine()
  if thisLine is not None:
    print(thisLine)
  else:
    break

# Closing the reader also closes the underlying stream.
reader.close()

where

  • <bucket-name> is the name of the S3 bucket.
  • <file-path> is the full path to the file.

Azure

Use the following example code for Azure Blob storage.

%python

# Access the Hadoop Path class through the JVM gateway.
Path = sc._jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

# Set the storage account access key in the Hadoop configuration.
conf.set(
  "fs.azure.account.key.<account-name>.blob.core.windows.net",
  "<account-access-key>")

# Resolve the file system for the path and open the file as an input stream.
path = Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>')
fs = path.getFileSystem(conf)
istream = fs.open(path)

# Wrap the stream so it can be read line by line.
reader = sc._jvm.java.io.BufferedReader(sc._jvm.java.io.InputStreamReader(istream))

# readLine() returns None at the end of the file.
while True:
  thisLine = reader.readLine()
  if thisLine is not None:
    print(thisLine)
  else:
    break

# Closing the reader also closes the underlying stream.
reader.close()

where

  • <account-name> is your Azure account name.
  • <container-name> is the container name.
  • <file-path> is the full path to the file.
  • <account-access-key> is the account access key.
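
To make the structure of the placeholders explicit, the following sketch assembles the wasbs:// URI and the account key property name from their parts. The helper names wasbs_uri and account_key_property and the sample values are hypothetical; this only illustrates the string format used above and is not an Azure or Hadoop API.

```python
def account_key_property(account_name):
    """Return the Hadoop property name that holds the storage account access key."""
    return f"fs.azure.account.key.{account_name}.blob.core.windows.net"


def wasbs_uri(container_name, account_name, file_path):
    """Build a wasbs:// URI in the format used in the example above."""
    # Strip leading slashes so the URI has exactly one slash before the path.
    return (
        f"wasbs://{container_name}@{account_name}"
        f".blob.core.windows.net/{file_path.lstrip('/')}"
    )


# Sample values are placeholders for illustration only.
prop = account_key_property("myaccount")
uri = wasbs_uri("mycontainer", "myaccount", "/data/sample.txt")
print(prop)
print(uri)
```

The resulting strings could then be passed to conf.set(...) and Path(...) in the example above.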