There may be times when you want to read files directly, without using third-party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts.
AWS
Use the following example code for S3 bucket storage.
%python

URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()

fs = Path('s3a://<bucket-name>/<file-path>').getFileSystem(sc._jsc.hadoopConfiguration())
istream = fs.open(Path('s3a://<bucket-name>/<file-path>'))
reader = sc._gateway.jvm.java.io.BufferedReader(sc._jvm.java.io.InputStreamReader(istream))

while True:
    thisLine = reader.readLine()
    if thisLine is not None:
        print(thisLine)
    else:
        break

istream.close()
where
- <bucket-name> is the name of the S3 bucket.
- <file-path> is the full path to the file.
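If the cluster does not already have IAM-based access to the bucket (for example, through an instance profile), you can supply S3A credentials in the same Hadoop configuration before opening the file. This is a minimal sketch; the access key and secret key values are placeholders you must replace with your own credentials.

%python

# Optional: set S3A credentials in the Hadoop configuration when the cluster
# does not already have IAM-based access to the bucket.
conf = sc._jsc.hadoopConfiguration()
conf.set("fs.s3a.access.key", "<aws-access-key-id>")
conf.set("fs.s3a.secret.key", "<aws-secret-access-key>")

Run this in the same notebook before the read example so the configuration is in place when the file system is created.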
Azure
Use the following example code for Azure Blob storage.
%python

URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()
conf.set(
    "fs.azure.account.key.<account-name>.blob.core.windows.net",
    "<account-access-key>")

fs = Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>').getFileSystem(sc._jsc.hadoopConfiguration())
istream = fs.open(Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>'))
reader = sc._gateway.jvm.java.io.BufferedReader(sc._jvm.java.io.InputStreamReader(istream))

while True:
    thisLine = reader.readLine()
    if thisLine is not None:
        print(thisLine)
    else:
        break

istream.close()
where
- <account-name> is your Azure account name.
- <container-name> is the container name.
- <file-path> is the full path to the file.
- <account-access-key> is the account access key.
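Because both examples use the same Hadoop FileSystem calls, you can wrap the read logic in a small helper and pass in any supported URI (s3a:// or wasbs://). The following is a sketch built from the code above; the read_hadoop_file name and the commented example path are illustrative only.

%python

def read_hadoop_file(path_str):
    # Print a text file line by line from any Hadoop-compatible URI.
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    path = Path(path_str)
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    istream = fs.open(path)
    reader = sc._gateway.jvm.java.io.BufferedReader(
        sc._gateway.jvm.java.io.InputStreamReader(istream))
    try:
        while True:
            thisLine = reader.readLine()
            if thisLine is None:
                break
            print(thisLine)
    finally:
        reader.close()
        istream.close()

# Example call (illustrative placeholder path):
# read_hadoop_file('s3a://<bucket-name>/<file-path>')

Closing the reader and stream in a finally block ensures the connection is released even if the read fails partway through.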