Reading large DBFS-mounted files using Python APIs

Learn how to resolve errors when reading large DBFS-mounted files using Python APIs.

Written by Adam Pavlacka

Last published at: May 19th, 2022

This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs.


If you mount a folder onto dbfs:// and read a file larger than 2 GB with a Python API such as pandas, you will see the following error:

/databricks/python/local/lib/python2.7/site-packages/pandas/ in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)()
/databricks/python/local/lib/python2.7/site-packages/pandas/ in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6883)()
IOError: Initializing from file failed


The error occurs because an argument in the Python method that reads the file is a signed 32-bit int holding the file length. If the file is larger than 2 GB, its length exceeds the maximum signed int value and initialization fails.
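A quick check makes the overflow concrete. This is an illustrative sketch only; the actual length check happens inside the compiled parser, not in Python code:

```python
# Maximum value of a signed 32-bit int, the type used for the file length.
MAX_SIGNED_INT = 2**31 - 1   # 2,147,483,647

# A 2 GB file is one byte past that limit, so the length no longer fits.
two_gb = 2 * 1024**3         # 2,147,483,648 bytes
print(two_gb > MAX_SIGNED_INT)  # prints True
```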


Move the file from dbfs:// to the local file system (file://), then read it with the Python API. For example:

  1. Copy the file from dbfs:// to file://:
    %fs cp dbfs:/mnt/large_file.csv file:/tmp/large_file.csv
  2. Read the file with the pandas API:
    import pandas as pd
    pd.read_csv('/tmp/large_file.csv')
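The two steps above can also be sketched in pure Python. The DBFS source path is an assumption (on Databricks, DBFS paths are visible to local Python under /dbfs); here a small stand-in file keeps the sketch self-contained and runnable:

```python
import shutil
import pandas as pd

# In a real workspace the source would be the DBFS mount, e.g.
# src = "/dbfs/mnt/large_file.csv" (hypothetical path).
# A small stand-in file is created here so the sketch runs anywhere.
src = "/tmp/source_large_file.csv"
with open(src, "w") as f:
    f.write("id,value\n1,a\n2,b\n")

# Step 1: copy to local disk (same effect as %fs cp dbfs:/... file:/...).
dst = "/tmp/large_file.csv"
shutil.copy(src, dst)

# Step 2: read from the local file system, avoiding the 2 GB limit.
df = pd.read_csv(dst)
print(len(df))  # prints 2
```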
