Problem
When you need to load a high volume of small JSON files (for example, a million or more) into a Delta table, you have two query options. Query 1 uses INSERT INTO with read_files() and succeeds. Query 2 uses COPY INTO with a direct file directory pattern and fails with the following error.
ERROR: Job aborted due to stage failure: java.lang.OutOfMemoryError: Java heap space
Query 1 example
INSERT INTO <catalog>.<schema>.<table> (
  SELECT
    <col-1>,
    <col-2>,
    <col-3>
  FROM
    read_files(
      "/Volumes/<folder-path>",
      schema => "<col-1> STRING, <col-2> STRING, <col-3> STRING"
    )
)
Query 2 example
COPY INTO <catalog>.<schema>.<table> FROM (
  SELECT
    <col-1>::STRING <col-1>,
    <col-2>::STRING <col-2>,
    <col-3>::STRING <col-3>
  FROM '/Volumes/<folder-path>'
)
FILEFORMAT = JSON
PATTERN = '*.json'
FORMAT_OPTIONS ('multiline' = 'true')
Cause
The core difference between the two SQL queries is how they handle file listings and metadata management.
The read_files() function reads files directly without relying on recursive directory listing at the driver level. It instead constructs the DataFrame using an explicitly provided schema and the specified file paths or prefix, minimizing driver memory operations. In the following image (which shows a test reproduction environment with dummy data), no file and directory listing operation appears.
The COPY INTO command retrieves metadata about all files in the specified source directory or prefix that match the provided PATTERN, and applies the format-specific options FILEFORMAT and FORMAT_OPTIONS.
The command uses the metadata service to populate a file index, InMemoryFileIndex, a process that happens in the driver's memory. In the following image (which shows a test reproduction environment with dummy data), an additional Apache Spark job is launched to perform a file and directory listing operation. The additional job is highlighted with a red box.
For directories containing a large number of files, this additional operation (which runs purely on the driver) can lead to a Java heap out of memory (OOM) error.
Solution
Use INSERT INTO with the read_files() function, as shown in Query 1. This approach avoids the driver-intensive operation of listing leaf files and directories, which can cause a Java heap OOM error.
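If your COPY INTO query depended on a file name pattern and multi-line JSON parsing, the equivalent behavior can typically be expressed through read_files() options. The following is a minimal sketch; the format, fileNamePattern, and multiLine options shown here are assumptions about your source layout and should be verified against the read_files() documentation for your Databricks Runtime version.
INSERT INTO <catalog>.<schema>.<table> (
  SELECT
    <col-1>,
    <col-2>,
    <col-3>
  FROM
    read_files(
      "/Volumes/<folder-path>",
      format => 'json',
      -- Assumed equivalent of PATTERN = '*.json' in the COPY INTO query.
      fileNamePattern => '*.json',
      -- Assumed equivalent of FORMAT_OPTIONS ('multiline' = 'true').
      multiLine => true,
      schema => "<col-1> STRING, <col-2> STRING, <col-3> STRING"
    )
)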
If you need to use the COPY INTO command, consider the following best practices to avoid Java heap OOM errors.
- Increase the driver memory allocation in your cluster configuration.
- Split the source data into smaller directories with fewer files to reduce the number of files processed during the `InMemoryFileIndex` operation (see the sketch after this list).
- Design your data layout with partitioning and optimal file sizes to minimize the impact of file listing operations on driver memory.
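For example, if the source files can be grouped into subdirectories (the <sub-folder-1> placeholder below is hypothetical), you can run one COPY INTO per subdirectory so that each `InMemoryFileIndex` operation lists far fewer files. COPY INTO skips files it has already loaded, so repeating the command over overlapping paths does not create duplicates.
COPY INTO <catalog>.<schema>.<table> FROM (
  SELECT
    <col-1>::STRING <col-1>,
    <col-2>::STRING <col-2>,
    <col-3>::STRING <col-3>
  -- Point at one subdirectory at a time instead of the full prefix,
  -- so the driver only lists the files under this subdirectory.
  FROM '/Volumes/<folder-path>/<sub-folder-1>'
)
FILEFORMAT = JSON
PATTERN = '*.json'
FORMAT_OPTIONS ('multiline' = 'true')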