Query using COPY INTO with a direct file directory pattern fails with “ERROR: Job aborted due to stage failure” OOM error

Use INSERT INTO with read_files() instead.

Written by Amruth Ashoka

Last published at: April 30th, 2025

Problem

When you need to load a high volume of small JSON files (for example, a million or more) into a Delta table, you have two query options. Query 1 uses INSERT INTO with read_files() and succeeds. 

 

Query 2 uses COPY INTO with a direct file directory pattern and fails with the following error.

ERROR: Job aborted due to stage failure: java.lang.OutOfMemoryError: Java heap space

 

Query 1 example

INSERT INTO <catalog>.<schema>.<table> (
  SELECT
    <col-1>,
    <col-2>,
    <col-3>
  FROM
    read_files(
      "/Volumes/<folder-path>",
      schema =>
        "<col-1> STRING, <col-2> STRING, <col-3> STRING"
    )
)
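 

If your source files are multiline JSON, as in the Query 2 example below, you can pass the corresponding reader options to read_files() as named parameters. The following is a minimal sketch rather than part of the original queries; it assumes the format and multiLine options match your files, and the path and column placeholders are illustrative.

-- Sketch: INSERT INTO with read_files(), passing JSON reader options explicitly.
-- The path, columns, and the multiLine option are illustrative assumptions.
INSERT INTO <catalog>.<schema>.<table> (
  SELECT
    <col-1>,
    <col-2>,
    <col-3>
  FROM
    read_files(
      "/Volumes/<folder-path>",
      format => "json",
      multiLine => true,
      schema =>
        "<col-1> STRING, <col-2> STRING, <col-3> STRING"
    )
)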

 

Query 2 example

COPY INTO <catalog>.<schema>.<table> FROM (
  SELECT
    <col-1>::STRING <col-1>,
    <col-2>::STRING <col-2>,
    <col-3>::STRING <col-3>
  FROM '/Volumes/<folder-path>'
) FILEFORMAT = JSON PATTERN = '*.json' FORMAT_OPTIONS ('multiline' = 'true')

 

Cause

The core difference between the two SQL queries is how they handle file listings and metadata management.

 

The read_files() function reads files directly without relying on recursive directory listing at the driver level. It instead constructs the DataFrame using an explicitly provided schema and the specified file paths or prefix, minimizing driver memory operations. In the following image (which shows a test reproduction environment with dummy data), no file and directory listing operation appears. 

 

The COPY INTO command retrieves metadata about all files in the specified source directory or prefix to match the provided PATTERN, and applies the format-specific options FILEFORMAT and FORMAT_OPTIONS.

 

The command uses the metadata service to populate a file index, InMemoryFileIndex, a process that happens in the driver’s memory. In the following image (which shows a test reproduction environment with dummy data), an additional Apache Spark job is launched to perform a file and directory listing operation. The additional job is highlighted with a red box. 

 

For directories containing a large number of files, this additional operation (which runs purely on the driver) can lead to a Java heap out of memory (OOM) error. 

 

Solution

Use INSERT INTO with the read_files() function instead. This approach avoids the driver-intensive operation of listing leaf files and directories, which can cause a Java heap OOM error.

 

If you need to use the COPY INTO command, consider the following best practices to avoid Java heap OOM errors. 

  • Increase the driver memory allocation in your cluster configuration.
  • Split the source data into smaller directories with fewer files to reduce the number of files processed during the InMemoryFileIndex operation (see the sketch after this list).
  • Design your data layout with partitioning and optimal file sizes to minimize the impact of file listing operations on driver memory.
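
To illustrate the second practice, the following sketch splits one large COPY INTO load into separate statements, each targeting a smaller subdirectory so that far fewer files are listed per run. The subdirectory names (<sub-dir-1>, <sub-dir-2>) are hypothetical placeholders for however your data is already organized.

-- Sketch: run COPY INTO against one subdirectory at a time instead of the whole prefix.
-- The subdirectory layout (<sub-dir-1>, <sub-dir-2>) is a hypothetical placeholder.
COPY INTO <catalog>.<schema>.<table> FROM (
  SELECT
    <col-1>::STRING <col-1>,
    <col-2>::STRING <col-2>,
    <col-3>::STRING <col-3>
  FROM '/Volumes/<folder-path>/<sub-dir-1>'
) FILEFORMAT = JSON PATTERN = '*.json' FORMAT_OPTIONS ('multiline' = 'true');

-- Repeat for the remaining subdirectories, for example '/Volumes/<folder-path>/<sub-dir-2>'.

Because COPY INTO skips files it has already loaded, these per-directory statements can be rerun or resumed without duplicating data.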