XML file read executes slowly

Define the schema explicitly.

Last published at: June 9th, 2025

Problem

While reading a large XML file from any storage location in Databricks, you notice slow performance.

Cause

No schema is provided. Without a provided schema, Apache Spark must scan the file to infer the schema.

With XML files, schema inference is not splittable, so the scan cannot be parallelized. Spark reads and processes the entire file in a single task, which is why large files take longer.

Solution

Define the schema explicitly when reading XML files by passing it to the read operation. For example, schema("content STRING, item_id LONG").
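As a minimal sketch of this approach, the read below passes the schema as a DDL string so that Spark does not need to infer it. The function name, the file path, and the "item" row tag are hypothetical, and the column names simply reuse the example above. It assumes a Databricks runtime where the built-in xml format is available and where spark is the notebook's predefined session.

```python
# Hypothetical schema for illustration; replace with the columns in your XML.
XML_SCHEMA = "content STRING, item_id LONG"


def read_xml_with_schema(spark, path, row_tag):
    """Read an XML file with an explicit schema so Spark skips inference."""
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)  # XML element that maps to one row
        .schema(XML_SCHEMA)         # explicit schema instead of inference
        .load(path)
    )


# Usage in a Databricks notebook (path and row tag are placeholders):
# df = read_xml_with_schema(spark, "/path/to/large_file.xml", "item")
```

Keeping the schema in one constant makes it easy to reuse across jobs that read the same XML layout.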

Alternatively, use Auto Loader for file ingestion, which caches the schema after the first inference to avoid repeated overhead. For more information, refer to the “Schema inference and evolution in Auto Loader” section of the Read and write XML files (AWS | Azure | GCP) documentation.
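The Auto Loader alternative might look like the following sketch. The function name, paths, and row tag are hypothetical placeholders. The cloudFiles.schemaLocation option points to the directory where Auto Loader stores the inferred schema, which is what allows later runs to reuse it instead of inferring again.

```python
def stream_xml_with_auto_loader(spark, input_path, schema_location, row_tag):
    """Ingest XML files with Auto Loader, persisting the inferred schema."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "xml")
        # Directory where Auto Loader stores the inferred schema for reuse.
        .option("cloudFiles.schemaLocation", schema_location)
        .option("rowTag", row_tag)  # XML element that maps to one row
        .load(input_path)
    )


# Usage in a Databricks notebook (all arguments are placeholders):
# stream_df = stream_xml_with_auto_loader(
#     spark, "/path/to/xml_dir", "/path/to/schema_dir", "item"
# )
```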
