Problem
While reading a large XML file from any storage location in Databricks, you notice slow performance.
Cause
No schema is provided. Without a provided schema, Apache Spark must scan the file to infer the schema.
With XML files, schema inference is not splittable, so the scan cannot be parallelized. Spark reads and processes the entire file in a single task, which takes longer.
Solution
Define the schema explicitly when reading XML files by adding it to the read operation parameters. For example, schema("content STRING, item_id LONG").
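The following is a minimal sketch of a read with an explicit schema, not a definitive implementation. It assumes a runtime where the xml data source is available. The helper name read_xml_with_schema, the row tag, and the file path are hypothetical; only the two column names come from the example above.

```python
# Schema from the example above, written as a DDL string.
XML_SCHEMA_DDL = "content STRING, item_id LONG"


def read_xml_with_schema(spark, path, row_tag):
    """Read an XML file with an explicit schema so Spark skips inference.

    `spark` is an existing SparkSession. `path` and `row_tag` are
    placeholders (assumptions); replace them with your own file
    location and the XML element that represents one row.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", row_tag)   # element that maps to one row
        .schema(XML_SCHEMA_DDL)      # explicit schema, no inference scan
        .load(path)
    )
```

In a Databricks notebook, where spark is already defined, a call might look like read_xml_with_schema(spark, "/path/to/large_file.xml", "item"), with both arguments replaced by your own values.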
Alternatively, use Auto Loader for file ingestion, which caches the schema after the first inference to avoid repeated overhead. For more information, refer to the “Schema inference and evolution in Auto Loader” section of the Read and write XML files (AWS | Azure | GCP) documentation.
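As a rough sketch of the Auto Loader alternative, the helper below builds a streaming read that stores the inferred schema in a schema location. The helper name stream_xml_with_auto_loader and all argument values are hypothetical. Check the option names against the documentation for your runtime before relying on them.

```python
def stream_xml_with_auto_loader(spark, source_path, schema_location, row_tag):
    """Build an Auto Loader streaming read for XML files.

    `spark` is an existing SparkSession. `source_path`, `schema_location`,
    and `row_tag` are placeholders (assumptions) for your input directory,
    a directory where Auto Loader can persist the schema, and the XML
    element that represents one row.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "xml")
        .option("cloudFiles.schemaLocation", schema_location)  # persisted schema
        .option("rowTag", row_tag)
        .load(source_path)
    )
```

The returned streaming DataFrame still needs a writeStream sink and a checkpoint location before any data is processed.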