Column drift when reading multiple delimited files

Ensure that all files being processed together have the same schema.

Written by lakshay.goel

Last published at: September 23rd, 2024

Problem 

You notice column drift when reading multiple delimited files in a single spark.read operation. The problem manifests as values being mapped to the wrong columns, leading to data integrity issues. 

Example 

spark.read.format("csv").load("<source-directory>/*")

Where <source-directory> contains multiple CSV files.

Cause

When multiple files with different schemas are read together, Databricks infers the schema from a sample of records. 

Example

You have 10 files in a source directory, each with two columns, column A and column B. Some files list column A first, while others list column B first. 

If the schema is inferred from files with column A first, Spark applies that schema to every file by column position, not by column name, so files with column B first are mapped incorrectly. This is expected behavior when there is a schema difference among the source files.

Solution

  • Ensure that all files being processed together have the same schema. You can standardize the source files’ schema before processing. 
  • If processing multiple files with different schemas is unavoidable, process each file individually to avoid schema inference issues.
  • Regularly monitor and validate the schema of the source files to ensure consistency.