Problem
When using the following code to read a CSV file as a DataFrame, you notice multiple identical files being written to badRecordsPath instead of just one file, and you cannot see which malformed records are present in the table.
df_reader = (
    spark.read.format("com.databricks.spark.csv")  # Specifies the format as CSV using Databricks' CSV reader
    .schema(<your-schema>)  # Applies a predefined schema to the DataFrame
    .option("header", "true" if headers is None else "false")  # Sets the header option based on the presence of headers
    .option("delimiter", sep)  # Specifies the delimiter used in the CSV file
    .option("badRecordsPath", <your-bad-records-path>)  # Specifies a path to store records that are malformed or corrupt
)

df = df_reader.load(<your-filepath>)  # Loads the CSV file from the specified filepath into a DataFrame
Cause
Although badRecordsPath specifies a location to store records that do not conform to the expected schema or that encounter errors during processing, badRecordsPath does not provide transactional guarantees. As a result, retried tasks can each write their own copy of the same bad records, producing multiple identical files instead of one.
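To confirm the duplication, you can list the contents of the bad records location. The following is a minimal sketch, assuming you are running in a Databricks notebook (where dbutils is available) and that <your-bad-records-path> is the same path passed to the reader.

# Minimal sketch: list the files written under badRecordsPath.
# Assumes a Databricks notebook, where dbutils is available, and that
# <your-bad-records-path> matches the path used in the reader options.
for entry in dbutils.fs.ls("<your-bad-records-path>"):
    print(entry.path)  # Each timestamped subdirectory holds bad-record files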
Solution
Use the .option("mode", "PERMISSIVE") setting to configure the CSV reader to handle schema mismatches and malformed records. The following code provides an example.
df_reader = (
    spark.read.format("csv")
    .schema(<your-schema>)  # Assuming 'schema' is defined in the context
    .option("header", "false")
    .option("mode", "PERMISSIVE")
    .option("rescuedDataColumn", "<your-column-name-for-rescued-data>")  # Column that captures mismatched fields, for example _rescued_data
)

df = df_reader.load(<your-filepath>)  # Loads the CSV file from the specified filepath into a DataFrame
For more information, refer to the Read CSV files (AWS | Azure | GCP) documentation.
Additional benefits
Using .option("mode", "PERMISSIVE") has two additional benefits.
- Any fields in the CSV file that do not match the predefined schema are captured in a special column called _rescued_data. This column contains a JSON blob with the mismatched fields and the source file path of the record, allowing you to inspect and handle these fields separately without losing any data (see the sketch after this list).
- The reader does not drop the entire row when there are schema mismatches. Instead, it inserts null for fields that could not be parsed correctly and captures the mismatched data in the _rescued_data column. This ensures that you do not lose any records due to schema mismatches, and you can later review and correct the data as needed.
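For example, after loading the DataFrame you can isolate the rescued rows for review. The following is a minimal sketch, assuming the rescued-data column was named _rescued_data in the reader options.

from pyspark.sql.functions import col

# Minimal sketch: inspect rows that had schema mismatches.
# Assumes the reader was configured with .option("rescuedDataColumn", "_rescued_data").
rescued = df.filter(col("_rescued_data").isNotNull())
rescued.select("_rescued_data").show(truncate=False)  # JSON blob of mismatched fields and source file path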