Problem
When using the following code to read a CSV file as a DataFrame, you notice multiple identical files being written to badRecordsPath instead of just one file, and you cannot see which malformed records are present in the table.
df_reader = (
    spark.read.format("com.databricks.spark.csv")  # Specifies the format as CSV using Databricks' CSV reader
    .schema(<your-schema>)  # Applies a predefined schema to the DataFrame
    .option("header", "true" if headers is None else "false")  # Sets the header option based on the presence of headers
    .option("delimiter", sep)  # Specifies the delimiter used in the CSV file
    .option("badRecordsPath", <your-bad-records-path>)  # Specifies a path to store records that are malformed or corrupt
)

df = df_reader.load(<your-filepath>)  # Loads the CSV file from the specified filepath into a DataFrame
Cause
Although badRecordsPath specifies a location to store records that do not conform to the expected schema or that encounter errors during processing, badRecordsPath does not provide transactional guarantees. As a result, retried tasks can each write their own copy of the same bad records, producing multiple identical files instead of one.
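To confirm the duplication, you can list the contents of the bad records location. The following is a minimal sketch, assuming you are running in a Databricks notebook (where dbutils is available) and that <your-bad-records-path> is the same path passed to the reader.

# Minimal sketch: list the files written under badRecordsPath.
# Assumes a Databricks notebook, where dbutils is available, and that
# <your-bad-records-path> matches the path used in the reader options.
for entry in dbutils.fs.ls("<your-bad-records-path>"):
    print(entry.path)  # Each timestamped subdirectory holds bad-record files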
Solution
Use the .option("mode", "PERMISSIVE") setting to configure the CSV reader to handle schema mismatches and malformed records. The following code provides an example.
df_reader = (
    spark.read.format("csv")
    .schema(<your-schema>)  # Assuming 'schema' is defined in the context
    .option("header", "false")
    .option("mode", "PERMISSIVE")
    .option("rescuedDataColumn", "<your-column-name-for-rescued-data>")  # Column that captures mismatched fields, for example _rescued_data
)

df = df_reader.load(<your-filepath>)  # Loads the CSV file from the specified filepath into a DataFrame
For more information, refer to the Read CSV files (AWS | Azure | GCP) documentation.
Additional benefits
Using .option("mode", "PERMISSIVE") has two additional benefits.
- Any fields in the CSV file that do not match the predefined schema are captured in a special column called _rescued_data. This column contains a JSON blob with the mismatched fields and the source file path of the record, allowing you to inspect and handle these fields separately without losing any data (see the sketch after this list).
- The reader does not drop the entire row when there are schema mismatches. Instead, it inserts null for fields that could not be parsed correctly and captures the mismatched data in the _rescued_data column. This ensures that you do not lose any records due to schema mismatches, and you can later review and correct the data as needed.
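For example, after loading the DataFrame you can isolate the rescued rows for review. The following is a minimal sketch, assuming the rescued-data column was named _rescued_data in the reader options.

from pyspark.sql.functions import col

# Minimal sketch: inspect rows that had schema mismatches.
# Assumes the reader was configured with .option("rescuedDataColumn", "_rescued_data").
rescued = df.filter(col("_rescued_data").isNotNull())
rescued.select("_rescued_data").show(truncate=False)  # JSON blob of mismatched fields and source file path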