Reading a CSV file in DROPMALFORMED mode still includes malformed rows in the result

Written by shubham.bhusate

Last published at: November 7th, 2024

Problem

When you read a CSV file in DROPMALFORMED mode with a schema specified through .schema(), functions such as df.count() or df.agg(count('*')).display() still include malformed rows in the count they return.

 

Cause

The functions df.count() and df.agg(count('*')).display() each effectively count the lines in the file without fully parsing each row according to the schema, because a count does not require any column values. For example, if a row contains a string in a column that the schema defines as an integer, df.count() still counts that row.
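The behavior can be illustrated with a minimal PySpark sketch. The sample data, the file name, the header option, and the helper names (write_sample_csv, read_drop_malformed, compare_counts) are all hypothetical and are not part of the original article; spark is assumed to be an existing SparkSession, such as the one predefined in a Databricks notebook.

```python
import os
import tempfile

# Hypothetical sample data: the second data row has a string ("two")
# where the schema expects an integer, so that row is malformed.
SAMPLE_CSV = "id,name\n1,alice\ntwo,bob\n3,carol\n"

# Schema written as a DDL string, which DataFrameReader.schema() accepts.
SCHEMA = "id INT, name STRING"


def write_sample_csv(directory=None):
    """Write the sample file to a local directory and return its path."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "people.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE_CSV)
    return path


def read_drop_malformed(spark, path):
    """Read a CSV file with an explicit schema in DROPMALFORMED mode."""
    return (
        spark.read
        .schema(SCHEMA)
        .option("header", "true")  # assumption: the file has a header row
        .option("mode", "DROPMALFORMED")
        .csv(path)
    )


def compare_counts(spark, path):
    """Return (count() result, number of collected rows) for one read."""
    df = read_drop_malformed(spark, path)
    # count() needs no column values, so rows may not be checked
    # against the schema and malformed rows can be included.
    counted = df.count()
    # collect() needs every column, so each row is fully parsed.
    materialized = len(df.collect())
    return counted, materialized
```

If the two numbers returned by compare_counts(spark, path) differ, the difference is the number of malformed rows that count() included. Depending on the environment, Spark may need a file: prefix or a DBFS path rather than a plain local path.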

 

Solution

Databricks recommends caching the DataFrame after reading it with df.cache(), and then calling the df.count() function. Populating the cache requires every column, which forces Apache Spark to fully parse the data and apply the DROPMALFORMED mode correctly.
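A minimal sketch of this recommendation in PySpark follows. The function name count_after_cache, its parameters, and the header option are hypothetical; spark is assumed to be an existing SparkSession, and schema can be a DDL string such as "id INT, name STRING" or a StructType.

```python
def count_after_cache(spark, path, schema):
    """Read a CSV file in DROPMALFORMED mode, cache it, then count it.

    Returns the cached DataFrame together with the count so the caller
    can reuse the DataFrame and release the cache when finished.
    """
    df = (
        spark.read
        .schema(schema)
        .option("header", "true")  # assumption: the file has a header row
        .option("mode", "DROPMALFORMED")
        .csv(path)
    )
    # cache() marks the DataFrame for caching; the count() action that
    # follows populates the cache, which requires parsing every column.
    cached = df.cache()
    return cached, cached.count()
```

When the DataFrame is no longer needed, calling unpersist() on it releases the cached data.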