Problem
When you read a CSV file in DROPMALFORMED mode with a schema specified through .schema(), functions such as df.count() or df.agg(count('*')).display() still include malformed rows in the returned result.
Cause
The functions df.count() and df.agg(count('*')).display() each count the records in the file without fully parsing each row against the schema. For example, if a row contains a string where the schema expects an integer, df.count() still counts that row.
Solution
Databricks recommends caching the DataFrame after reading it and then calling df.count(). Caching forces Apache Spark to fully parse the data, so DROPMALFORMED mode is applied correctly and malformed rows are excluded from the count.