Problem
When reading Parquet files using spark.read.parquet, you notice trailing zeros added to decimal values.
Example
A source has the data (1, 50.0); (2, 6.2300); (3, 4.56). After reading the data in Parquet format, the values appear with trailing zeros:
id | value
1  | 50.0000
2  | 6.2300
3  | 4.5600
Cause
Apache Spark infers the schema for Parquet tables from the column values and assigns a single scale to the entire decimal column. Trailing zeros are appended to the right of the decimal point so that every value matches that scale. In the example in the problem statement, trailing zeros were added to pad each value to a consistent four places after the decimal point.
This behavior is by design to maintain consistency and precision when processing decimal data.
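The effect of a single fixed scale can be illustrated with Python's standard decimal module alone (a local analogy, not Spark code): quantizing each value to a scale of four reproduces the padding seen above.

```python
from decimal import Decimal

# A Spark decimal column has one (precision, scale) for the whole column.
# Quantizing every value to scale 4 mimics that column-wide scale.
fixed_scale = Decimal("0.0000")  # scale = 4
values = [Decimal("50.0"), Decimal("6.2300"), Decimal("4.56")]
padded = [v.quantize(fixed_scale) for v in values]
print(padded)  # [Decimal('50.0000'), Decimal('6.2300'), Decimal('4.5600')]
```

The numeric values are unchanged; only the number of digits stored after the decimal point is made uniform.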
Solution
To control the appearance of trailing zeros without altering the underlying data type or precision, use the format_number function, which lets you specify the number of decimal places to display. In the following example, format_number(col("<your-decimal-column>"), 2) formats the values in the decimal column to two decimal places. Adjust the second parameter (2) to control the number of decimal places displayed. Note that format_number returns a string column intended for display, not for further arithmetic.
Example
from pyspark.sql.functions import format_number, col

# Read the Parquet data.
df = spark.read.parquet("<your-parquet-file-path>")

# Add a display column with the decimal values formatted to two places.
df = df.withColumn("<formatted-column>", format_number(col("<your-decimal-column>"), 2))
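For intuition about the output, rounding to two places behaves much like Python's fixed-point string formatting (a plain-Python sketch, not Spark code; format_number additionally inserts thousands separators for large values):

```python
# Local illustration of formatting decimal values to two places as strings.
values = [50.0000, 6.2300, 4.5600]
formatted = [f"{v:,.2f}" for v in values]
print(formatted)  # ['50.00', '6.23', '4.56']
```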
For more information, review the format_number function (AWS | Azure | GCP) documentation.