Trailing zeros in decimal values appear when reading Parquet files in Apache Spark

Use the format_number function to format decimal values without altering data precision.

Written by nelavelli.durganagajahnavi

Last published at: December 23rd, 2024

Problem

When reading Parquet files using spark.read.parquet, you notice trailing zeros added to decimal values.


Example

A source contains the rows (1, 50.0), (2, 6.2300), and (3, 4.56). After reading the data in Parquet format, the values appear with trailing zeros.

id    value
1     50.0000
2     6.2300
3     4.5600


Cause

Apache Spark infers the schema for Parquet tables based on the column values and assigns a consistent scale to all decimal values. Trailing zeros appear to the right of the decimal point after all non-zero digits to ensure scale uniformity across the dataset. In the example in the problem statement, trailing zeros were added to achieve a consistent four places after the decimal. 

This behavior is by design to maintain consistency and precision when processing decimal data.
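The rescaling described above can be illustrated without Spark. The following sketch uses Python's decimal module (an illustration of the behavior, not Spark's actual implementation) to rescale the example values to the column's largest scale, reproducing the trailing zeros shown in the table:

```python
from decimal import Decimal

# The three source values from the example above, with mixed scales.
values = [Decimal("50.0"), Decimal("6.2300"), Decimal("4.56")]

# A Parquet decimal column carries a single (precision, scale) pair,
# so every value is rescaled to the largest scale in the column -- here 4.
max_scale = max(-v.as_tuple().exponent for v in values)
uniform = [v.quantize(Decimal(1).scaleb(-max_scale)) for v in values]

print(uniform)  # [Decimal('50.0000'), Decimal('6.2300'), Decimal('4.5600')]
```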


Solution

To address the appearance of trailing zeros without altering the underlying data type or precision, use the format_number function. This function allows you to specify the desired number of decimal places to display.

The line format_number(col("<your-decimal-column>"), 2) in the following example formats the values in your decimal column to two decimal places, returning them as strings. Adjust the second parameter (2) to control the number of decimal places displayed.


Example

from pyspark.sql.functions import format_number, col

df = spark.read.parquet("<your-parquet-file-path>")
# format_number returns a string column; the original decimal column is unchanged.
df = df.withColumn("<formatted-column>", format_number(col("<your-decimal-column>"), 2))
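Keep in mind that format_number produces a string column rounded to the requested number of decimal places, with thousands separators for large values. Its output can be approximated in plain Python (a rough illustration, not Spark's implementation) as:

```python
# format_number-style rendering, approximated in plain Python: the value
# is rounded to d decimal places and grouped with thousands separators.
def format_number_like(x, d):
    return f"{x:,.{d}f}"

print(format_number_like(50.0000, 2))  # 50.00
print(format_number_like(6.2300, 2))   # 6.23
print(format_number_like(1234.5, 2))   # 1,234.50
```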


For more information, review the format_number function (AWS | Azure | GCP) documentation.