Problem
When working with numbers that have more than 38 digits (high-precision numbers) in Databricks SQL, you notice that the result of a SUM operation does not match the expected aggregated value.
Example
The reported aggregated sum is 11111111101111.111111110111111111011112, but you expect 11111111101111.11111111011111111101111111110111111110.
The difference arises only in cases where the source system stores values with extremely high precision.
Cause
Apache Spark SQL uses the DECIMAL type, which has a maximum precision of 38 digits. This means the total number of digits you can store (on both sides of the decimal point combined) is 38. Digits beyond that limit are lost to rounding, which is why the aggregated sum in the example is shorter than expected.
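For example, if the source strings are cast to DECIMAL(38,24) (an assumed precision and scale; your column definition may differ), summing them reproduces the rounded value shown in the Problem section. A minimal sketch:
from pyspark.sql.functions import sum as spark_sum
values = [
    ("1234567890123.12345678901234567890123456789012345678",),
    ("9876543210987.98765432109876543210987654321098765432",),
]
# Casting to DECIMAL(38,24) keeps only 38 total digits, so the extra
# fractional digits are rounded away before the aggregation runs.
df = spark.createDataFrame(values, ["value"]).selectExpr(
    "CAST(value AS DECIMAL(38,24)) AS value"
)
df.select(spark_sum("value").alias("total")).show(truncate=False)
# total: 11111111101111.111111110111111111011112 (the rounded value from the Problem section)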
Solution
Use a Resilient Distributed Dataset (RDD) with Python's arbitrary-precision Decimal type instead of a DataFrame.
Note
The Decimal type used in the following code comes from Python's built-in decimal module. It is not a Databricks or Spark data type.
Run the following code in a notebook.
from decimal import Decimal, getcontext

# Replace the data with your values. These values are examples.
data = [
    ("1234567890123.12345678901234567890123456789012345678",),
    ("9876543210987.98765432109876543210987654321098765432",),
]

# Create the RDD directly (bypassing the DataFrame API)
rdd = spark.sparkContext.parallelize(data)

# Raise the decimal context precision before adding so the result is not
# rounded to the module's default of 28 significant digits
def add_precise(a, b):
    getcontext().prec = 60
    return a + b

# Convert the strings to Decimal and reduce to the sum
decimal_sum = rdd.map(lambda x: Decimal(x[0])).reduce(add_precise)
# Print the result
print("High-precision Decimal Sum:", decimal_sum)
For the two example values, the sum displays as the following output.
11111111101111.11111111011111111101111111110111111110
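The full-precision result depends on raising the decimal context precision, because Python's decimal module rounds arithmetic results to the active context precision, which defaults to 28 significant digits. The following minimal sketch, runnable in plain Python without Spark, illustrates the difference:
from decimal import Decimal, localcontext

a = Decimal("1234567890123.12345678901234567890123456789012345678")
b = Decimal("9876543210987.98765432109876543210987654321098765432")

# At the module's default precision (28 significant digits), the 52-digit
# result of the addition is rounded.
with localcontext() as ctx:
    ctx.prec = 28
    print(a + b)  # 11111111101111.11111111011111

# With a higher precision, the addition is exact for these operands.
with localcontext() as ctx:
    ctx.prec = 60
    print(a + b)  # 11111111101111.11111111011111111101111111110111111110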