Extract timestamps with precision up to nanoseconds from a long column

Create a UDF to extract nanosecond-precision timestamps from a LongType column.

Written by G Yashwanth Kiran

Last published at: April 26th, 2025

Problem

You are working with Apache Spark timestamps and notice that the nanosecond precision of your datetime columns is not retained. When you convert a LongType column to TimestampType, the nanosecond component is lost, leading to inaccurate timestamps.

Cause

Spark does not natively support TimestampType with nanosecond precision; it currently supports only microsecond precision. As a result, directly casting a LongType column containing nanoseconds to TimestampType truncates the value.
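The truncation is easy to reproduce. The following minimal sketch assumes Spark 3.1 or above (for the timestamp_micros SQL function), a notebook or spark-shell session where spark is in scope, and a hypothetical epoch_nanos column holding epoch time in nanoseconds.

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical single-row DataFrame: epoch time in nanoseconds as LongType.
// 1500019800123456789 is 2017-07-14 08:10:00.123456789 UTC.
val df = Seq(1500019800123456789L).toDF("epoch_nanos")

// TimestampType stores microseconds internally, so even after scaling the
// long with exact integer arithmetic, the trailing 789 nanoseconds are gone.
// With a UTC session time zone this shows 2017-07-14 08:10:00.123456.
df.select(expr("timestamp_micros(epoch_nanos div 1000)").as("ts_micros")).show(false)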

Solution

To retain nanosecond precision, create a User-Defined Function (UDF) that processes the LongType column and converts it into a properly formatted string representation of the timestamp, keeping the nanosecond component intact.

Example code (Scala)

  1. Import the required libraries.
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.UserDefinedFunction


  2. Define the UDF to format the timestamp with nanosecond precision.
val formatInstantWithNanos: UserDefinedFunction = udf((seconds: Long, nanos: Long) => {
  // Rebuild the exact instant from its seconds and nanosecond parts.
  val instant = Instant.ofEpochSecond(seconds, nanos)
  // Nine fractional digits (SSSSSSSSS) preserve the full nanosecond component.
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS")
    .withZone(ZoneId.of("<timezone-of-your-choice>"))
  formatter.format(instant)
})


  3. Apply the UDF to the DataFrame.
val convertedDF = source_df.withColumn("<formatted-datetime-column>",
  formatInstantWithNanos(
    // Column division returns a double, so cast back to long to match the
    // UDF's (seconds, nanos) signature. The modulo stays in exact long
    // arithmetic, so the nanosecond part is preserved.
    (col("<column-with-longType>") / 1000000000L).cast("long"), // seconds part
    col("<column-with-longType>") % 1000000000L                 // nanoseconds part
  )
)


Here, source_df represents the original DataFrame containing the raw data, and <column-with-longType> is the LongType column from which you want to extract timestamps with nanosecond precision. The resulting convertedDF DataFrame adds the <formatted-datetime-column> column, which holds the timestamp with nanosecond precision as a StringType value.
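The following sketch makes the two steps concrete, using hypothetical names (epoch_nanos for <column-with-longType> and formatted_datetime for <formatted-datetime-column>) and assuming the UDF was defined with ZoneId.of("UTC").

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical input: one epoch-nanoseconds value as a LongType column.
val source_df = Seq(1500019800123456789L).toDF("epoch_nanos")

val convertedDF = source_df.withColumn("formatted_datetime",
  formatInstantWithNanos(
    (col("epoch_nanos") / 1000000000L).cast("long"), // seconds part
    col("epoch_nanos") % 1000000000L                 // nanoseconds part
  )
)

// With UTC in the UDF, this shows 2017-07-14 08:10:00.123456789 with all
// nine fractional digits intact.
convertedDF.show(false)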

Further downstream operations on the <formatted-datetime-column> column should treat it as a StringType column and use string operations, as in the example below.
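Because the yyyy-MM-dd HH:mm:ss.SSSSSSSSS pattern is fixed-width and zero-padded, lexicographic string comparisons order these timestamps chronologically (within a single time zone), so common operations still work. Continuing the hypothetical walk-through above:

// Sort chronologically using plain string ordering.
convertedDF.orderBy(col("formatted_datetime")).show(false)

// Filter at full nanosecond precision with a string comparison.
convertedDF.filter(col("formatted_datetime") > "2017-07-14 08:10:00.123456788").show(false)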

  • Use this UDF in Spark when dealing with timestamps stored in LongType to ensure that the nanosecond component is preserved.
  • Instead of converting directly to TimestampType, store the formatted timestamp as a string and modify downstream applications to work with this format.
  • If further processing is needed, downstream applications can parse the string to retrieve the nanosecond-precision timestamp, as in the sketch below.
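A minimal parsing sketch, assuming the string was written with the same yyyy-MM-dd HH:mm:ss.SSSSSSSSS pattern used in the UDF:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Parse the formatted string back into a date-time value that keeps nanoseconds.
val parser = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS")
val parsed = LocalDateTime.parse("2017-07-14 08:10:00.123456789", parser)

// The nanosecond component survives the round trip.
parsed.getNano  // 123456789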


Avoid casting to TimestampType

If you cast the StringType column to TimestampType, the nanosecond precision is lost, because Spark does not currently support nanosecond precision. For example, creating a new column <timestamp-col-casted> by casting the <formatted-datetime-column> column to TimestampType truncates the nanoseconds.


Example code casting to TimestampType

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Cast the formatted string to TimestampType and compare the two columns.
val dfWithTimestamp = convertedDF
  .withColumn("timestamp_col_casted", col("formatted_datetime").cast(TimestampType))
  .select("formatted_datetime", "timestamp_col_casted")
dfWithTimestamp.show(false)


Example result casting to TimestampType

The result shows that nanosecond precision is lost in the cast column.

+-----------------------------+--------------------------+
|formatted_datetime           |timestamp_col_casted      |
+-----------------------------+--------------------------+
|2017-07-14 08:10:00.123456789|2017-07-14 08:10:00.123456|
+-----------------------------+--------------------------+

dfWithTimestamp: org.apache.spark.sql.DataFrame = [formatted_datetime: string, timestamp_col_casted: timestamp]