Using datetime values in Spark 3.0 and above

How to correctly use datetime functions in Spark SQL with Databricks runtime 7.3 LTS and above.

Written by deepak.bhutada

Last published at: October 26th, 2022

Problem

You are migrating jobs from unsupported clusters running Databricks Runtime 6.6 and below with Apache Spark 2.4.5 and below to clusters running a current version of the Databricks Runtime.

If your jobs or notebooks perform date conversions, they may fail with a SparkUpgradeException error message when run on the upgraded clusters.

Error in SQL statement: SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'YYYY-MM-DD' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

Cause

Prior to Spark 3.0, Spark used a hybrid of the Julian and Gregorian calendars: the Julian calendar for dates before October 15, 1582, and the Gregorian calendar from that date onward.

In Spark 3.0 and above, Spark uses the Proleptic Gregorian calendar. This calendar is also used by other systems such as Apache Arrow, Pandas, and R.

Spark 2.x inherited this hybrid calendar behavior from the legacy java.sql.Date API. In Java 8, that API was superseded by java.time.LocalDate, which uses the Proleptic Gregorian calendar.

Solution

You should update your DateTime references so they are compatible with Spark 3.0 and above.

For example, trying to parse a date with the pattern YYYY-MM-DD returns an error in Spark 3.0 and above.

select TO_DATE('2017-01-01', 'YYYY-MM-DD') as date

Using the pattern yyyy-MM-dd works correctly in Spark 3.0 and above.

select TO_DATE('2017-01-01', 'yyyy-MM-dd') as date

The difference in capitalization may appear minor, but it matters to Spark: in a datetime pattern, D references the day-of-year, while d references the day-of-month.
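
As a quick illustration (a minimal sketch, assuming a cluster running Databricks Runtime 7.3 LTS or above), you can compare the two pattern letters with the built-in DATE_FORMAT function:

-- 'D' (day-of-year) vs. 'd' (day-of-month) for February 1st
select DATE_FORMAT(DATE '2017-02-01', 'D') as day_of_year,
       DATE_FORMAT(DATE '2017-02-01', 'd') as day_of_month

Here day_of_year returns 32 while day_of_month returns 1, so swapping d for D silently changes results even in patterns that Spark accepts.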

Review the Spark Datetime Patterns for Formatting and Parsing documentation (https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) for details on all of the defined pattern letters.


Info

If you want to temporarily revert to Spark 2.x DateTime formatting, you can set spark.sql.legacy.timeParserPolicy to LEGACY in a notebook. You can also set this value in the cluster's Spark config (AWS | Azure | GCP).
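
For example, a minimal sketch of the temporary workaround, run as two separate SQL statements in a notebook:

-- Revert to Spark 2.x datetime parsing for the current session only
SET spark.sql.legacy.timeParserPolicy = LEGACY;

-- The legacy pattern no longer raises SparkUpgradeException, but the result
-- may still differ from the intended date, so treat this as a stopgap.
select TO_DATE('2017-01-01', 'YYYY-MM-DD') as date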

While this option works, it is only recommended as a temporary workaround.


