Data is incorrect when read from Snowflake

Data read from Snowflake is incorrect when the time zone value is not set correctly.

Written by DD Sharma

Last published at: May 24th, 2022

Problem

You have a job that uses Apache Spark to read from a Snowflake table, but the time data that appears in the DataFrame is incorrect.

If you run the same query directly on Snowflake, the correct time data is returned.
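
The following sketch shows the kind of read where the symptom typically appears, assuming the Databricks Snowflake connector (format "snowflake"); the connection values and table name are placeholders.

%python

# Placeholder connection options for the Snowflake connector; substitute your own.
options = {
    "sfUrl": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Read a table that contains timestamp columns into a DataFrame.
df = (
    spark.read.format("snowflake")
    .options(**options)
    .option("dbtable", "<table_with_timestamps>")
    .load()
)

# Timestamp values here can appear shifted when the cluster and Snowflake
# session time zones do not match.
display(df)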

Cause

The time zone value was not set correctly. A mismatch between the time zone of the Databricks cluster and the time zone of the Snowflake session can result in incorrect time values, as explained in Snowflake's Working with Timestamps and Time Zones documentation.
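
As a quick check, you can compare the Spark session time zone on the cluster with the TIMEZONE parameter of your Snowflake session (for example, via SHOW PARAMETERS LIKE 'TIMEZONE' IN SESSION in Snowflake). A minimal sketch of the Databricks side:

%python

# Print the time zone Spark uses when interpreting session-local timestamps.
print(spark.conf.get("spark.sql.session.timeZone"))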

Solution

Set the time zone in Databricks and do not explicitly set a time zone in Snowflake.

Option 1: Set the time zone for SQL statements in Databricks

  1. Open the Databricks workspace.
  2. Select Clusters.
  3. Select the cluster you want to modify.
  4. Select Edit.
  5. Select Advanced Options.
  6. Enter spark.sql.session.timeZone <timezone> (for example, spark.sql.session.timeZone America/Los_Angeles) in the Spark config field. The sketch after this list shows the equivalent setting from a notebook.
  7. Select Confirm.
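
If you prefer to set the value for a single notebook session instead of in the cluster's Spark config, the same property can be set at runtime. This is a sketch assuming the IANA zone name America/Los_Angeles; substitute your own time zone.

%python

# Applies only to the current Spark session; the cluster-level Spark config
# above is the persistent equivalent.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")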

Option 2: Set the time zone for all nodes with an init script

  1. Create the init script with the following command:
    %python
    
    dbutils.fs.put("/databricks/scripts/set_timezone.sh","""
    #!/bin/bash
    timedatectl set-timezone America/Los_Angeles
    """, True)
  2. Verify the full path of the init script.
    %fs ls /databricks/scripts/set_timezone.sh
  3. Open the Databricks workspace.
  4. Select Clusters.
  5. Select the cluster you want to modify.
  6. Select Edit.
  7. Select Advanced Options.
  8. Select Init Scripts.
  9. Enter the init script path from step 2 (for example, dbfs:/databricks/scripts/set_timezone.sh).
  10. Select Add.
  11. Select Confirm.
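
After the cluster restarts with the init script attached, you can optionally confirm the node-level time zone from a notebook. This is a sketch; the exact output format of timedatectl may vary by image.

%python

import subprocess

# Print the current time zone settings reported by timedatectl on the driver node.
print(subprocess.run(["timedatectl"], capture_output=True, text=True).stdout)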