You create an external table in Athena and integrate it with Delta Lake using the instructions in the Presto and Athena to Delta Lake integration documentation.
The external table in Athena is defined in the Apache Hive metastore.
You run a select query on external table from the Athena Query Editor and it returns a
HIVE_CURSOR_ERROR: Can not read value at 0 in block 0 in file s3://<parquet-file-path>
The root cause of the issue is the different Parquet conventions used in Hive and Apache Spark.
In Spark 1.4 or above, the current Parquet format is used, and decimal values are written as integers. In earlier versions of the Parquet format, decimal values are written in Apache’s fixed-length byte array format. This Parquet format is used by other Apache systems such as Hive and Apache Impala.
As a result, you receive an error when the internal representation of the datatype is different due to using two Parquet formats.
If you are using Athena or Presto to access Delta Lake managed tables, the Parquet files must be created in a format that is compatible with Hive.
You cannot choose the Parquet convention in Hive, but you can do so with Spark.
spark.conf.set("spark.sql.parquet.writeLegacyFormat",True) property at either the cluster level or at notebook level to resolve the issue.
To set the properties at cluster level:
- Edit the cluster properties.
- Click Advanced Options.
- Select Spark.
- Enter the property setting in the Spark Config field.
- Confirm the change.
- Restart the cluster.
Changing the DDL of the table to match the Spark datatype does not return any query results.
spark.conf.set("spark.sql.parquet.writeLegacyFormat",True) must be set.
Please review the Parquet configuration documentation for more information.