Japanese character support in external metastore

Use Japanese characters in tables in an external metastore.

Written by Adam Pavlacka

Last published at: May 16th, 2022

Problem

You are trying to use Japanese characters in your tables, but keep getting errors.

Create a table with the OPTIONS keyword

OPTIONS provides extra metadata to the table. You try creating a table with OPTIONS and specify the charset as utf8mb4.

%sql

CREATE TABLE default.JPN_COLUMN_NAMES('作成年月' string
,'計上年月' string
,'所属コード' string
,'生保代理店コード_8桁' string
,'所属名' string
)
using csv  OPTIONS (path "/mnt/tabledata/testdata/", header "true", delimiter ",", inferSchema "false", ignoreLeadingWhiteSpace "false", ignoreTrailingWhiteSpace "false", multiLine "true", escape "\"" , charset "utf8mb4");

The result is an error.

Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:javax.jdo.JDODataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?)

Create a table without the OPTIONS keyword

You try to create a table without using OPTIONS.

%sql

CREATE TABLE test.JPN_COLUMN_NAMES (`作成年月` string ,`計上年月` string) USING csv
describe extended test.JPN_COLUMN_NAMES;

The table appears to be created, but the column names are shown as ???? instead of using the specified Japanese characters.

???? in column names.

Create a table with Hive table expression

You try creating a Hive format table and specify the charset as utf8mb4.

%sql

CREATE TABLE test.JPN_COLUMN_NAMES (`作成年月` string ,`計上年月` string)
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
   WITH SERDEPROPERTIES ( "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\",
    "serialization.encoding"='utf8mb4')
    TBLPROPERTIES ( 'store.charset'='utf8mb4',
    'retrieve.charset'='utf8mb4');

The result is an error.

Caused by: java.sql.SQLException: Incorrect string value: '\xE4\xBD\x9C\xE6\x88\x90...' for column 'COLUMN_NAME' at row 1
Query is: INSERT INTO COLUMNS_V2 (CD_ID,COMMENT,`COLUMN_NAME`,TYPE_NAME,INTEGER_IDX) VALUES (6544,<null>,'作成年月','string',0)

Cause

When a table is created, an entry is updated in the Hive metastore. The Hive metastore is typically a MySQL database.

When a new table is created, the names of the columns are inserted into the TABLE_PARAMS of the metastore.

The charset collation of PARAM_VALUE from TABLE_PARAMS is latin1_bin as collation and the charset is latin1.

%scala

executeQuery("""SELECT TABLE_SCHEMA , TABLE_NAME , COLUMN_NAME , COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'TABLE_PARAMS' """)

Latin1_bin collation image.

Solution

latin1 does not have support for Japanese characters, but UTF-8 does.

You need to use an external metastore with UTF-8_bin as collation and the charset as UTF-8.

Any MySQL database 5.6 or above can be used as a Hive metastore.

For this example, we are using MySQL 8.0.13-4.

  1. Create an external Apache Hive metastore (AWS | Azure | GCP).
  2. Create a database to instantiate the new metastore with default tables.
    %sql
    
    create database <database_name>
  3. The newly created tables can be explored in the external database objects browser or by using the show tables command.
    %sql
    
    -- Run in the metastore database.
    show tables in <database_name>
  4. Check the collation information in MySQL at the table level.
    %sql
    
    SELECT TABLE_COLLATION,TABLE_NAME,TABLE_TYPE,TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES where TABLE_TYPE like 'BASE%'
  5. Check the collation information in MySQL at the column level.
    %sql
    
    SELECT TABLE_SCHEMA , TABLE_NAME , COLUMN_NAME , COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS
  6. Change the charset from latin1 to UTF-8.
    %sql
    
    -- Run in the metastore database. All queries are compatible with MySQL.
    -- Change collation and charset across the database.
    ALTER DATABASE <database_name> CHARACTER SET utf8 COLLATE utf8_bin;
    -- Change collation and charset per table.
    ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;
    -- Change collation and charset at the column level.
    ALTER TABLE <table_name> MODIFY <column_name> <datatype> CHARACTER SET utf8 COLLATE utf8_bin;
  7. You can now correctly view Japanese characters when you display the table.
    Japanese characters in column names.