It is a pretty common yet often overlooked rule: when exporting/importing database columns to/from files (binary or flat) in UTF8 mode, a VARCHAR(**) column in Teradata (or VARCHAR(**) in Oracle) is always calculated as ** X 3 bytes, regardless of whether the DDL declares Latin or Unicode (in Oracle, BYTE or CHAR semantics). The same rule applies in OCI and SQL*Loader as well.
The funny thing is, if you switch from UTF8 mode to UTF16 mode in BTEQ/FEXP/TPT, the estimated column size becomes ** X 2 bytes, which is smaller than ** X 3 bytes. That sounds counterintuitive, but check it out for yourself.
I bet we have all run into the following error in BTEQ/ODBC/JDBC/TPT at some point: the rows on disk are clearly smaller than 64 KB, yet you cannot retrieve them without chopping off several columns! If you change your session charset from UTF8 to UTF16, you may avoid this annoying error in some cases, because the UTF16 estimate for VARCHAR/CHAR columns is 33% smaller than the UTF8 estimate:
[Error 3577] [SQLState HY000] Row size or Sort Key size overflow.
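To make the arithmetic concrete, here is a minimal sketch (not a Teradata client API; the function and limit names are my own) that applies the ** X 3 (UTF8) and ** X 2 (UTF16) rules to a set of declared VARCHAR lengths and shows how the same row can overflow the 64 KB limit in one mode but not the other:

```python
# Hypothetical sketch: worst-case client-side row size for VARCHAR(n)
# columns under each session charset, per the rules described above.

ROW_LIMIT = 64 * 1024          # 64 KB row/parcel limit
BYTES_PER_CHAR = {"UTF8": 3, "UTF16": 2}

def estimated_row_size(varchar_lengths, session_charset):
    """Sum n * bytes-per-char over all declared VARCHAR(n) lengths."""
    return sum(n * BYTES_PER_CHAR[session_charset] for n in varchar_lengths)

# A row whose declared widths total 25,000 characters:
cols = [8000, 8000, 8000, 1000]
print(estimated_row_size(cols, "UTF8"))   # 75000 -> exceeds 64 KB, error 3577
print(estimated_row_size(cols, "UTF16"))  # 50000 -> fits
```

Note the actual data on disk never changes; only the client-side estimate shrinks when you switch the session charset.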
The reality is that we need Unicode support for multi-language data, but only NAME, ADDRESS, and DESCRIPTION types of columns need Unicode storage; the remaining VARCHAR/CHAR columns, such as CODE or ID, are perfectly happy with Latin. So here is a feature request for the Teradata team: when estimating row size for client/server communication parcels (BTEQ, FEXP, TPT, ODBC, JDBC), it would be great
- in UTF8 mode: to use ** X 1 for Latin and ** X 3 for Unicode
- in UTF16 mode: to use ** X 1 for Latin and ** X 2 for Unicode
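The proposed charset-aware rule can be sketched the same way (again a hypothetical illustration of the feature request, not current client behavior): Latin columns cost ** X 1 byte in both modes, while Unicode columns keep the ** X 3 or ** X 2 factor.

```python
# Sketch of the proposed charset-aware estimate: each column carries its
# own CHARACTER SET, so Latin columns no longer pay the Unicode penalty.

def proposed_estimate(columns, session_charset):
    """columns: list of (declared_length, column_charset) tuples."""
    unicode_factor = {"UTF8": 3, "UTF16": 2}[session_charset]
    return sum(n * (1 if cs == "LATIN" else unicode_factor)
               for n, cs in columns)

# A mostly-Latin row with one Unicode DESCRIPTION column:
cols = [(20, "LATIN"), (10, "LATIN"), (2000, "UNICODE")]
print(proposed_estimate(cols, "UTF8"))   # 20 + 10 + 6000 = 6030
print(proposed_estimate(cols, "UTF16"))  # 20 + 10 + 4000 = 4030
```

Under today's rule the same row would be estimated at 6090 bytes in UTF8 mode (2030 X 3), so the savings grow with the share of Latin columns.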