unicode column lengths multiplied by 4

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

roblew
Charter Member
Posts: 123
Joined: Mon Mar 27, 2006 7:32 pm
Location: San Ramon


Post by roblew »

Hello,

We are running DS EE 7.5.1A on Linux with NLS installed. When we use the Oracle stage with the "replace" write method, any column defined with the "unicode" Extended option is created in Oracle with its length multiplied by 4.

For example, a column defined in the Oracle stage as unicode varchar 10 is created in Oracle as varchar2(40). We get the same results with the nvarchar and nchar datatypes. This also causes jobs that read from the same table with the same table definitions to come back with "truncation may occur" warnings, since the table was created larger than intended.

We have the Oracle parameter NLS_LENGTH_SEMANTICS set to CHAR, which means a varchar2(10 CHAR) can store any multibyte string up to 10 characters long, yet DataStage still insists on creating the column at four times its intended size. Is there an option to turn off this behaviour?
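To illustrate (the table and column names here are made up, not our real schema), this is the difference between what we intend and what the stage generates:

Code: Select all

    -- What we intend, given NLS_LENGTH_SEMANTICS = CHAR:
    CREATE TABLE example_tbl (name VARCHAR2(10 CHAR));  -- 10 characters, any byte width

    -- What the Oracle stage's "replace" write method actually creates
    -- for a unicode varchar(10):
    CREATE TABLE example_tbl (name VARCHAR2(40));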

Thanks,
Rob
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

We are on AIX and our numbers are 3 times too big. Is this a bug or a setting problem?
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In server jobs you'll find that display width is double the precision when the data type is NChar or NVarChar.

Internally, DataStage NLS uses its own idiosyncratic UTF-8 encoding of Unicode code points, so as to preserve dynamic array delimiter characters as single-byte characters. This means that Unicode characters within DataStage might occupy one, two, three or even four bytes.
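You can see the character-versus-byte distinction directly in Oracle (a rough sketch; assumes an AL32UTF8 database character set, where a character may occupy up to four bytes):

Code: Select all

    -- LENGTH counts characters, LENGTHB counts bytes
    SELECT LENGTH('héllo')  AS char_len,   -- 5 characters
           LENGTHB('héllo') AS byte_len    -- 6 bytes ('é' is 2 bytes in UTF-8)
    FROM dual;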

I suspect some engineer has spotted that specification and allowed for the maximum, without good reason, imho. I am not aware of any way to "switch it off" other than to edit the table definition immediately after it has been imported (before it's used in a job design).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
roblew
Charter Member
Posts: 123
Joined: Mon Mar 27, 2006 7:32 pm
Location: San Ramon

Post by roblew »

ray.wurlod wrote:...edit the table definition immediately after it has been imported (before it's used in a job design).
This would be a simple work-around (which I had already suggested to our developers), except that all of the jobs use the "replace" write method, so each table is re-created every time the jobs run. With over 3000 jobs, my developers are unwilling to modify them.

Anyway, I have submitted this as a new case to IBM, as I see it as a bug, since there is no way to turn off this "feature".
Last edited by roblew on Fri Feb 23, 2007 11:02 am, edited 1 time in total.
roblew
Charter Member
Posts: 123
Joined: Mon Mar 27, 2006 7:32 pm
Location: San Ramon

Post by roblew »

FYI, this problem has not been resolved. However, after working with IBM on this, we've come to the conclusion that we will have to just deal with it. Evidently, the problem no longer exists after upgrading both the database and the client driver to 10g. To fix the existing problem for 9i, IBM would need to rewrite the Oracle Enterprise stage to use ANSI Dynamic SQL instead of Dynamic SQL, as Oracle recommends.

For our jobs, we have decided to accept that the stage will create columns at four times the intended length. When reading from those tables, we will simply ignore the "truncation may occur" warnings. Other work-arounds are not realistic for our situation, given the number of jobs that would need changes.

I have posted the detailed response from IBM below:
Last edited by roblew on Fri Feb 23, 2007 11:01 am, edited 1 time in total.
roblew
Charter Member
Posts: 123
Joined: Mon Mar 27, 2006 7:32 pm
Location: San Ramon

Post by roblew »

From IBM:
The DataStage Oracle 9i Enterprise Stage has limitations with "ustring" (multibyte characters) because it was originally written using Dynamic SQL instead of ANSI Dynamic SQL, which Oracle recommends for all new applications. Here are some technical references:


Ref: Pro*C/C++ Precompiler Programmer's Guide (Release 9.2, Part No. A97269-01)
The Oracle document above says: "Oracle Dynamic SQL Method 4 does not support object types, result sets, arrays of structs, or LOBs. ANSI SQL does support all data types. Use ANSI SQL for all new applications."

On page 14-7 (ANSI Dynamic SQL), under the heading "Oracle Extensions", we read: support for object types, NCHAR columns, and LOBs.
Table 14-4, "Definitions of Descriptor Item Names for GET DESCRIPTOR", states:
LENGTH: Length of data in the column: in characters for NCHAR; in bytes otherwise. Set by DESCRIBE OUTPUT.
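For reference, a minimal sketch of the ANSI Dynamic SQL pattern described above (statement, descriptor, and host-variable names are illustrative, not the stage's actual code):

Code: Select all

    /* ANSI Dynamic SQL (Pro*C style), per chapter 14 of the guide cited above */
    EXEC SQL ALLOCATE DESCRIPTOR 'outdesc';
    EXEC SQL PREPARE s FROM :sqltext;
    EXEC SQL DESCRIBE OUTPUT s USING SQL DESCRIPTOR 'outdesc';
    /* LENGTH comes back in characters for NCHAR columns, in bytes otherwise */
    EXEC SQL GET DESCRIPTOR 'outdesc' VALUE 1 :collen = LENGTH;
    EXEC SQL DEALLOCATE DESCRIPTOR 'outdesc';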


Setting NLS_LENGTH_SEMANTICS to byte or char has no impact with the Oracle 9i EE Stage, whereas with the 10g stage it controls how column lengths are stored in the Oracle table (char vs. byte).
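As an illustration of that 10g-era choice (a sketch; the parameter can also be set at the instance level):

Code: Select all

    -- Column length interpreted as characters
    ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
    CREATE TABLE t_char (c VARCHAR2(10));   -- holds 10 characters

    -- Column length interpreted as bytes (the default)
    ALTER SESSION SET NLS_LENGTH_SEMANTICS = BYTE;
    CREATE TABLE t_byte (c VARCHAR2(10));   -- holds 10 bytes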

Recommendations:

1) Upgrade to an Oracle 10g environment to take full advantage of the different column length options (byte vs. char)

2) Ignore the warnings about the different column lengths in the jobs that extract data from tables with over-long columns (you can set a Message Handler in PX to automatically suppress these warnings)