Chinese name consuming more space than unicode UTF-8 bytes

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
mchivuku
Participant
Posts: 21
Joined: Sat Feb 04, 2006 7:38 am

Chinese name consuming more space than unicode UTF-8 bytes

Post by mchivuku »

There is a customer_name field in the source .dat file received. This name is in Chinese sometimes. It's unicode length is 35. I have defined this field as varchar(35) with unicode set in a sequential file stage. NLS is set to UTF-8. I am able to read the chinese characters and load it into the DB2 table.

Issue is that whenever the name is in chinese, it consumes more space than expected. Since this file is a fixed width file, everytime chinese names occur all other columns would be shifted to the right, hence loading wrong data to wrong columns.

Job design :

sequential file stage --->tfm---> DB2 API stage

Kindly please provide your suggestions
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Each character may occupy one, two, three or even four bytes. True fixed-width is almost always impossible (at least where the unit of measurement is "bytes"). Is it possible to obtain the data in delimited format?
Last edited by ray.wurlod on Wed Jun 30, 2010 4:29 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mchivuku
Participant
Posts: 21
Joined: Sat Feb 04, 2006 7:38 am

Post by mchivuku »

ray.wurlod wrote:Each character may occupy one, two, three or even four bytes. True fixed-width is almost always impossible (at least where the unit of measurement is "bytes"). Is it possible to obtain the data in d ...
I am actually unable to read the premium content. But I am able to process the file if it is a delimited one. In fact tried with BOM and it worked too.

Just one clarification please, you mentioned unit of measurement is bytes, so is there an option to specify something like unicode characters if not bytes? Is there no other option except to delimit the file?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

No single method - it depends on source. For example, Oracle uses a factor of 3 - you would specify VarChar(105). Try a factor of 2 first - that is, VarChar(70).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mchivuku
Participant
Posts: 21
Joined: Sat Feb 04, 2006 7:38 am

Post by mchivuku »

ray.wurlod wrote:No single method - it depends on source. For example, Oracle uses a factor of 3 - you would specify VarChar(105). Try a factor of 2 first - that is, VarChar(70). ...
Thanks !

I have tried varcha(70),(140) and bigger values too.
Tried nvarchar and nchar too.

Any other options please?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Is it possible to obtain the data in delimited format?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mchivuku
Participant
Posts: 21
Joined: Sat Feb 04, 2006 7:38 am

Post by mchivuku »

ray.wurlod wrote:Is it possible to obtain the data in delimited format? ...
Yes. But that would be a change request to the source system and as well I need to convince them that this solution(fixed length file) would not work :-(
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Well, you're convinced. Convince them.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply