Chinese name consuming more space than unicode UTF-8 bytes

mchivuku · Post by **mchivuku** » Wed Jun 30, 2010 4:01 am

There is a customer_name field in the source .dat file we receive. The name is sometimes in Chinese. Its Unicode length is 35. I have defined this field as varchar(35) with Unicode set in a Sequential File stage. NLS is set to UTF-8. I am able to read the Chinese characters and load them into the DB2 table.

The issue is that whenever the name is in Chinese, it consumes more space than expected. Since this is a fixed-width file, every time a Chinese name occurs all the other columns are shifted to the right, so the wrong data is loaded into the wrong columns.
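The shift described above can be reproduced outside DataStage. The following is a minimal Python sketch, assuming a hypothetical two-column record (a name padded to 35 characters followed by a city) encoded as UTF-8. The function names and sample values are made up for illustration and are not part of the job.

```python
NAME_WIDTH = 35  # field width from the post, counted in characters


def make_record(name: str, city: str) -> bytes:
    """Build a hypothetical record: name padded to 35 characters, then city."""
    return (name.ljust(NAME_WIDTH) + city).encode("utf-8")


def split_by_bytes(record: bytes):
    """Cut the record at byte offset 35, as a byte-oriented reader would."""
    name = record[:NAME_WIDTH].decode("utf-8", errors="replace").rstrip()
    rest = record[NAME_WIDTH:].decode("utf-8", errors="replace")
    return name, rest


def split_by_chars(record: bytes):
    """Decode first, then cut at character offset 35."""
    text = record.decode("utf-8")
    return text[:NAME_WIDTH].rstrip(), text[NAME_WIDTH:]


ascii_record = make_record("John Smith", "Beijing")
chinese_record = make_record("王小明", "Beijing")

print(len(ascii_record), len(chinese_record))  # byte lengths differ
print(split_by_bytes(ascii_record))    # city starts where expected
print(split_by_bytes(chinese_record))  # city is pushed to the right
print(split_by_chars(chinese_record))  # character-based cut lines up
```

Each of the three Chinese characters takes three bytes in UTF-8, so the padded name occupies 41 bytes rather than 35, and a reader that cuts at byte 35 sees the next column start six bytes late.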

Job design :

sequential file stage --->tfm---> DB2 API stage

Kindly provide your suggestions.

ray.wurlod · Post by **ray.wurlod** » Wed Jun 30, 2010 4:28 am

Each character may occupy one, two, three or even four bytes. True fixed-width is almost always impossible (at least where the unit of measurement is "bytes"). Is it possible to obtain the data in delimited format?
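As a quick illustration of the one-to-four-byte range in UTF-8, here is a small Python sketch. The sample characters are arbitrary choices, not taken from the file in question.

```python
# One sample character for each UTF-8 encoded length (arbitrary examples).
samples = [
    "A",           # Latin letter
    "\u00e9",      # e with acute accent
    "\u4e2d",      # a common Chinese character
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
]

# Number of bytes each single character needs when encoded as UTF-8.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    print(f"U+{ord(ch):04X}: 1 character, {n} byte(s) in UTF-8")
```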

mchivuku · Post by **mchivuku** » Wed Jun 30, 2010 6:00 am

ray.wurlod wrote:Each character may occupy one, two, three or even four bytes. True fixed-width is almost always impossible (at least where the unit of measurement is "bytes"). Is it possible to obtain the data in d ...

I am actually unable to read the premium content. But I am able to process the file if it is a delimited one. In fact, I tried it with a BOM and it worked too.

Just one clarification please: you mentioned that the unit of measurement is bytes, so is there an option to specify something like Unicode characters instead of bytes? Is there no other option except to delimit the file?

ray.wurlod · Post by **ray.wurlod** » Wed Jun 30, 2010 4:30 pm

No single method - it depends on source. For example, Oracle uses a factor of 3 - you would specify VarChar(105). Try a factor of 2 first - that is, VarChar(70).
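The arithmetic behind those factors is simply character length times assumed bytes per character. The Python sketch below uses a hypothetical helper, `byte_budget`, and a made-up 35-character name to show how the two factors compare against actual UTF-8 sizes. It assumes the data is UTF-8, as stated in the first post.

```python
CHAR_LENGTH = 35  # declared character length of customer_name


def byte_budget(char_length: int, bytes_per_char: int) -> int:
    """Bytes reserved when sizing a column by a fixed per-character factor."""
    return char_length * bytes_per_char


def fits(value: str, budget: int) -> bool:
    """True if the UTF-8 encoding of value fits in the given byte budget."""
    return len(value.encode("utf-8")) <= budget


# Worst case for this field: 35 Chinese characters (made-up sample value).
full_length_name = "\u4e2d" * CHAR_LENGTH

for factor in (2, 3):
    budget = byte_budget(CHAR_LENGTH, factor)
    print(f"factor {factor}: {budget} bytes, "
          f"fits full-length Chinese name: {fits(full_length_name, budget)}")
```

Under this assumption a factor of 2 covers Chinese names of up to 23 characters, while a full 35-character Chinese name needs the factor of 3. Widening the column only affects storage of the value; it does not change where a fixed-width reader expects the next field to begin.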

mchivuku · Post by **mchivuku** » Wed Jun 30, 2010 9:41 pm

ray.wurlod wrote:No single method - it depends on source. For example, Oracle uses a factor of 3 - you would specify VarChar(105). Try a factor of 2 first - that is, VarChar(70). ...

Thanks!

I have tried varchar(70), varchar(140) and bigger values too.
I tried nvarchar and nchar too.

Any other options please?

ray.wurlod · Post by **ray.wurlod** » Wed Jun 30, 2010 9:49 pm

Is it possible to obtain the data in delimited format?

mchivuku · Post by **mchivuku** » Wed Jun 30, 2010 10:23 pm

ray.wurlod wrote:Is it possible to obtain the data in delimited format? ...

Yes. But that would be a change request to the source system, and I would also need to convince them that this solution (a fixed-length file) will not work.

ray.wurlod · Post by **ray.wurlod** » Thu Jul 01, 2010 12:44 am

Well, you're convinced. Convince them.