Handling DBCS/CJK characters

mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Handling DBCS/CJK characters

Post by mydsworld »

I am having trouble viewing a file with CJK characters. Please let me know the following:

1. I am able to view the Chinese characters in a local text editor. In which mode should I FTP the file to the DS server (ASCII or binary)?

2. The DS environment is NLS enabled. In the DS job I am using a Sequential File stage to read the file. Which NLS map should be used for reading the Chinese data?

Also, please let me know if any other settings need to be made.

Thanks
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

1. What system is the data on? Windows, UNIX? What character set is defined on the system where you can view the data?

2. If you use a binary transfer from your system, then you need to use the same character set in DataStage as on your system.
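
For illustration only, here is a minimal Python sketch of a binary (image-mode) transfer with ftplib; the host name, credentials and paths are placeholders, not anything specific to this thread. A binary transfer leaves the bytes untouched, so whatever code page the file has on the PC is exactly what arrives on the server:

from ftplib import FTP

# Placeholders -- substitute your own DS server, credentials and paths.
ftp = FTP("ds-server.example.com")
ftp.login("dsuser", "password")

# storbinary sends the file byte-for-byte (FTP image mode), so no
# code-page or newline translation happens in transit.
with open("chinese_data.txt", "rb") as fh:
    ftp.storbinary("STOR /data/chinese_data.txt", fh)

ftp.quit()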
mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Post by mydsworld »

I don't know which system the data was generated on. I got the data as an .xls file over e-mail. When I open the .xls, I can view the Chinese characters.

The code page used in the host conversion program is IBM-1386; the code page in Excel is GB 2312.
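
IBM-1386 corresponds to GBK, a superset of GB 2312, so both names point at the same family of Simplified Chinese code pages. As a rough illustration (assuming the exported text file really is GB-encoded; the file name is a placeholder), here is how the same bytes behave under different code pages in Python:

# Placeholder file name; assume it holds the exported Chinese text.
raw = open("chinese_data.txt", "rb").read()

# GBK (the repertoire behind IBM-1386) should decode cleanly if the file
# was written in that code page...
print(raw.decode("gbk", errors="replace")[:80])

# ...whereas forcing UTF-8 onto the same bytes yields replacement
# characters, which is roughly what '???' in a viewer corresponds to.
print(raw.decode("utf-8", errors="replace")[:80])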
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

So you can view the data on your PC correctly. What is your PC character set, and how are you copying the file to your UNIX system? Is it a binary FTP? If so, use your PC's character set definition on the UNIX machine.
mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Post by mydsworld »

I am sending the file in binary mode over FTP to the DS server. When I view the file in the Sequential File stage, I see the double-byte characters as '???' etc. instead of the Chinese characters.

Please advise.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Almost every single NLS thread here on DSXChange that deals with transformation or mapping problems has at least one post explicitly saying not to use "view data" from the Designer to detect or check multibyte characters. This thread is no longer an exception. Use your favourite editor or tool that you know works with DBCS to see if the characters are correct.
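
If no DBCS-capable editor is available on the server side, even a byte-level look will tell you whether the transfer kept the data intact. A minimal Python sketch (the path is a placeholder):

# Dump the first bytes of the transferred file in hex.
# GBK text shows high lead bytes (0x81-0xFE) paired with bytes from 0x40
# upward; UTF-16 text starts with a BOM (FF FE or FE FF) and contains
# many 0x00 bytes; UTF-8 Chinese shows 3-byte sequences starting 0xE4-0xE9.
raw = open("/data/chinese_data.txt", "rb").read(64)
print(" ".join(f"{b:02X}" for b in raw))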
mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Post by mydsworld »

With UltraEdit I am able to see the double-byte characters in the remote DS file, so I assume they are there and that, due to some 'unknown' issue, 'view data' will not show them.

My job design is like this:

Seq File -> Transformer -> DB2 API

I am using the Transformer just to map the file fields to the DB2 table, but I do not find the DB2 table populated with the multi-byte values.

In the Sequential File stage I have defined the fields as 'Varchar' with the extended (Unicode) property set, and I have set the stage to use the NLS map 'UTF-8'. The DB2 API stage is also set to the NLS map 'UTF-8'.

I am getting the following warning:

APT_CombinedOperatorController,0: Invalid character(s) ([xAC]) found converting string (code point(s): [x00][x17]S[xAC]N[xAE][x90]?e[x1F][x90][x12][x90]@\ [x00] [x00] [x00]) from codepage UTF-8 to Unicode, substituting.
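
That warning is what you would expect when the stage is told the data is UTF-8 but the bytes on disk are something else (the embedded [x00] bytes even look more like UTF-16 than GB data). A purely illustrative Python sketch of the same mismatch:

text = u"\u4e2d\u6587"                  # the two characters for "Chinese"

utf16_bytes = text.encode("utf-16-le")  # e.g. a file saved as 'Unicode'
gbk_bytes   = text.encode("gbk")        # e.g. an IBM-1386 / GB-style export

# Decoding either of these as UTF-8 fails and falls back to substitution
# characters -- the same behaviour the DataStage warning describes.
print(utf16_bytes.decode("utf-8", errors="replace"))
print(gbk_bytes.decode("utf-8", errors="replace"))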
mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Post by mydsworld »

A couple of other observations:

1. I am able to insert Chinese characters into the DB2 table (from Toad).

2. The DS job populates the DB2 table, but when I view the data in Toad, it is not Chinese.

3. Also, what NLS map should I choose for each stage in the job:

Seq File -> Transformer -> DB2 API
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

First, don't look at the remote file; look at the file on the UNIX box after the transfer. If the characters are still correct, then you have removed a possible error source. The Sequential File read stage will use the project default NLS setting. Assuming this is "UTF-8", it will read this file as if it were UTF-8 (which it isn't), and there you have your source of error. You will need to set the NLS attributes of the stage to the correct character set of the data.
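
One quick way to find the right character set is to try a few candidate encodings against the file as it sits on the UNIX box; whichever one decodes without errors and shows readable Chinese is the map to set on the stage. A rough Python sketch with a placeholder path (gb18030 covers GBK/IBM-1386-style data, utf-16 covers files saved as 'Unicode' on Windows):

raw = open("/data/chinese_data.txt", "rb").read()

for codec in ("utf-8", "gb18030", "utf-16", "big5"):
    try:
        sample = raw.decode(codec)
    except UnicodeDecodeError:
        print(f"{codec:10s} -> does not decode")
        continue
    print(f"{codec:10s} -> {sample[:40]!r}")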
mydsworld
Participant
Posts: 321
Joined: Thu Sep 07, 2006 3:55 am

Post by mydsworld »

Thanks for your advice.

So, how do I determine the character set of the source file? I created the source file manually by copying a few records with Chinese characters from a master file and then saving it with 'Unicode' encoding.

Also, for the DB2 API target stage, how do I know which character set (one that will accept Chinese) it needs?
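
Note that saving a file as 'Unicode' in Windows editors usually means UTF-16 little-endian with a byte-order mark, not UTF-8, which by itself would explain why a UTF-8 map mangles it. A quick, illustrative check in Python (placeholder path):

with open("/data/chinese_data.txt", "rb") as fh:
    head = fh.read(4)

if head.startswith(b"\xff\xfe"):
    print("UTF-16 LE BOM -> the file is not UTF-8; use a UTF-16 style map")
elif head.startswith(b"\xfe\xff"):
    print("UTF-16 BE BOM")
elif head.startswith(b"\xef\xbb\xbf"):
    print("UTF-8 with BOM")
else:
    print("No BOM -- possibly a single/double-byte code page such as GBK")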
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Earlier you indicated that the data was "IBM-1386", which would be Simplified Chinese. Why not try using "ibm-1386_P100-2002" in your Sequential File stage -> NLS Map settings?
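
If changing the stage map is awkward, another option (outside DataStage) is to pre-convert the file to UTF-8 before the job reads it, so that the existing UTF-8 map becomes correct. A minimal Python sketch, assuming the source really is GB-encoded; the paths are placeholders:

# Read the GB-encoded file and rewrite it as UTF-8.  gb18030 is used here
# because it is a superset of GB 2312 / GBK (the IBM-1386 repertoire).
with open("/data/chinese_data.txt", "rb") as src:
    text = src.read().decode("gb18030")

with open("/data/chinese_data_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)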