XML-Input Stage: wrong UTF-8 encoding

stivazzi
Participant
Posts: 52
Joined: Tue May 02, 2006 3:53 am

Post by stivazzi »

Hi All,
My Server job reads an XML file containing correctly UTF-8 encoded characters (e.g. the trademark symbol is correctly encoded as the 3 bytes e284a2). After the XML-Input stage, which extracts only some XPaths rather than all of the data, I found that the characters are no longer encoded correctly (the same TM symbol is transformed into the single byte '1a'). For this test I used the editor fhred to inspect the encoded data.
I've also tried setting a user variable called "NLS_LANG" as a job parameter, with the value 'American_America.WE8ISO8859P1', 'AMERICAN_AMERICA.UTF8' or 'AMERICAN_AMERICA.WE8MSWIN1252', but the XML stage does not seem to take this variable into account.

Any help will be appreciated!

Thanks,
Andrea
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The NLS_LANG approach is the correct one. How exactly did you try to 'set it' in your job? You should be setting the value in the Administrator as a User Defined Environment variable, adding it as a parameter to your job and then overriding the default value there.
-craig

"You can never have too many knives" -- Logan Nine Fingers
stivazzi
Participant
Posts: 52
Joined: Tue May 02, 2006 3:53 am

Post by stivazzi »

chulett,
I set up NLS_LANG as a user-defined variable and added it to my Server job. The problem seems to be that the data is not interpreted correctly after the XML-Input stage:
seqFile1(xml)-->XML-Input-->Transformer-->seqFile2

In seqFile1 the TM symbol is correctly encoded with 3 bytes.
In seqFile2 the same symbol is wrongly encoded with only 1 byte.
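
For readers who want to repeat this comparison without a hex editor, the following is a minimal Python sketch. It counts the UTF-8 trademark sequence (e2 84 a2) and the single byte 1a in each file. The function names are made up for illustration, and the file paths are whatever you pass on the command line; nothing here is specific to DataStage.

```python
import sys

# UTF-8 encoding of U+2122 (trademark sign): the three bytes e2 84 a2.
TM_UTF8 = "\u2122".encode("utf-8")
# ASCII SUB control character, the single byte reported after the stage.
SUB = b"\x1a"


def count_markers(data: bytes) -> dict:
    """Count intact UTF-8 trademark signs and SUB bytes in raw data."""
    return {"tm_utf8": data.count(TM_UTF8), "sub": data.count(SUB)}


def check_file(path: str) -> dict:
    """Read a file in binary mode so no decoding alters the bytes."""
    with open(path, "rb") as fh:
        return count_markers(fh.read())


if __name__ == "__main__":
    # File names come from the command line; none are assumed here.
    for path in sys.argv[1:]:
        print(path, check_file(path))
```

Run against the input and output files of the job, an intact file would be expected to show a non-zero `tm_utf8` count and no `sub` bytes, while a file damaged as described above would show the reverse.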

Thanks,
Andrea
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I've found that I also need to set $LC_CTYPE to C.utf8 to get this to work for me on my server.
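
To confirm which values a process actually sees, a small Python sketch like the one below can print the relevant variables. It relies on the standard POSIX rule that LC_ALL, when set, overrides LC_CTYPE, which in turn overrides LANG. The function names are hypothetical, and how you launch the script in the job's environment (for example from a before-job subroutine) is an assumption, not something confirmed in this thread.

```python
import locale
import os

# Variables named in this thread plus the standard POSIX locale variables.
WATCHED = ("NLS_LANG", "LC_ALL", "LC_CTYPE", "LANG")


def locale_report(env) -> dict:
    """Return the watched variables, with None for any that are unset."""
    return {name: env.get(name) for name in WATCHED}


def effective_ctype(env) -> str:
    """Resolve LC_CTYPE using POSIX precedence: LC_ALL, LC_CTYPE, LANG."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty values are skipped
            return value
    # Nothing set: the default is implementation-defined, commonly "C".
    return "C"


if __name__ == "__main__":
    for name, value in locale_report(os.environ).items():
        print(f"{name}={value}")
    print("effective LC_CTYPE:", effective_ctype(os.environ))
    print("preferred encoding:", locale.getpreferredencoding(False))
```

If LC_ALL turns out to be set on the server, it would mask any LC_CTYPE override, which is worth ruling out before looking elsewhere.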
-craig

"You can never have too many knives" -- Logan Nine Fingers