
Problem in converting UTF8 Character set to ASCII

Posted: Thu Nov 18, 2004 10:40 pm
by dhletl
Hi,

I am facing a problem with the UTF-8 character set.

I have an input file in UTF-8; when reading that file with a Sequential File stage I set the NLS map to UTF-8.
The job also contains a Join stage and a Transformer stage.
Finally, I write the output to a sequential file in ASCII mode.
The file obtained does seem to be ASCII (checked from UNIX); however, it still contains a few junk characters.

Can you help me resolve this problem?

Thanks
Nitin

Posted: Fri Nov 19, 2004 1:12 am
by ray.wurlod
NLS is not really intended for translation of character sets, except from the various external coding schemes (such as UTF-8, GB2312, BIG5, SHIFT-JIS and so on) to and from DataStage's internal character set, which is an idiosyncratic encoding (called UV-UTF8) of Unicode code points; UV-UTF8 preserves dynamic array delimiter characters 0xF8 through 0xFF as single-byte representations.

The fact that ASCII (or ISO8859, which is a superset of ASCII) and UTF-8 are so close means that most of the characters work with what you are doing. Can you identify which characters are not being properly mapped, and what the actual "junk characters" are? Knowing this may help in diagnosing what's happening.

Posted: Fri Nov 19, 2004 1:52 am
by dhletl
Ray,

The junk characters coming out (in the ASCII file) look something like "^Z".

Essentially, I need to read a UTF-8 file as the source in one of my jobs.
In all intermediate / temporary staging after that I want to stick to the ASCII character set, and I need to generate the final file (after all processing) in UTF-8.
Any pointers on this?

Thanks and Regards,
Nitin

Posted: Fri Nov 19, 2004 6:36 am
by Eric
You need to find the hex or octal code of the junk character in the UTF-8 file. You can then determine whether it is an ASCII character or not.
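
For what it's worth, here is a minimal sketch of that check in Python (outside DataStage). The function names `find_suspect_bytes` and `summarize` and the sample bytes are made up for illustration; the only assumption is that tab, LF and CR count as ordinary text and everything else outside printable ASCII is worth reporting.

```python
from collections import Counter


def find_suspect_bytes(data: bytes):
    """Return (offset, byte value) pairs for bytes that are not plain text ASCII.

    Tab, LF and CR are treated as ordinary text. Other control bytes,
    DEL (0x7F) and every byte >= 0x80 are reported.
    """
    allowed_controls = {0x09, 0x0A, 0x0D}
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if not (0x20 <= value <= 0x7E or value in allowed_controls)
    ]


def summarize(data: bytes):
    """Count how often each suspect byte value occurs, keyed by hex string."""
    return Counter(f"0x{value:02X}" for _, value in find_suspect_bytes(data))


# Invented sample: one Ctrl-Z byte and one two-byte UTF-8 sequence.
sample = b"ab\x1acd\xc3\xa9\n"
for offset, value in find_suspect_bytes(sample):
    print(f"offset {offset}: hex 0x{value:02X}, octal {value:03o}")
print(dict(summarize(sample)))
```

To run it against a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in. A byte of 0x1A is what tools such as `cat -v` display as "^Z"; bytes of 0x80 and above are not ASCII at all.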

Posted: Fri Nov 19, 2004 7:21 am
by Mike
I think your "^Z" is probably a carriage return (CR). Usually when I see this on a UNIX box, it is because someone did a binary mode transfer of an ASCII file from Windows to UNIX. Line terminators on Windows are CRLF. On UNIX the line terminator is just a LF. Fix it by transferring the file in ASCII mode. In a server job, you could alternatively change properties to "DOS style" line terminators (don't know if this is an option for Parallel jobs though).

Mike

Posted: Fri Nov 19, 2004 3:42 pm
by ray.wurlod
Specifically, Ctrl-Z is the end-of-file marker in DOS. So this is a likely candidate if the data originally came from a Windows system.

It's also a possibility that "UTF-8" on Windows and "UTF-8" on your UNIX aren't exactly the same; there are quite a few UTF-8 encodings out there. You can learn about them from the Unicode Consortium web site; search for "UTF-8".

Posted: Fri Nov 19, 2004 5:37 pm
by Mike
Thanks for the clarification, Ray. I just realized that I confused the "^Z" with "^M" (which would appear at the end of every line if it were a line termination problem).

Mike
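
A rough way to tell the two cases apart, sketched in Python under the assumption that the file fits in memory: "^M" is byte 0x0D and would show up at the end of every line, while "^Z" is byte 0x1A and, as a DOS end-of-file marker, would typically appear once at the very end. The function name `describe_dos_artifacts` and the sample data are hypothetical.

```python
def describe_dos_artifacts(data: bytes) -> dict:
    """Report signs of DOS line terminators (CR LF) and Ctrl-Z bytes."""
    pieces = data.split(b"\n")
    # Whatever follows the final LF is not a terminated line, so leave it out.
    terminated = pieces[:-1]
    return {
        "lines": len(terminated),
        "lines_ending_in_cr": sum(1 for line in terminated if line.endswith(b"\r")),
        "ctrl_z_count": data.count(b"\x1a"),
        # True when the last byte, ignoring trailing CR/LF, is Ctrl-Z.
        "ctrl_z_at_end_of_file": data.rstrip(b"\r\n").endswith(b"\x1a"),
    }


# Invented sample: two CR LF terminated lines followed by a Ctrl-Z marker.
dos_sample = b"a,b\r\nc,d\r\n\x1a"
print(describe_dos_artifacts(dos_sample))
```

If `lines_ending_in_cr` equals `lines`, the file most likely has DOS line terminators; a single Ctrl-Z at the end of the file points to the DOS end-of-file marker instead.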