
Converting UTF8 to ASCII file and back to UTF8

Posted: Fri Jun 11, 2004 11:37 am
by premreddyb
Hi,

I have a requirement where I need to convert a UTF-8 file to an ASCII file and then convert it back to UTF-8.

Could anyone please explain how to do this using DataStage?

Regards
BRP

Posted: Fri Jun 11, 2004 12:17 pm
by 1stpoint
This can be nicely done by writing a Python script to handle the decoding and encoding of the data. DataStage is a data migration tool and is not really designed for this type of encoding/decoding. By writing a script in Python you can ensure that it is platform neutral.
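As a minimal sketch of such a script (modern Python shown for clarity; the escaping scheme is one possible choice, not the only one): non-ASCII characters are escaped to `\uXXXX` sequences so the intermediate file is pure ASCII and the original UTF-8 text can be recovered exactly.

```python
# Lossless UTF-8 -> ASCII -> UTF-8 round trip.
# Non-ASCII characters are escaped (e.g. \u00e9) so they can be restored later.

def utf8_to_ascii(text: str) -> bytes:
    # Escape every non-ASCII character; the result contains only ASCII bytes.
    return text.encode("unicode_escape")

def ascii_to_utf8(data: bytes) -> str:
    # Reverse the escaping to recover the original Unicode text.
    return data.decode("unicode_escape")

original = "price: 100\u00a5"          # contains a non-ASCII yen sign
ascii_form = utf8_to_ascii(original)   # pure ASCII, safe for ASCII-only tools
restored = ascii_to_utf8(ascii_form)   # identical to the original text
```

Note that `unicode_escape` also escapes literal backslashes on the way out, so the round trip stays lossless even for text that already contains `\u`-like sequences.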

See:
http://www.opendocspublishing.com/pyqt/x2183.htm

and

http://pydoc.org/2.1/encodings.utf_8.html

Best of luck.

Posted: Fri Jun 11, 2004 3:09 pm
by premreddyb
Hi,
Can you please explain how I would integrate scripts with DataStage?
Do I need to run separate scripts to convert the files and then use the converted files in my DataStage jobs?

Regards
Prem


1stpoint wrote:This can be nicely done by writing a Python script to handle the decoding and encoding of the data. DataStage is a data migration tool and is not really designed for this type of encoding/decoding. By writing a script in Python you can ensure that it is platform neutral.

See:
http://www.opendocspublishing.com/pyqt/x2183.htm

and

http://pydoc.org/2.1/encodings.utf_8.html

Best of luck.

Posted: Fri Jun 11, 2004 7:39 pm
by ray.wurlod
Do you have NLS (National Language Support) enabled in DataStage?
If so you can use mapping on the inputs and outputs. Internally, if NLS is enabled, DataStage uses an idiosyncratic UTF-8 encoding of Unicode.

Posted: Mon Jun 14, 2004 4:07 am
by jwhyman
There is no need to convert from ASCII to UTF-8: by definition, ASCII is invariant under UTF-8. The code points 0x00-0x7F are encoded as the single bytes 0x00-0x7F. That invariance is one of the main reasons UTF-8 is used.
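This invariance is easy to verify in Python: for pure-ASCII text, the ASCII and UTF-8 encodings produce byte-for-byte identical output.

```python
text = "Hello, DataStage!"  # pure ASCII text

# For code points 0x00-0x7F, UTF-8 emits the same single bytes as ASCII.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
assert ascii_bytes == utf8_bytes  # identical byte sequences
```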

UTF8

Posted: Thu Jun 17, 2004 12:23 pm
by premreddyb
If I use the UTF8 NLS map with a file that contains Japanese characters as input, then in my output EBCDIC file the characters are replaced by "?" symbols.

Regards,
Prem
ray.wurlod wrote:Do you have NLS (National Language Support) enabled in DataStage?
If so you can use mapping on the inputs and outputs. Internally, if NLS is enabled, DataStage uses an idiosyncratic UTF-8 encoding of Unicode.

Posted: Thu Jun 17, 2004 4:36 pm
by ray.wurlod
If you have Japanese characters in the input you will need the correct Japanese map to translate them when reading the file. There are many different encodings of Japanese characters; sometimes we even find that different columns are encoded differently, or that the map changes during a data stream (triggered by shift-in/shift-out characters).

There is no guarantee that using a different map when writing will magically "translate" the characters into a different encoding. Not all characters are represented in every encoding.

DataStage is not intended as a translation tool.
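The point about unrepresentable characters can be demonstrated directly in Python. Encoding Japanese text into an EBCDIC code page that has no Japanese repertoire substitutes each character with "?" - the same symptom reported above. (cp500, EBCDIC International, is used here purely as an illustration; the actual target code page may differ.)

```python
text = "\u6771\u4eac"  # "Tokyo" in Japanese characters

# cp500 is a single-byte EBCDIC code page with no Japanese repertoire,
# so every Japanese character is replaced by "?" during encoding.
encoded = text.encode("cp500", errors="replace")
assert encoded.decode("cp500") == "??"
```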

Posted: Thu Jun 17, 2004 4:42 pm
by premreddyb
Hi Ray,

Could you please pass on some examples if you have.

Regards,
Prem



ray.wurlod wrote:If you have Japanese characters in the input you will need the correct Japanese map to translate them when reading the file. There are many different encodings of Japanese characters; sometimes we even find that different columns are encoded differently, or that the map changes during a data stream (triggered by shift-in/shift-out characters).

There is no guarantee that using a different map when writing will magically "translate" the characters into a different encoding. Not all characters are represented in every encoding.

DataStage is not intended as a translation tool.

Posted: Thu Jun 17, 2004 4:55 pm
by ray.wurlod
I have none. I am not working with Japanese data at the moment.
You might like to ask Ascential support - through your support provider, of course.

It is always a problem to be certain about how Japanese data are encoded. It is rare that the data owner knows for sure. Take a look at the drop-down list of possible mappings to see what I mean.

solution

Posted: Fri Jun 18, 2004 6:49 am
by 1stpoint
We have had this problem in the past, and Python will accurately encode and decode the Japanese UTF-8 characters. This is done in a pre-load process, called either from a batch file or a UNIX shell script. The link above contains a working UTF-8 conversion program and shows how to implement it.
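A rough sketch of such a pre-load step (the script name, arguments, and encodings are illustrative; modern Python shown): the script re-encodes a file before DataStage reads it, and can be invoked from a batch file or shell script.

```python
#!/usr/bin/env python3
"""Hypothetical pre-load conversion step, run before the DataStage job.

Usage: convert.py <infile> <outfile> <src_encoding> <dst_encoding>
"""
import sys

def convert(infile, outfile, src, dst):
    # Read the file in its source encoding...
    with open(infile, "r", encoding=src) as f:
        text = f.read()
    # ...and write it out in the target encoding; characters the target
    # cannot represent are replaced rather than raising an error.
    with open(outfile, "w", encoding=dst, errors="replace") as f:
        f.write(text)

if __name__ == "__main__" and len(sys.argv) >= 5:
    convert(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
```

From a UNIX shell script this might be called as `convert.py input.txt converted.txt shift_jis utf-8` before the DataStage job picks up the converted file.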