Converting UTF8 to ASCII file and back to UTF8

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

premreddyb
Participant
Posts: 6
Joined: Thu Jun 10, 2004 3:32 pm

Converting UTF8 to ASCII file and back to UTF8

Post by premreddyb »

Hi,

I have a requirement where I need to convert a UTF-8 file to ASCII and then convert it back to UTF-8.

Could anyone please help me with how to do this using DataStage.

Regards
BRP
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Post by 1stpoint »

This can be nicely done by writing a Python script to handle the decoding and encoding of the data. DataStage is a data migration tool and is not really designed for this type of encoding/decoding. By writing a script in Python you can ensure that it is platform neutral.

See:
http://www.opendocspublishing.com/pyqt/x2183.htm

and

http://pydoc.org/2.1/encodings.utf_8.html
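
Something along these lines would do it (an untested sketch; the file names and the backslashreplace escape scheme are just one choice, and note the caveat about literal backslashes):

    # utf8_roundtrip.py -- sketch of a UTF-8 -> ASCII -> UTF-8 round trip.
    # Non-ASCII characters are written out as backslash escapes so they can
    # be restored later; any other reversible escape scheme would also work.

    def utf8_to_ascii(src_path, dst_path):
        with open(src_path, encoding="utf-8") as src:
            text = src.read()
        # backslashreplace turns e.g. the character for "sun" into the
        # seven ASCII bytes "\u65e5"
        with open(dst_path, "wb") as dst:
            dst.write(text.encode("ascii", errors="backslashreplace"))

    def ascii_to_utf8(src_path, dst_path):
        with open(src_path, "rb") as src:
            data = src.read()
        # unicode_escape undoes the backslash escapes written above.
        # Caveat: literal backslashes in the source text are NOT escaped by
        # backslashreplace, so a production job needs a stricter scheme.
        text = data.decode("unicode_escape")
        with open(dst_path, "w", encoding="utf-8") as dst:
            dst.write(text)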

Best of luck.
premreddyb
Participant
Posts: 6
Joined: Thu Jun 10, 2004 3:32 pm

Post by premreddyb »

Hi,
Can you please explain how I integrate scripts with DataStage?
Do I need to run separate scripts to do the conversion, and then use the converted files in my DataStage jobs?

Regards
Prem


1stpoint wrote:This can be nicely done by writing a Python script to handle the decoding and encoding of the data. DataStage is a data migration tool and is not really designed for this type of encoding/decoding. By writing a script in Python you can ensure that it is platform neutral.

See:
http://www.opendocspublishing.com/pyqt/x2183.htm

and

http://pydoc.org/2.1/encodings.utf_8.html

Best of luck.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Do you have NLS (National Language Support) enabled in DataStage?
If so you can use mapping on the inputs and outputs. Internally, if NLS is enabled, DataStage uses an idiosyncratic UTF-8 encoding of Unicode.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwhyman
Premium Member
Posts: 13
Joined: Fri Apr 09, 2004 2:18 am

Post by jwhyman »

There is no need to convert from ASCII to UTF-8: by definition, ASCII is invariant under UTF-8. The code points 0x00-0x7F are encoded as the single bytes 0x00-0x7F. This is precisely why UTF-8 is so widely used.
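
You can verify this in a couple of lines of Python:

    # Any pure-ASCII byte sequence is already valid UTF-8, byte for byte.
    data = b"Hello, DataStage!"        # bytes 0x00-0x7F only
    assert data.decode("ascii") == data.decode("utf-8")
    assert "Hello, DataStage!".encode("utf-8") == data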
premreddyb
Participant
Posts: 6
Joined: Thu Jun 10, 2004 3:32 pm

UTF8

Post by premreddyb »

If I use the UTF8 NLS map with a file that contains Japanese characters as input, then in my output EBCDIC file those characters are replaced by "?" symbols.

Regards,
Prem
ray.wurlod wrote:Do you have NLS (National Language Support) enabled in DataStage?
If so you can use mapping on the inputs and outputs. Internally, if NLS is enabled, DataStage uses an idiosyncratic UTF-8 encoding of Unicode.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you have Japanese characters in the input you will need the correct Japanese map to translate them when reading the file. There are many different encodings of Japanese characters; sometimes we even find that different columns are encoded differently, or that the map changes during a data stream (triggered by shift-in/shift-out characters).

There is no guarantee that using a different map when writing will magically "translate" the characters into a different encoding. Not all characters are represented in every encoding.
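
Here is a quick Python illustration of what is happening (cp500 merely stands in for whichever EBCDIC code page the target actually uses):

    # Japanese characters have no representation in a single-byte EBCDIC
    # code page, so a lossy conversion has to substitute something.
    text = "abc" + "\u65e5\u672c\u8a9e"               # "abc" + Japanese
    ebcdic = text.encode("cp500", errors="replace")   # Japanese becomes "?"
    print(ebcdic.decode("cp500"))                     # -> abc???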

DataStage is not intended as a translation tool.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
premreddyb
Participant
Posts: 6
Joined: Thu Jun 10, 2004 3:32 pm

Post by premreddyb »

Hi Ray,

Could you please pass on some examples if you have.

Regards,
Prem



ray.wurlod wrote:If you have Japanese characters in the input you will need the correct Japanese map to translate them when reading the file. There are many different encodings of Japanese characters; sometimes we even find that different columns are encoded differently, or that the map changes during a data stream (triggered by shift-in/shift-out characters).

There is no guarantee that using a different map when writing will magically "translate" the characters into a different encoding. Not all characters are represented in every encoding.

DataStage is not intended as a translation tool.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I have none. I am not working with Japanese data at the moment.
You might like to ask Ascential support - through your support provider, of course.

It is always a problem to be certain about how Japanese data are encoded. It is rare that the data owner knows for sure. Take a look at the drop-down list of possible mappings to see what I mean.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

solution

Post by 1stpoint »

We have had this problem in the past, and Python will accurately encode and decode the Japanese UTF-8 characters. This is done in a pre-load process called by either a batch file or a Unix shell script. The link above actually has a working UTF-8 conversion program and shows how to implement it.
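
For example, a sketch of a command-line wrapper (the script and module names here are made up; it could be invoked from the staging shell script, or from a before-job ExecSH call):

    # preload_convert.py -- hypothetical pre-load wrapper; call it as
    #   python preload_convert.py to-ascii in.utf8  out.ascii
    #   python preload_convert.py to-utf8  in.ascii out.utf8
    import sys
    from utf8_roundtrip import utf8_to_ascii, ascii_to_utf8  # sketch module above

    if __name__ == "__main__":
        if len(sys.argv) != 4:
            sys.exit("usage: preload_convert.py to-ascii|to-utf8 SRC DST")
        mode, src, dst = sys.argv[1:]
        if mode == "to-ascii":
            utf8_to_ascii(src, dst)
        elif mode == "to-utf8":
            ascii_to_utf8(src, dst)
        else:
            sys.exit("unknown mode: " + mode)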