Character Translator for data files

saadmirza
Participant
Posts: 76
Joined: Tue Mar 29, 2005 2:57 am

Post by saadmirza »

Hi All,
Can anyone tell me whether Ascential has a tool that would translate any script (Japanese, Thai, Chinese, Arabic, French, Spanish, etc.) into the English character set?
I have a requirement where files will come in from different countries around the world, and we need to build a warehouse in English.
Please advise how we can proceed.

Thanks in advance,
SM
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

saadmirza,

The built-in functionality is called 'NLS', and there are hundreds of posts in this forum alone discussing aspects of it.

DataStage will only translate at a character set level, but there are published conversion rules on the internet that explain what the normal latin alphabet representations for all of your incoming data are. You can buy add-on products that will do this.

The really expensive add-ons will also translate words, doing things like translating "Ginko" into "Ginko Tree" when the input is French but translating it into "Bank" when the source is declared as Japanese. DataStage doesn't do anything like this.

If you only need to store the information then use Unicode - this can simultaneously store data in most character sets of living languages (and many dead ones, as well).
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Be very careful.

DataStage does NOT do translations. DataStage performs character mapping (it's not the same thing). The character set in which external data are encoded - for example GB2312, Shift-JIS, PC936 - is mapped to DataStage's UTF encoding of Unicode on the way in to DataStage, and back to the same external character set on the way out of DataStage.

DataStage cannot be used for translation, either between languages or even - reliably - between different external character sets.
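To make the mapping-versus-translation distinction concrete, here is a minimal sketch in Python rather than DataStage. It only illustrates the idea and is not how NLS is implemented. The function names map_in, map_out and can_map are hypothetical, and the sample strings are my own. Bytes in an external character set are mapped to Unicode on the way in and back to the same character set on the way out; trying to map to a different character set fails as soon as a character has no equivalent there.

```python
def map_in(raw: bytes, charset: str) -> str:
    """Map external bytes to Unicode text (the 'way in')."""
    return raw.decode(charset)


def map_out(text: str, charset: str) -> bytes:
    """Map Unicode text back to an external character set (the 'way out')."""
    return text.encode(charset)


def can_map(text: str, charset: str) -> bool:
    """Return True if every character in text exists in the target charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Round trip through the same character set: the bytes come back unchanged.
japanese_bytes = "日本語".encode("shift_jis")  # stand-in for an incoming file
unicode_text = map_in(japanese_bytes, "shift_jis")
round_trip = map_out(unicode_text, "shift_jis")

# Mapping between different external character sets is not reliable:
# Shift-JIS has no Thai letters, so this Thai greeting cannot be mapped to it.
thai_ok = can_map("สวัสดี", "shift_jis")
```

Note that nothing here changes the meaning or the language of the data; the Japanese text is still Japanese after the round trip.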

The CIA did once develop machine translation. Some of the early attempts were amusing, and they still haven't got idioms. For example, the English saying "out of sight, out of mind" was translated into Russian then back into English - the best way to test these things. The result was "invisible idiot".

Do follow up on the add-ons Arnd suggested, but the better you want the translation to be, the (many) more dollars you're going to have to spend.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Just after posting that I had another thought. DataStage does do one form of translation - localization. There are no text messages in DataStage code; everything has a code. That code is decoded using a "resource" file, for example DS_RESENU for English (US), DS_RESJPN for Japanese. These are hashed files.

You could do something similar with finite translations, such as product names. For example, you could create a hashed file whose key is the Arabic product name and whose other field is the equivalent English product name to be used in the data warehouse.

I'd recommend using separate hashed files for separate languages. While it is possible to have hashed files with a different NLS character mapping in every column, they are cumbersome and inefficient.
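As a rough sketch of the lookup idea, again in Python rather than DataStage: plain dictionaries stand in for the hashed files, one per source language, keyed by the source-language product name. The product names, language codes and the to_english function are made up for illustration.

```python
from typing import Optional

# One lookup table per source language (stand-ins for separate hashed files).
# Key: product name in the source language; value: English name for the warehouse.
PRODUCT_LOOKUPS = {
    "ar": {"قهوة": "Coffee", "شاي": "Tea"},
    "fr": {"café": "Coffee", "thé": "Tea"},
}


def to_english(language: str, product_name: str) -> Optional[str]:
    """Return the English product name, or None when there is no entry."""
    return PRODUCT_LOOKUPS.get(language, {}).get(product_name)


arabic_tea = to_english("ar", "شاي")
french_coffee = to_english("fr", "café")
unknown = to_english("fr", "chocolat")  # not in the table, so no translation
```

A missing entry comes back as None rather than a guess, so untranslated names can be rejected or routed for manual review instead of being loaded silently.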

Of course, this becomes more difficult with non-finite lists, such as name and address data. That's where you have to start spending money. How important is it to translate Homer Simpson and Bart Simpson into Omar Shamshoon and Badr Shamshoon?

Transliteration is also fraught with problems; if you translate the Chinese word into English characters, do you mean "mother" or "horse" - it's probably important to get this right, but transliteration does not take tonality into account.
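A small Python sketch of how the tone information gets lost. The example words are my own choice: 妈 (mother) is written mā in pinyin and 马 (horse) is mǎ. The strip_tones function is hypothetical; it uses Unicode decomposition to drop the tone marks, which is roughly what reducing pinyin to plain English characters does.

```python
import unicodedata


def strip_tones(pinyin: str) -> str:
    """Remove combining marks (including tone marks) from a pinyin string."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    # Keep only the base characters; every combining mark is dropped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


mother = strip_tones("mā")  # 妈, first tone
horse = strip_tones("mǎ")   # 马, third tone
collision = mother == horse  # both words are now indistinguishable
```

Once the marks are gone, both words are stored as the same plain string, and nothing in the warehouse can tell them apart again.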