
Handling non-English characters that are part of a string

Posted: Mon Feb 04, 2013 1:27 am
by virtusadsuser
Hello,
We are processing fields containing non-English (European) characters through DataStage 8.0, with both the database and DataStage set to UTF-8. The data is finally loaded into SAP.

One open challenge we have is applying transformations to fields that contain non-English characters. There is a source field "NAME" which is 40 bytes, but the target stores it in two fields, NAME1 and NAME2:
NAME1 is of length 25
NAME2 is of length 25
Since these non-English characters occupy more than 1 byte each, when we apply the transformation to split the value into the two fields, we get junk characters in NAME2.

Is there a way to check whether a character is non-English, and then move that character into the second half of the name if it falls on the 25th position?

One approach could be to iterate through each character, check via its hex code whether it belongs to the extended character set, and note its index; then move the remaining characters, together with any non-English character occurring at the 25th position of the source field, into NAME2, because that character actually occupies more than one byte.
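The idea above can be sketched outside DataStage. This is a minimal Python illustration (the real job would use a Transformer routine, and the sample name is invented), showing a split of the raw UTF-8 bytes at a character boundary no later than byte 25, so a multibyte character that would straddle the boundary moves wholly into NAME2:

```python
def split_utf8(raw: bytes, limit: int = 25):
    """Split UTF-8 bytes into (head, tail) without cutting a
    multibyte character in half: a character that straddles
    `limit` is moved wholly into the tail (the NAME2 half)."""
    if len(raw) <= limit:
        return raw, b""
    cut = limit
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx;
    # back up until the cut lands on the first byte of a character.
    while cut > 0 and (raw[cut] & 0xC0) == 0x80:
        cut -= 1
    return raw[:cut], raw[cut:]

# Byte 24 starts a 2-byte character ("Ö"), so byte 25 is mid-character;
# the split backs up and "Östberg" goes entirely into the second field.
raw = "Anna-Lena Schwarzenberg Östberg".encode("utf-8")
name1, name2 = split_utf8(raw)
print(name1.decode("utf-8"))  # both halves decode cleanly, no junk
print(name2.decode("utf-8"))
```

The same boundary test (is this byte a UTF-8 continuation byte?) is what an equivalent Transformer routine would need to perform before cutting at position 25.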

Can anyone provide some thoughts on this?

Thanks,
Sreeja R

Re: Handling non-English characters that are part of a string

Posted: Mon Feb 04, 2013 2:58 am
by srinivas.nettalam
You can search the forum for Convert or Double Convert to get an idea of how similar issues have been resolved. However, I can't speak to the part of your question about non-English characters occupying more than one byte.

Posted: Mon Feb 04, 2013 7:46 am
by chulett
I'd also point out that when you say "NAME1 is of length 25", you could very well mean 25 bytes rather than 25 characters. Oracle, for example, typically defaults to byte semantics; however, you haven't mentioned the actual database here.

Posted: Mon Feb 04, 2013 10:55 am
by eph
Hi,

Your string contains 43 characters but is 45 bytes long (in UTF-8, characters are encoded on 1 to 4 bytes depending on their position in the character set; Western European accented characters typically take 2). It seems that DataStage is counting bytes, whereas your database counts characters.
Maybe Sybase has a command similar to Oracle's DUMP, which can show the real size of a string.

Eric

Posted: Mon Feb 04, 2013 11:02 am
by virtusadsuser
Thanks Eric
But DataStage should process strings as characters, not bytes, right? If we set NLS to UTF-16 and there are characters that occupy more than 2 bytes, will DataStage parse the string as characters or as bytes?
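On the "more than 2 bytes in UTF-16" point: characters in the Basic Multilingual Plane take 2 bytes in UTF-16, while characters outside it (some rare CJK ideographs, musical symbols, emoji) are encoded as a 4-byte surrogate pair. A small Python sketch comparing the two encodings:

```python
# Latin-1 supplement, BMP currency sign, and a non-BMP character (U+1D11E)
for ch in ("é", "€", "𝄞"):
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
# é 2 2
# € 3 2
# 𝄞 4 4
```

So European accented characters never exceed 2 bytes in UTF-16, but a tool that counts UTF-16 code units would still see the non-BMP character as two units, which is why counting characters rather than bytes or code units is the only encoding-independent measure.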

Also, does this mean our NLS setting of UTF-8 is not configured correctly?

We want DataStage to process strings the same way the database does. Is that possible?

Regards,
Sreeja R