Handling non-English characters that are part of a string

virtusadsuser
Premium Member
Posts: 16
Joined: Thu Jan 12, 2012 2:10 am
Location: India

Post by virtusadsuser »

Hello,
We are processing fields containing non-English (European) characters through DataStage 8.0, with both the DB and DS set to UTF-8. The data is ultimately loaded into SAP.

One open challenge is applying transformations to fields that contain non-English characters. The source field "NAME" is 40 bytes long, but the target stores it in two fields, NAME1 and NAME2:
NAME1 is of length 25
NAME2 is of length 25
Because these non-English characters occupy more than one byte each, when we apply the transformation to split the value across the two fields, we get junk characters in NAME2.
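
To illustrate the symptom described above, here is a minimal Python sketch (not DataStage code) of what a split by byte position does to a UTF-8 value. The sample value is hypothetical; only the 25-byte limit comes from the post.

```python
# Hypothetical sample: 24 ASCII letters, then "é" (U+00E9), then 10 more letters.
# In UTF-8, "é" is two bytes (0xC3 0xA9), so it straddles bytes 25 and 26.
name = "X" * 24 + "\u00e9" + "Y" * 10

raw = name.encode("utf-8")

# Naive split at byte 25: the two bytes of "é" land in different fields.
name1_bytes = raw[:25]
name2_bytes = raw[25:]

# Neither half is valid UTF-8 on its own; decoding with errors="replace"
# turns each broken fragment into U+FFFD, which is the "junk" seen downstream.
name1_bad = name1_bytes.decode("utf-8", errors="replace")
name2_bad = name2_bytes.decode("utf-8", errors="replace")

print(len(name), len(raw))   # character count vs. byte count
print(repr(name1_bad))
print(repr(name2_bad))
```

The point of the sketch is only that the damage comes from cutting in the middle of a multi-byte sequence, not from the characters themselves.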

Is there a way to check whether a character is non-English and, if it falls at the 25th position, include it in the second half of the name?

One approach could be to iterate through each character, check its hex code to see whether it belongs to the extended character set, and record its position; then, if a non-English character falls at the 25th position of the source field, move it along with the rest of the characters to NAME2, because that character actually occupies more than one byte.
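
The idea above can be sketched without a per-character lookup table, because UTF-8 marks continuation bytes with the bit pattern 10xxxxxx. The following is a Python illustration only, not a DataStage Transformer expression; the function name `split_utf8` and its parameters are hypothetical.

```python
from typing import Tuple


def split_utf8(text: str, max_bytes: int = 25) -> Tuple[str, str]:
    """Split text so the first part fits in max_bytes of UTF-8
    without cutting through a multi-byte character."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text, ""

    cut = max_bytes
    # raw[cut] is the first byte of the second part. If it is a continuation
    # byte (10xxxxxx), the cut is inside a character, so back up to the
    # lead byte and let the whole character move to the second part.
    while cut > 0 and (raw[cut] & 0xC0) == 0x80:
        cut -= 1

    return raw[:cut].decode("utf-8"), raw[cut:].decode("utf-8")


# Hypothetical sample with "é" (U+00E9) straddling bytes 25 and 26.
name1, name2 = split_utf8("X" * 24 + "\u00e9" + "Y" * 10)
print(repr(name1), repr(name2))
```

Note that the sketch does not check whether the remainder fits into the second field; that would need a separate length check against the NAME2 limit.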

Can anyone provide some thoughts on this?

Thanks,
Sreeja R
Dream...Dare...Do
srinivas.nettalam
Participant
Posts: 134
Joined: Tue Jun 15, 2010 2:10 am
Location: Bangalore

Re: Handling non-English characters that are part of a string

Post by srinivas.nettalam »

You can search for Convert or Double Convert to get an idea of how similar issues have been resolved. However, I am not familiar with the part of your question about non-English characters occupying more than 1 byte.
N.Srinivas
India.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd also point out that when you say "NAME1 is of length 25", you could very well mean 25 bytes rather than 25 characters. Oracle, for example, is typically set to "byte semantics"; however, you haven't mentioned the actual DB here.
-craig

"You can never have too many knives" -- Logan Nine Fingers
eph
Premium Member
Posts: 110
Joined: Mon Oct 18, 2010 10:25 am

Post by eph »

Hi,

Your string contains 43 characters but is 45 bytes long (in UTF-8, a character is encoded in 1 to 4 bytes depending on its code point; accented European letters typically take 2). It seems that DS is counting bytes, whereas your DB counts characters.
Maybe Sybase has a command similar to Oracle's DUMP, which can show the real size of the string.
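
As a quick way to see the character/byte gap outside the database, a small Python sketch can report both lengths for a value. The helper name `char_and_byte_length` and the sample string are assumptions for illustration; they are not the 43-character value discussed in this thread.

```python
def char_and_byte_length(text):
    """Return (number of characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical sample: "Crème brûlée", written with escapes so the
# accented letters are unambiguous single code points.
sample = "Cr\u00e8me br\u00fbl\u00e9e"

chars, nbytes = char_and_byte_length(sample)
# Each of the three accented letters costs one extra byte in UTF-8.
print(chars, nbytes)
```

A field sized "25" holds very different amounts of such text depending on whether the 25 is counted in characters or in bytes.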

Eric
virtusadsuser
Premium Member
Posts: 16
Joined: Thu Jan 12, 2012 2:10 am
Location: India

Post by virtusadsuser »

Thanks Eric
But DataStage should process strings as characters and not bytes, right? If we set the NLS to UTF-16 and there are characters that occupy more than 2 bytes, will DataStage parse a string as characters or as bytes?

Also, does this mean the NLS setting of UTF-8 is not set correctly?

We want DataStage to process strings the same way the DB does. Is that possible?
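
On the UTF-16 part of the question, the encoding itself can be checked independently of DataStage. This Python sketch (the helper name `utf16_units` is hypothetical) shows that accented European letters take one 16-bit unit in UTF-16, while characters outside the Basic Multilingual Plane take two, so a unit or byte count can still differ from a character count. It says nothing about how DataStage counts internally.

```python
def utf16_units(text):
    """Return the number of 16-bit code units text needs in UTF-16."""
    # "utf-16-le" is used so that no byte-order mark is added to the output.
    return len(text.encode("utf-16-le")) // 2


e_acute = "\u00e9"       # "é", inside the Basic Multilingual Plane
emoji = "\U0001F600"     # a character outside the BMP (needs a surrogate pair)

print(len(e_acute), utf16_units(e_acute))
print(len(emoji), utf16_units(emoji))
```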

Regards,
Sreeja R
Dream...Dare...Do