Non-English Characters in Fixed-length Char field

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

OK.

The "Reply to topic" link is at the top and bottom of every page. When you are reading this, look down a little bit.

And that's why I said to increase it to something larger like 100 and see what ends up in the field.
-craig

"You can never have too many knives" -- Logan Nine Fingers
william.eller@ed.gov
Participant
Posts: 19
Joined: Fri Aug 03, 2012 11:06 am

Re: How to reply and workaround

Post by william.eller@ed.gov »

I checked with other teams at my installation - they've all had the same issue. They did the research and tried multiple combinations of file types/mappings/extensions to no avail. The workaround was/is to use a "C" program to read each file, rewriting all recognizable/printable characters to one output file and all others to another, thus stripping out the non-readable characters. I will use this approach.
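
(Purely as an illustration, a minimal sketch of that kind of filter in C: it assumes single-byte data, treats anything outside printable ASCII plus whitespace as "unreadable", and the file arguments and the isprint()/isspace() test are assumptions, not details of the program actually used here.)

Code:

/* Split one input file into a "clean" file of printable characters and a
   "rejects" file holding everything else. */
#include <stdio.h>
#include <ctype.h>

int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s input clean rejects\n", argv[0]);
        return 1;
    }

    FILE *in     = fopen(argv[1], "rb");
    FILE *clean  = fopen(argv[2], "wb");
    FILE *reject = fopen(argv[3], "wb");
    if (!in || !clean || !reject) {
        perror("fopen");
        return 1;
    }

    int c;
    while ((c = fgetc(in)) != EOF) {
        /* Printable ASCII and whitespace go to the clean file; every other
           byte is diverted to the rejects file. */
        if (isprint(c) || isspace(c))
            fputc(c, clean);
        else
            fputc(c, reject);
    }

    fclose(in);
    fclose(clean);
    fclose(reject);
    return 0;
}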

Thanks, all, ever so much.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You strip out the so-called "unreadable" characters? Throw away the client's data? :shock: That would be a big no-no here. Are you sure you don't want to fix this instead?
-craig

"You can never have too many knives" -- Logan Nine Fingers
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

Our experience is that stripping the characters out isn't the best solution. These characters come from "non-US" keyboards used in other countries to enter critical data (names, addresses, company names) containing special characters with diacritical marks. Stripping the characters out just causes frustration: the users see missing characters in names and addresses and, assuming it is a typo, go and put the characters back in.

A better solution is to map the characters to various "anglicized" alternatives without the special diacritical marks. Almost all of them have alternatives that keep the spelling roughly the same. This "cleanses" the data while letting the users know it wasn't a typo - it's just a limitation of the database.

With that said, we use a similar approach with either C or UNIX commands to process the stream of data and replace (instead of remove) the "bad" characters.
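
(Again purely for illustration, a minimal sketch of the "replace rather than remove" idea in C, assuming the incoming data is single-byte Latin-1; the mapping table below is a small, made-up sample, not the one actually used at this site.)

Code:

/* Filter stdin to stdout, swapping a handful of Latin-1 accented letters for
   plain-ASCII stand-ins so names keep roughly the same spelling. */
#include <stdio.h>

static int anglicize(int c)
{
    switch (c) {
    case 0xC0: case 0xC1: case 0xC2: case 0xC3: case 0xC4: case 0xC5: return 'A';
    case 0xE0: case 0xE1: case 0xE2: case 0xE3: case 0xE4: case 0xE5: return 'a';
    case 0xC8: case 0xC9: case 0xCA: case 0xCB: return 'E';
    case 0xE8: case 0xE9: case 0xEA: case 0xEB: return 'e';
    case 0xC7: return 'C';
    case 0xE7: return 'c';
    case 0xD1: return 'N';
    case 0xF1: return 'n';
    case 0xD6: return 'O';
    case 0xF6: return 'o';
    case 0xDC: return 'U';
    case 0xFC: return 'u';
    default:   return c;   /* leave everything else untouched */
    }
}

int main(void)
{
    int c;
    while ((c = getchar()) != EOF)
        putchar(anglicize(c));
    return 0;
}

Something like that can sit as a filter in the pipeline before the load, e.g. anglicize < input.dat > cleaned.dat, so the downstream job only ever sees plain ASCII.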
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020