How can I remove all special characters?

igorbmartins · Post by **igorbmartins** » Mon Sep 13, 2010 6:36 am

Friends, I have a text file that I need to clean before loading it into a table. I need to remove all special characters. Is there a way to do this without mapping character by character?

ArndW · Post by **ArndW** » Mon Sep 13, 2010 6:39 am

The answer depends heavily on what you consider to be special characters.

mhester · Post by **mhester** » Mon Sep 13, 2010 6:47 am

This is not something I would do within a particular job flow, but rather outside of the job. If you want to do it within a job, I would create a parallel routine (pretty easy) and invoke it as a function in the transform.

If you want to handle it outside the job with something like sed or perl, you could do it in a sequencer, using the command stage to invoke the following command:

Code:

sed "s/[^a-z|0-9]//g;" file1 > file2

You would also have to define what counts as a "special" character.
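As a rough sketch of why that definition matters, the two filters below differ only in their whitelist. The function names and the sample line are made up for illustration, and `LC_ALL=C` is an assumption so that the ranges are compared in plain byte order.

```shell
# strip_strict: the pattern from the command above. Inside a bracket
# expression "|" is a literal pipe, not alternation, so this keeps
# lowercase letters, digits and "|" and deletes everything else.
strip_strict() {
    LC_ALL=C sed 's/[^a-z|0-9]//g'
}

# strip_loose: a hypothetical wider whitelist that also keeps
# uppercase letters and spaces.
strip_loose() {
    LC_ALL=C sed 's/[^A-Za-z0-9| ]//g'
}

# Hypothetical sample line, not taken from the thread.
printf '%s\n' 'Hello, World|42' | strip_strict   # prints: elloorld|42
printf '%s\n' 'Hello, World|42' | strip_loose    # prints: Hello World|42
```

The strict version drops uppercase letters, spaces and punctuation as well, which may or may not be what you want from a "clean" file.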

ArndW · Post by **ArndW** » Mon Sep 13, 2010 6:50 am

I would not exit out of DataStage and use sed unless there were no simple way to do this within the tool. Which approach fits depends on your definition of "special". The base function used for this kind of work is CONVERT().

mhester · Post by **mhester** » Mon Sep 13, 2010 6:55 am

It is not going "outside" of DataStage. It is using the tools on the palette in the sequencer and is pretty standard stuff.

If the OP wants to get rid of all unprintables, then something like what I posted will work. A simple C program (of which there are hundreds on the web) will work too, and can be called from "within" DataStage.

igorbmartins · Post by **igorbmartins** » Mon Sep 13, 2010 7:18 am

The special characters are decimal 1 through 31, plus decimal 127. You can see these characters at the following link: http://en.wikipedia.org/wiki/ASCII

The list follows:
| Binary | Oct | Dec | Hex | Abbr | Symbol | Caret | C escape | Description |
|---|---|---|---|---|---|---|---|---|
| 000 0000 | 0 | 0 | 0 | NUL | ␀ | `^@` | `\0` | Null character |
| 000 0001 | 1 | 1 | 1 | SOH | ␁ | `^A` | | Start of Header |
| 000 0010 | 2 | 2 | 2 | STX | ␂ | `^B` | | Start of Text |
| 000 0011 | 3 | 3 | 3 | ETX | ␃ | `^C` | | End of Text |
| 000 0100 | 4 | 4 | 4 | EOT | ␄ | `^D` | | End of Transmission |
| 000 0101 | 5 | 5 | 5 | ENQ | ␅ | `^E` | | Enquiry |
| 000 0110 | 6 | 6 | 6 | ACK | ␆ | `^F` | | Acknowledgment |
| 000 0111 | 7 | 7 | 7 | BEL | ␇ | `^G` | `\a` | Bell |
| 000 1000 | 10 | 8 | 8 | BS | ␈ | `^H` | `\b` | Backspace |
| 000 1001 | 11 | 9 | 9 | HT | ␉ | `^I` | `\t` | Horizontal Tab |
| 000 1010 | 12 | 10 | 0A | LF | ␊ | `^J` | `\n` | Line feed |
| 000 1011 | 13 | 11 | 0B | VT | ␋ | `^K` | `\v` | Vertical Tab |
| 000 1100 | 14 | 12 | 0C | FF | ␌ | `^L` | `\f` | Form feed |
| 000 1101 | 15 | 13 | 0D | CR | ␍ | `^M` | `\r` | Carriage return |
| 000 1110 | 16 | 14 | 0E | SO | ␎ | `^N` | | Shift Out |
| 000 1111 | 17 | 15 | 0F | SI | ␏ | `^O` | | Shift In |
| 001 0000 | 20 | 16 | 10 | DLE | ␐ | `^P` | | Data Link Escape |
| 001 0001 | 21 | 17 | 11 | DC1 | ␑ | `^Q` | | Device Control 1 (oft. XON) |
| 001 0010 | 22 | 18 | 12 | DC2 | ␒ | `^R` | | Device Control 2 |
| 001 0011 | 23 | 19 | 13 | DC3 | ␓ | `^S` | | Device Control 3 (oft. XOFF) |
| 001 0100 | 24 | 20 | 14 | DC4 | ␔ | `^T` | | Device Control 4 |
| 001 0101 | 25 | 21 | 15 | NAK | ␕ | `^U` | | Negative Acknowledgement |
| 001 0110 | 26 | 22 | 16 | SYN | ␖ | `^V` | | Synchronous Idle |
| 001 0111 | 27 | 23 | 17 | ETB | ␗ | `^W` | | End of Transmission Block |
| 001 1000 | 30 | 24 | 18 | CAN | ␘ | `^X` | | Cancel |
| 001 1001 | 31 | 25 | 19 | EM | ␙ | `^Y` | | End of Medium |
| 001 1010 | 32 | 26 | 1A | SUB | ␚ | `^Z` | | Substitute |
| 001 1011 | 33 | 27 | 1B | ESC | ␛ | `^[` | `\e` | Escape |
| 001 1100 | 34 | 28 | 1C | FS | ␜ | `^\` | | File Separator |
| 001 1101 | 35 | 29 | 1D | GS | ␝ | `^]` | | Group Separator |
| 001 1110 | 36 | 30 | 1E | RS | ␞ | `^^` | | Record Separator |
| 001 1111 | 37 | 31 | 1F | US | ␟ | `^_` | | Unit Separator |
| 111 1111 | 177 | 127 | 7F | DEL | ␡ | `^?` | | Delete |
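For reference, that range can be written directly as octal escapes for `tr`. This is only a sketch under some assumptions: `strip_controls` is a made-up name, `LC_ALL=C` is set so bytes are compared as plain bytes, and LF (decimal 10, octal 012) is deliberately kept, because deleting it would join every record of the file into one line. If you really want the full 1 to 31 range, the set would be `'\001-\037\177'` instead.

```shell
# strip_controls: delete bytes 1-9, 11-31 and 127 from standard input.
# LF (octal 012) is left alone so the file keeps its line structure.
# NUL (decimal 0) is not in the requested range and is not handled here.
strip_controls() {
    LC_ALL=C tr -d '\001-\011\013-\037\177'
}

# Hypothetical usage; file1 and file2 are placeholder names:
#   strip_controls < file1 > file2
```

Note that this also deletes tabs (decimal 9) and carriage returns (decimal 13), since both fall inside the requested range.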

ArndW · Post by **ArndW** » Mon Sep 13, 2010 8:12 am

It would seem that you are in single-byte mode, i.e. no UTF-8 or extended Latin character set, which makes things easier.

The sequential file stage allows you to specify a filter, so you could indeed use the following sed command in the sequential file stage filter condition:

Code:

sed "s/[^ -~]//g;"
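The idea of the filter is to keep only the printable range from the space character (decimal 32) through the tilde (decimal 126), so the bracket expression has to begin with a literal space. A minimal sketch for trying it outside the stage first, assuming `LC_ALL=C` so the range is evaluated in byte order; `keep_printable` is a made-up name for illustration:

```shell
# keep_printable: delete every byte outside space (32) .. tilde (126).
# sed works line by line, so the newline ending each record is kept.
keep_printable() {
    LC_ALL=C sed 's/[^ -~]//g'
}

# Hypothetical check: SOH (octal 001), a tab and DEL (octal 177) are dropped.
printf 'a\001b\tc\177 d~\n' | keep_printable   # prints: abc d~
```

Because the whitelist starts at the space character, tabs and carriage returns are removed too, and so is any byte above 127.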