Help with using sed, awk, nawk or tr

mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

Here's the situation. I have an input file containing rows of data that look something like the following -

Code:

[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980
Each field is separated by a "|" delimiter and contains a bracketed set of coordinates (not essential) followed by the data. Without using UV BASIC, I want to remove the coordinates from every field of every incoming row, so that the row above would be output as follows -

Code:

3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980
I would like to do this via unix with either the sed, awk, nawk or tr command or whatever would work.

Any ideas?

Thanks,
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Michael,

Are you sure you won't relent and use a DS/Basic function to do this? It would only be 9 lines long...

Code:

   StringLen = LEN(Arg1)
   Ans = ''
   Skip = 0
   FOR i = 1 to StringLen
      CurrentChar = Arg1[i,1]
      IF CurrentChar = '[' THEN Skip = 1 
      ELSE IF CurrentChar=']' THEN Skip = 0 
      ELSE IF NOT(Skip) THEN Ans := CurrentChar
   NEXT i
[But I am no good at awk and would welcome learning that method]
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I scratched my head and except for a few fleas no memories of switches on given commands came to mind.

The only solution I could think of is a .ksh script that parses each row: loop from 1 to x over the "|"-delimited fields in the line, use cut -d"|" -fx to extract each field, then cut -d"]" -f2- to take everything after the first ], append that to a variable, and output the rebuilt line at the end of the loop. That will be dog slow, as cut re-parses the line from the beginning on each pass, and spawning the commands is slow anyway.
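
For reference, the loop described above might be sketched roughly as follows. This is only an illustration under assumptions: strip_line is a made-up helper name, the field count is taken from the number of "|" characters, and -f2- is used so that any later ] in the data is kept.

```shell
#!/bin/sh
# strip_line is a hypothetical helper: it rebuilds one row field by field.
strip_line() {
    line=$1
    # Number of fields = number of "|" characters + 1.
    nfields=$(( $(printf '%s' "$line" | tr -cd '|' | wc -c) + 1 ))
    out=
    i=1
    while [ "$i" -le "$nfields" ]; do
        # cut re-reads the whole line for every field, hence the slowness.
        field=$(printf '%s\n' "$line" | cut -d'|' -f"$i" | cut -d']' -f2-)
        if [ "$i" -eq 1 ]; then
            out=$field
        else
            out="$out|$field"
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

strip_line '[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness'
# prints: 3009|502|PI Svc Timeliness
```

Each field costs two cut processes, so a file with many rows and columns would take noticeably longer than a single pass with sed or awk.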

Have fun.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
mhester

Post by mhester »

Ken and Arnd,

Thanks!

Both are solid solutions. I had hoped to do it via one of the commands I listed but I do understand that this may not be possible with a simple command.

Thanks again
mhester

Post by mhester »

I really thought this question would have elicited a response from the Duke-a-nator!

Come on Kim..... give me the Unix one-line command answer :-)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's do-able with awk, but not in a single line - you'd need an awk script to loop through the arbitrary (or even fixed) number of sets of coordinates. So you may as well go with any kind of script.
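
For comparison, awk's gsub function can do the looping implicitly, since it replaces every match in the record. A minimal sketch, assuming an awk that provides gsub (nawk or any POSIX awk) and that coordinate-like text never appears inside the real data; strip_coords_awk is a made-up wrapper name:

```shell
# strip_coords_awk is a hypothetical wrapper; it filters stdin to stdout.
strip_coords_awk() {
    # gsub deletes every [digits,digits] group in the current line ($0).
    awk '{ gsub(/\[[0-9]+,[0-9]+\]/, ""); print }'
}

printf '%s\n' '[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness' | strip_coords_awk
# prints: 3009|502|PI Svc Timeliness
```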
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mhester

Post by mhester »

I implemented Arnd's solution (thanks Arnd!) and it works wonderfully. I just wanted to broaden my knowledge and do it in a way that I am not so familiar with.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I would do it like Ken. BASIC is so much easier though. If you did this in a shell script you would need to parse one field at a time using cut -d'|' -fx, where x runs to the end of the line.

I think if you were clever then Perl would work, because you want everything (*) between a ] and the next [ or |. Sed and awk can do the same.
Mamu Kim
djm
Participant
Posts: 68
Joined: Wed Mar 02, 2005 3:42 am
Location: N.Z.

Post by djm »

On the presumption that the coordinate syntax does not appear within the useful data, e.g. you don't have a field something like
...|[1,2]blah blah [3,5]blah|...
see whether the following achieves the desired result:
sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g' yourfile > newfile
Post the success or otherwise.
David
mhester

Post by mhester »

D,

Your presumption is correct and your solution worked wonderfully! - Thanks :-)

The following rows of data -

Code:

aa|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
aa|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
bb|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
cc|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ff|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
Now look like -

Code:

aa|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
aa|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
bb|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
cc|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ff|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
Which is what I wanted.

Thanks again!
ray.wurlod

Post by ray.wurlod »

You obviously need a certain kind of mind to do that kind of sed stuff!

David, it might be nice to explain what the sed script

Code:

's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
is doing, for those who haven't used the utility before. Particularly the need for the escape characters ("\"), the repeater characters ("*") and the global specifier ("g") (those may not be correct terminology).
djm

Post by djm »

Ray, I guess you mean other than "man sed" and "man 5 regexp"?

Basically the bit between the quotes is a command to sed (stream-editor). A breakdown of the particular command issued is as follows:
s/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g

s = substitute command
/ = delimiter for different "arguments" for the substitute command
\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\] = specifies what to match (see below)
/ = argument delimiter
\1 = what the matching string should be replaced with (see below)
/ = argument delimiter
g = identifies that the substitution should be applied to every matching string in the line rather than just the first
Now the breakdown of the what to match and replacement are:
\( = Identifies the start of something that you want substitute to remember. The \ character "escapes" the ( so that the expression doesn't try to match a (.
|* = match the | character zero or more times (the * does the zero or more times).
\) = the end of the bit you want substitute to remember
\[ = Match the [ character. The [ has to be escaped as [ has special meaning within a regular expression (see next line).
[0-9] = Match exactly one character from the range 0 to 9.
[0-9]* = Match a character from the range 0 to 9 zero or more times, so the pair [0-9][0-9]* means one or more digits.
, = match the comma character
[0-9] = as above
[0-9]* = as above
\] = Match the ] character. The escape mirrors the \[; a ] outside a bracket expression is normally literal anyway, so the backslash here is a precaution rather than a requirement.

For the replacement pattern
\1 = Replace the matching string with the first bit that the substitute command was told to remember, i.e. the zero or more occurrences of the | character.
So in a more readable form, it says: find the | (if there is one) preceding the bracket-enclosed coordinates and replace the | and coordinates with the | character.
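
To see the remembered group and the g flag at work, the same command can be tried on a shortened sample row with and without the flag. This is just an illustrative run; the variable name line is arbitrary.

```shell
line='[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness'

# With g: every coordinate group on the line is removed, each | is put back via \1.
printf '%s\n' "$line" | sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
# prints: 3009|502|PI Svc Timeliness

# Without g: only the first match (the leading [1,0]) is replaced.
printf '%s\n' "$line" | sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/'
# prints: 3009|[1,1]502|[1,2]PI Svc Timeliness
```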

On reflection, the sed command should have more robustly been expressed as
sed -e 's/^\[[0-9]\{1,\},[0-9]\{1,\}\]//g' -e 's/\(|\)\[[0-9]\{1,\},[0-9]\{1,\}\]/\1/g' yourfile > newfile
I'll leave this as an exercise for people to work out what that does!
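
As a hint for that exercise, here is a hedged sketch of the two-expression form run on a row that contains a coordinate-like pattern inside the data. It is written with one -e per expression, which sed needs in order to accept two scripts; strip_anchored is a made-up wrapper name and the sample row is invented for illustration.

```shell
# strip_anchored is a hypothetical wrapper; it filters stdin to stdout.
strip_anchored() {
    # 1st expression: drop coordinates at the very start of the line.
    # 2nd expression: drop coordinates only when they directly follow a |.
    sed -e 's/^\[[0-9]\{1,\},[0-9]\{1,\}\]//' \
        -e 's/\(|\)\[[0-9]\{1,\},[0-9]\{1,\}\]/\1/g'
}

printf '%s\n' '[1,0]3009|[1,2]blah blah [3,5]blah|[1,3]2005' | strip_anchored
# prints: 3009|blah blah [3,5]blah|2005
```

The embedded [3,5] survives because it follows a space rather than a | or the start of the line.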

David
jzparad
Charter Member
Posts: 151
Joined: Thu Apr 01, 2004 9:37 pm

Post by jzparad »

Or conversely, rather than saving the data you want and writing it out, simply get rid of the data you don't want.


Code:

$ sed 's/\[[0-9]*,[0-9]*\]//g' infile
3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980
Jim Paradies
djm

Post by djm »

Yes, though searching only for coordinates prefixed by the | reduces the likelihood that a coordinate-like pattern embedded in the data, which is meant to be there, gets discarded.

D
jzparad

Post by jzparad »

You would be right if it were not for the fact that '|*' means zero or more. Therefore, it would still get discarded.

Code:

$ echo "[1,0]3009|[1,1]502"|sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
3009|502
$ echo "[1,0]3009|[1,1][1,1]"|sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
3009|
Jim Paradies