Help with using sed, awk, nawk or tr

Posted: Thu Jan 05, 2006 3:20 pm
by mhester
Here's the situation..... I have an input file which contains rows of data that look something like the following -

[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980
Each field is separated by a "|" delimiter. Each field contains a set of coordinates (not essential) and data. Without using UV BASIC I want to remove all coordinate data from each field of every row of incoming data so the above row when output would look like the following -

3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980
I would like to do this via unix with either the sed, awk, nawk or tr command or whatever would work.

Any ideas?

Thanks,

Posted: Thu Jan 05, 2006 4:20 pm
by ArndW
Michael,

are you sure you won't relent and use a DS/Basic function to do this? It would only be 9 lines long...

   StringLen = LEN(Arg1)
   Ans = ''
   Skip = 0
   FOR i = 1 to StringLen
      CurrentChar = Arg1[i,1]
      IF CurrentChar = '[' THEN Skip = 1 
      ELSE IF CurrentChar=']' THEN Skip = 0 
      ELSE IF NOT(Skip) THEN Ans := CurrentChar
   NEXT i
[But I am no good at awk and would welcome learning that method]

Posted: Thu Jan 05, 2006 4:35 pm
by kcbland
I scratched my head and except for a few fleas no memories of switches on given commands came to mind.

The only solution I could think of is a .ksh script to parse each row, loop thru the count of "|" found in each line 1 to x, use cut -d"|" -fx to extract each field and then use cut -d"]" -f2 to take everything after the first ], and concat to a variable and output line at end of loop. That will be dog slow, as cut re-parses the line from beginning on each loop, and the commands are slow anyway.
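
For what it's worth, a rough sketch of that loop might look like the following. The function name strip_coords_slow is made up for illustration, and the sketch assumes the coordinates always sit at the start of a field, so that everything after the first ] is the data.

```shell
# Hypothetical helper: reads rows on stdin, writes cleaned rows on stdout.
strip_coords_slow() {
    while IFS= read -r line; do
        # Number of fields = number of "|" delimiters + 1.
        ndelims=$(printf '%s' "$line" | tr -cd '|' | wc -c)
        nfields=$((ndelims + 1))
        out=""
        i=1
        while [ "$i" -le "$nfields" ]; do
            # Extract field i, then keep everything after the first "]".
            # cut passes a field through unchanged when it contains no "]".
            value=$(printf '%s\n' "$line" | cut -d'|' -f"$i" | cut -d']' -f2-)
            if [ "$i" -eq 1 ]; then
                out="$value"
            else
                out="$out|$value"
            fi
            i=$((i + 1))
        done
        printf '%s\n' "$out"
    done
}
```

It would be called as strip_coords_slow < yourfile > newfile. Every field costs two cut processes, which is where the slowness comes from.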

Have fun.

Posted: Thu Jan 05, 2006 4:42 pm
by mhester
Ken and Arnd,

Thanks!

Both are solid solutions. I had hoped to do it via one of the commands I listed but I do understand that this may not be possible with a simple command.

Thanks again

Posted: Thu Jan 05, 2006 4:52 pm
by mhester
I really thought this question would have elicited a response from the Duke-a-nator!

Come on Kim..... give me the Unix one-line command answer :-)

Posted: Thu Jan 05, 2006 5:29 pm
by ray.wurlod
It's do-able with awk, but not in a single line - you'd need an awk script to loop through the arbitrary (or even fixed) number of sets of coordinates. So you may as well go with any kind of script.

Posted: Thu Jan 05, 2006 5:37 pm
by mhester
I implemented Arnd's solution (thanks Arnd!) and it works wonderfully. I just wanted to broaden my knowledge and do it in a way that I am not so familiar with.

Posted: Thu Jan 05, 2006 8:40 pm
by kduke
I would do it like Ken. BASIC is so much easier though. If you did this in a shell script you would need to parse one field at a time using cut -d'|' -fx where x goes to the end of the line.

I think if you were clever then Perl would work because you want * between ] and [ or |. Sed and awk can do the same.
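
As a rough illustration of the awk route, something like the following should work in nawk or any POSIX awk (the older awk on some systems may lack gsub). The function name strip_coords_awk is just for illustration, and the sketch assumes nothing in the real data looks like [digits,digits].

```shell
# Hypothetical helper: delete every [digits,digits] group from each line.
strip_coords_awk() {
    awk '{ gsub(/\[[0-9]+,[0-9]+\]/, ""); print }' "$@"
}

printf '%s\n' 'aa|[1,0]3009|[1,11]980|' | strip_coords_awk
# aa|3009|980|
```

gsub replaces every match on the line, so no explicit loop over the fields is needed.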

Posted: Thu Jan 05, 2006 11:03 pm
by djm
On the presumption that the co-ordinates syntax does not appear within the useful data e.g. you don't have a field something like
...|[1,2]blah blah [3,5]blah|...
see whether the following achieves the desired result.
sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g' yourfile > newfile
Post back with the result, success or otherwise.
David

Posted: Fri Jan 06, 2006 8:26 am
by mhester
D,

Your presumption is correct and your solution worked wonderfully! - Thanks :-)

The following rows of data -

aa|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
aa|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
bb|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
cc|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ff|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
Now look like -

aa|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
aa|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
bb|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
cc|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ff|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
Which is what I wanted.

Thanks again!

Posted: Fri Jan 06, 2006 4:42 pm
by ray.wurlod
You obviously need a certain kind of mind to do that kind of sed stuff!

David, it might be nice to explain what the sed script

's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
is doing, for those who haven't used the utility before. Particularly the need for the escape characters ("\"), the repeater characters ("*") and the global specifier ("g") (those may not be correct terminology).

Posted: Fri Jan 06, 2006 5:31 pm
by djm
Ray, I guess you mean other than "man sed" and "man 5 regexp"?

Basically the bit between the quotes is a command to sed (stream-editor). A breakdown of the particular command issued is as follows:
s/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g

s = substitute command
/ = delimiter for different "arguments" for the substitute command
\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\] = specifies what to match (see below)
/ = argument delimiter
\1 = what the matching string should be replaced with (see below)
/ = argument delimiter
g = identifies that the substitution should be applied to every matching string in the line rather than just the first
Now the breakdown of what to match and of the replacement:
\( = Identifies the start of something that you want substitute to remember. The \ character "escapes" the ( so that the expression doesn't try to match a (.
|* = match the | character zero or more times (the * does the zero or more times).
\) = the end of the bit you want substitute to remember
\[ = Match the [ character. The [ has to be escaped as [ has special meaning within a regular expression (see next line).
[0-9] = Match exactly one character in the range 0 to 9.
[0-9]* = Match zero or more characters in the range 0 to 9. Together with the preceding [0-9] this means one or more digits.
, = match the comma character
[0-9] = as above
[0-9]* = as above
\] = Match the ] character. It is escaped here to mirror the \[, although a ] outside a bracket expression is normally treated as a literal character anyway.

For the replacement pattern
\1 = Replace the matching string with the first bit that the substitute command was told to remember i.e. the zero or more occurrences of the | character.
So in a more readable form, it says find the | (if there is one) preceding the bracket enclosed coordinates and replace the | and coordinates with the | character.
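
To see the g flag and \1 at work, here is a small sketch run against a made-up sample line (the variable names are only for illustration):

```shell
line='aa|[1,0]3009|[1,1]502'
pattern='s/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/'

# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed "$pattern")
# With g, every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed "${pattern}g")

printf '%s\n' "$first_only"    # aa|3009|[1,1]502
printf '%s\n' "$all_matches"   # aa|3009|502
```

In both cases the | survives because it is captured by \( \) and put back by \1.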

On reflection, the sed command should have been expressed more robustly as
sed -e 's/^\[[0-9]\{1,\},[0-9]\{1,\}\]//g' -e 's/\(|\)\[[0-9]\{1,\},[0-9]\{1,\}\]/\1/g' yourfile > newfile
I'll leave this as an exercise for people to work out what that does!
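
A hint for the exercise, as a sketch only: when two expressions are given on the command line, each one needs its own -e, otherwise sed takes the second as a file name. The function name strip_leading_coords and the sample line are made up for illustration. The idea is that coordinates are removed only at the start of the line or directly after a |, so a coordinate-like pattern in the middle of a field is left alone.

```shell
# Hypothetical helper: strip coordinates only where a field begins.
strip_leading_coords() {
    sed -e 's/^\[[0-9]\{1,\},[0-9]\{1,\}\]//' \
        -e 's/\(|\)\[[0-9]\{1,\},[0-9]\{1,\}\]/\1/g' "$@"
}

printf '%s\n' '[1,0]3009|[1,1]blah [3,5]blah|x' | strip_leading_coords
# 3009|blah [3,5]blah|x
```

The [3,5] survives because it follows a space rather than a | or the start of the line.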

David

Posted: Sat Jan 07, 2006 3:26 am
by jzparad
Or conversely, rather than saving the data you want and writing it out, simply get rid of the data you don't want.


$ sed 's/\[[0-9]*,[0-9]*\]//g' infile
3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980

Posted: Sat Jan 07, 2006 1:52 pm
by djm
Yes, though requiring the | prefix reduces the chance of discarding a coordinate-like pattern that is embedded in the data and is meant to be there.

D

Posted: Sat Jan 07, 2006 3:12 pm
by jzparad
You would be right if it were not for the fact that '|*' means zero or more. Therefore, it would still get discarded.

$ echo "[1,0]3009|[1,1]502"|sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
3009|502
$ echo "[1,0]3009|[1,1][1,1]"|sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
3009|