Help with using sed, awk, nawk or tr

mhester · Post by **mhester** » Thu Jan 05, 2006 3:20 pm

Here's the situation..... I have an input file which contains rows of data that look something like the following -

[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980

Each field is separated by a "|" delimiter. Each field contains a set of coordinates (not essential) and data. Without using UV BASIC I want to remove all coordinate data from each field of every row of incoming data so the above row when output would look like the following -

3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980

I would like to do this via unix with either the sed, awk, nawk or tr command or whatever would work.

Any ideas?

Thanks,

ArndW · Post by **ArndW** » Thu Jan 05, 2006 4:20 pm

Michael,

are you sure you won't relent and use a DS/Basic function to do this? It would only be 9 lines long...

   StringLen = LEN(Arg1)
   Ans = ''
   Skip = 0
   FOR i = 1 to StringLen
      CurrentChar = Arg1[i,1]
      IF CurrentChar = '[' THEN Skip = 1 
      ELSE IF CurrentChar=']' THEN Skip = 0 
      ELSE IF NOT(Skip) THEN Ans := CurrentChar
   NEXT i

[But I am no good at awk and would welcome learning that method]

kcbland · Post by **kcbland** » Thu Jan 05, 2006 4:35 pm

I scratched my head and except for a few fleas no memories of switches on given commands came to mind.

The only solution I could think of is a .ksh script to parse each row: loop from 1 to x, where x is the number of fields (the count of "|" found in the line, plus one), use cut -d"|" -fx to extract each field, then use cut -d"]" -f2- to take everything after the first ], concatenate the result to a variable, and output the line at the end of the loop. That will be dog slow, as cut re-parses the line from the beginning on each loop, and the commands are slow anyway.
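The loop described above might look something like the following. This is only a sketch under a few assumptions: it is written for a POSIX shell (ksh should run it unchanged), it reads rows from standard input, and the function name strip_coords_cut is made up for illustration.

```shell
#!/bin/sh
# Sketch of the loop-and-cut approach. strip_coords_cut is a hypothetical name.
strip_coords_cut() {
    while IFS= read -r line; do
        # Number of fields = number of "|" characters in the line, plus one
        bars=$(printf '%s' "$line" | tr -cd '|' | wc -c)
        nfields=$((bars + 1))
        out=""
        i=1
        while [ "$i" -le "$nfields" ]; do
            # Extract field i; cut re-reads the whole line every time
            field=$(printf '%s\n' "$line" | cut -d'|' -f"$i")
            # Only fields that carry a coordinate prefix need the second cut
            case $field in
                *']'*) field=$(printf '%s\n' "$field" | cut -d']' -f2-) ;;
            esac
            if [ "$i" -eq 1 ]; then out=$field; else out="$out|$field"; fi
            i=$((i + 1))
        done
        printf '%s\n' "$out"
    done
}

printf '%s\n' '[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness' | strip_coords_cut
# prints: 3009|502|PI Svc Timeliness
```

Every field costs two or three extra processes, which is where the "dog slow" comes from on a large file.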

Have fun.

mhester · Post by **mhester** » Thu Jan 05, 2006 4:42 pm

Ken and Arnd,

Thanks!

Both are solid solutions. I had hoped to do it via one of the commands I listed but I do understand that this may not be possible with a simple command.

Thanks again

mhester · Post by **mhester** » Thu Jan 05, 2006 4:52 pm

I really thought this question would have elicited a response from the Duke-a-nator!

Come on Kim..... give me the Unix one-line command answer

ray.wurlod · Post by **ray.wurlod** » Thu Jan 05, 2006 5:29 pm

It's do-able with awk, but not in a single line - you'd need an awk script to loop through the arbitrary (or even fixed) number of sets of coordinates. So you may as well go with any kind of script.

mhester · Post by **mhester** » Thu Jan 05, 2006 5:37 pm

I implemented Arnd's solution (thanks Arnd!) and it works wonderfully. I just wanted to broaden my knowledge and do it in a way that I am not so familiar with.

kduke · Post by **kduke** » Thu Jan 05, 2006 8:40 pm

I would do it like Ken. BASIC is so much easier though. If you did this in a shell script you would need to parse one field at a time using cut -d'|' -fx where x goes to the end of the line.

I think if you were clever then Perl would work because you want * between ] and [ or |. Sed and awk can do the same.
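For awk, one possible single-command form uses gsub to delete every bracketed coordinate pair on the line. This is a sketch, not a tested production command: it assumes an awk that has gsub (nawk, gawk or any POSIX awk; some very old /usr/bin/awk versions do not), and the wrapper name strip_coords_awk is made up for illustration.

```shell
# Delete every [digits,digits] group on each line, then print the line.
# strip_coords_awk is a hypothetical wrapper; pass file names or pipe data in.
strip_coords_awk() {
    awk '{ gsub(/\[[0-9]+,[0-9]+\]/, ""); print }' "$@"
}

printf '%s\n' '[1,0]3009|[1,1]502|[1,10]1000|[1,11]980' | strip_coords_awk
# prints: 3009|502|1000|980
```

Like any pattern that is not anchored to the field delimiter, this also removes a coordinate-like group that happens to sit in the middle of a field.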

djm · Post by **djm** » Thu Jan 05, 2006 11:03 pm

On the presumption that the co-ordinates syntax does not appear within the useful data e.g. you don't have a field something like

...|[1,2]blah blah [3,5]blah|...

see whether the following achieves the desired result.

sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g' yourfile > newfile

Post the success or otherwise.
David

mhester · Post by **mhester** » Fri Jan 06, 2006 8:26 am

D,

Your presumption is correct and your solution worked wonderfully! - Thanks

The following rows of data -

aa|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
aa|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
bb|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
cc|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
dd|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ee|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|
ff|[1,0]3009|[1,1]502|[1,2]PI Svc Timeliness|[1,3]2005|[1,4]11|[1,5]AF|[1,6]EPIR|[1,7]New Business|[1,8]10|[1,9]10|[1,10]1000|[1,11]980|

Now look like -

aa|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
aa|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
bb|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
cc|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
dd|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ee|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|
ff|3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980|

Which is what I wanted.

Thanks again!

ray.wurlod · Post by **ray.wurlod** » Fri Jan 06, 2006 4:42 pm

You obviously need a certain kind of mind to do that kind of sed stuff!

David, it might be nice to explain what the sed script

's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'

is doing, for those who haven't used the utility before. Particularly the need for the escape characters ("\"), the repeater characters ("*") and the global specifier ("g") (those may not be correct terminology).

djm · Post by **djm** » Fri Jan 06, 2006 5:31 pm

Ray, I guess you mean other than "man sed" and "man 5 regexp"?

Basically the bit between the quotes is a command to sed (stream-editor). A breakdown of the particular command issued is as follows:

s/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g

s = substitute command
/ = delimiter for the different "arguments" of the substitute command
\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\] = specifies what to match (see below)
/ = argument delimiter
\1 = what the matching string should be replaced with (see below)
/ = argument delimiter
g = identifies that the substitution should be applied to every matching string in the line rather than just the first

Now the breakdown of the what to match and replacement are:

\( = Identifies the start of something that you want substitute to remember. The \ character "escapes" the ( so that the expression doesn't try to match a (.
|* = match the | character zero or more times (the * does the zero or more times).
\) = the end of the bit you want substitute to remember
\[ = Match the [ character. The [ has to be escaped as [ has special meaning within a regular expression (see next line).
[0-9] = Match exactly one character from the range 0 to 9.
[0-9]* = Match zero or more further characters from the range 0 to 9. Together with the previous item this means "one or more digits".
, = match the comma character
[0-9] = as above
[0-9]* = as above
\] = Match the ] character. Likewise this has to be escaped as otherwise the ] is interpreted as having special meaning within a regular expression.

For the replacement pattern
\1 = Replace the matching string with the first bit that the substitute command was told to remember, i.e. the zero or more occurrences of the | character.

So in a more readable form, it says: find the | (if there is one) preceding the bracket-enclosed coordinates and replace the | and coordinates with the | character.
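To see the g flag and the \1 back-reference at work, three variants of the command can be run against a short sample line. This is a minimal sketch: the function names are made up for illustration and the sample line is a shortened version of the data earlier in the thread.

```shell
# Three variants of the same substitution; all function names are hypothetical.

# No g flag: only the first match on the line is replaced
first_only()  { sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/'; }

# g flag: every match on the line is replaced
every_match() { sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'; }

# Empty replacement instead of \1: the remembered | is thrown away too
drop_group()  { sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]//g'; }

line='aa|[1,0]3009|[1,1]502'
printf '%s\n' "$line" | first_only    # prints: aa|3009|[1,1]502
printf '%s\n' "$line" | every_match   # prints: aa|3009|502
printf '%s\n' "$line" | drop_group    # prints: aa3009502
```

The last variant shows why the \( \) and \1 are there: the | is part of what gets matched, so it has to be put back by the replacement.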

On reflection, the sed command should have more robustly been expressed as

sed -e 's/^\[[0-9]\{1,\},[0-9]\{1,\}\]//g' -e 's/\(|\)\[[0-9]\{1,\},[0-9]\{1,\}\]/\1/g' yourfile > newfile

I'll leave this as an exercise for people to work out what that does!
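One way to check the two-expression form is to feed it a line where a coordinate-like pattern appears in the middle of a field. The sketch below assumes each expression is passed to sed with its own -e option (sed treats a second bare argument as a file name); the function name strip_anchored and the sample line are made up for illustration.

```shell
# First expression: coordinates at the very start of the line.
# Second expression: coordinates that directly follow a | delimiter.
# strip_anchored is a hypothetical wrapper name.
strip_anchored() {
    sed -e 's/^\[[0-9]\{1,\},[0-9]\{1,\}\]//g' \
        -e 's/\(|\)\[[0-9]\{1,\},[0-9]\{1,\}\]/\1/g' "$@"
}

printf '%s\n' '[1,0]3009|[1,1]blah [3,5]blah|[1,2]502' | strip_anchored
# prints: 3009|blah [3,5]blah|502
```

The [3,5] in the middle of the second field survives because it follows neither the start of the line nor a | character.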

David

jzparad · Post by **jzparad** » Sat Jan 07, 2006 3:26 am

Or conversely, rather than saving the data you want and writing it out, simply get rid of the data you don't want.

$ sed 's/\[[0-9]*,[0-9]*\]//g' infile
3009|502|PI Svc Timeliness|2005|11|AF|EPIR|New Business|10|10|1000|980

djm · Post by **djm** » Sat Jan 07, 2006 1:52 pm

Yes. Though searching for coordinates prefixed by the | reduces the likelihood of an unexpected coordinate-like pattern embedded in the data, which is meant to be there, being discarded.

D

jzparad · Post by **jzparad** » Sat Jan 07, 2006 3:12 pm

You would be right if it were not for the fact that '|*' means zero or more. Therefore, it would still get discarded.

$ echo "[1,0]3009|[1,1]502"|sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
3009|502
$ echo "[1,0]3009|[1,1][1,1]"|sed 's/\(|*\)\[[0-9][0-9]*,[0-9][0-9]*\]/\1/g'
3009|