PSV file has quote in the text body

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
supernova2005
Participant
Posts: 37
Joined: Mon Jan 24, 2005 10:12 am

PSV file has quote in the text body

Post by supernova2005 »

Good evening everyone,

I have a simple parallel job that reads a PSV file, that is, Pipe delimited file with double quotes, here is a sample

"aaa"|"bbb"|"ccc"|"dddd"
"eeee"|"fffff"|"ggggggg"|"hhhhhhh"


but the job skipped records, and when I inspect the file, it looks like some of the fields have quotes in the text itself, like this

"aaaaaa"|"bbbbbbbbb"|"ccccc "ccc" c"|"ddddddddd"

and wherever there are extract quotes in the text, the sequential file stage skips the entire row, can anyone tell me how to configure the squential file stage to handel this?

thank you in advance!
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

You can't.

That's the challenge with delimited files. You need to choose a delimiter that will not appear in the data. If there is a chance that the delimiter will appear in the data you need to quote it (and then the quote character can not appear in the data).

You're SOL when both the delimiter and quote character appear in the data.

Might be an interesting exercise in 8.5 to see if the new looping functionality might be leveraged to solve this common problem.

Your file needs to be scrubbed. I would try cleaning it with a server job or some kind of script (awk/sed/perl).

Mike
supernova2005
Participant
Posts: 37
Joined: Mon Jan 24, 2005 10:12 am

Post by supernova2005 »

Is it possible to have "|" as the delimiter string? I tried that but the sequential file stage does not take it.

This is a little bit strange because I have no problem parsing the file using Excel, by doing a text to column.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I would think that you could tell it that your fields are not quoted so they come in as data and then you can trim off only the leading/trailing quotes.
-craig

"You can never have too many knives" -- Logan Nine Fingers
supernova2005
Participant
Posts: 37
Joined: Mon Jan 24, 2005 10:12 am

Post by supernova2005 »

chulett wrote:I would think that you could tell it that your fields are not quoted so they come in as data and then you can trim off only the leading/trailing quotes. ...
Craig,

That is what I thought, but it looks like the sequential file stage does not take "|" as a delimit string. whenever I hit ok, and open the sequential file stage again, it is back to |, I guess " is a special character and I probably need a special way to represent it.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not what I meant. Leave the field delimiter as a pipe and tell the stage that string fields are not quoted so all quotes come in as data.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

That'll work as long as the pipe character does not appear in your data. When it does you'll reject those records for having too many fields.

Mike
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

If you want to try your luck with a 3-character delimiter like "|" (might work if you don't have a pipe at the start or end of your record), then you need to use the delimiter string property. The delimiter character property is limited to a single character.

Mike
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

And then you'll need to deal with the leading quotes in the first field and the trailing ones in the last... but some combination of the above should work. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Use a server job (or a server Sequential File stage in a server shared container). This is much better at handling delimiter character inside quoted text than is its parallel equivalent.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
supernova2005
Participant
Posts: 37
Joined: Mon Jan 24, 2005 10:12 am

Post by supernova2005 »

Mike wrote:If you want to try your luck with a 3-character delimiter like "|" (might work if you don't have a pipe at the start or end of your record), then you need to use the delimiter string property. The delimiter character property is limited to a single character.

Mike
Mike,

I did use delimiter string stead of delimiter character, but it seems that the sequential file stage doesn't understand what "|" means, maybe " is a special character or something.

Thanks.
supernova2005
Participant
Posts: 37
Joined: Mon Jan 24, 2005 10:12 am

Post by supernova2005 »

This is my solution in the end, I wrote a simple shell script to remove all quotes in the incoming directory, then use to sequential file stage to read all *.new files, after jobs are completed, remove all *.new files.

Per suggestion from Ray, yes, Server may work, but I want to use the "file pattern" option in PX so I decided to use a parallel job instead.

this is the extra script I wrote.

#!/bin/ksh

input_file_path=$1

cd $input_file_path
for file in * ; do

cat $file | sed s/'"'/''/g >$file.new
done
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

... which means there are no 'extra' pipes in your file and thus my suggestion should have worked just fine as well.
-craig

"You can never have too many knives" -- Logan Nine Fingers
supernova2005
Participant
Posts: 37
Joined: Mon Jan 24, 2005 10:12 am

Post by supernova2005 »

chulett wrote:... which means there are no 'extra' pipes in your file and thus my suggestion should have worked just fine as well.
Craig,

Yes, my assumption is, there is no extra pipes, if there are extra pipes, there is still a way to get around, this is what I think,

step 1. remove all #'s
step 2. replace all "|" by #
step 3. remove all | ( the single character |, not the string "|" ), and remove all "
step 4. add " at the begining and end of each line
step 5. repace # by |

done.
Post Reply