PSV file has quote in the text body
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 37
- Joined: Mon Jan 24, 2005 10:12 am
PSV file has quote in the text body
Good evening everyone,
I have a simple parallel job that reads a PSV file, that is, Pipe delimited file with double quotes, here is a sample
"aaa"|"bbb"|"ccc"|"dddd"
"eeee"|"fffff"|"ggggggg"|"hhhhhhh"
but the job skipped records, and when I inspect the file, it looks like some of the fields have quotes in the text itself, like this
"aaaaaa"|"bbbbbbbbb"|"ccccc "ccc" c"|"ddddddddd"
and wherever there are extract quotes in the text, the sequential file stage skips the entire row, can anyone tell me how to configure the squential file stage to handel this?
thank you in advance!
I have a simple parallel job that reads a PSV file, that is, Pipe delimited file with double quotes, here is a sample
"aaa"|"bbb"|"ccc"|"dddd"
"eeee"|"fffff"|"ggggggg"|"hhhhhhh"
but the job skipped records, and when I inspect the file, it looks like some of the fields have quotes in the text itself, like this
"aaaaaa"|"bbbbbbbbb"|"ccccc "ccc" c"|"ddddddddd"
and wherever there are extract quotes in the text, the sequential file stage skips the entire row, can anyone tell me how to configure the squential file stage to handel this?
thank you in advance!
You can't.
That's the challenge with delimited files. You need to choose a delimiter that will not appear in the data. If there is a chance that the delimiter will appear in the data you need to quote it (and then the quote character can not appear in the data).
You're SOL when both the delimiter and quote character appear in the data.
Might be an interesting exercise in 8.5 to see if the new looping functionality might be leveraged to solve this common problem.
Your file needs to be scrubbed. I would try cleaning it with a server job or some kind of script (awk/sed/perl).
Mike
That's the challenge with delimited files. You need to choose a delimiter that will not appear in the data. If there is a chance that the delimiter will appear in the data you need to quote it (and then the quote character can not appear in the data).
You're SOL when both the delimiter and quote character appear in the data.
Might be an interesting exercise in 8.5 to see if the new looping functionality might be leveraged to solve this common problem.
Your file needs to be scrubbed. I would try cleaning it with a server job or some kind of script (awk/sed/perl).
Mike
-
- Participant
- Posts: 37
- Joined: Mon Jan 24, 2005 10:12 am
-
- Participant
- Posts: 37
- Joined: Mon Jan 24, 2005 10:12 am
Craig,chulett wrote:I would think that you could tell it that your fields are not quoted so they come in as data and then you can trim off only the leading/trailing quotes. ...
That is what I thought, but it looks like the sequential file stage does not take "|" as a delimit string. whenever I hit ok, and open the sequential file stage again, it is back to |, I guess " is a special character and I probably need a special way to represent it.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Use a server job (or a server Sequential File stage in a server shared container). This is much better at handling delimiter character inside quoted text than is its parallel equivalent.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Participant
- Posts: 37
- Joined: Mon Jan 24, 2005 10:12 am
Mike,Mike wrote:If you want to try your luck with a 3-character delimiter like "|" (might work if you don't have a pipe at the start or end of your record), then you need to use the delimiter string property. The delimiter character property is limited to a single character.
Mike
I did use delimiter string stead of delimiter character, but it seems that the sequential file stage doesn't understand what "|" means, maybe " is a special character or something.
Thanks.
-
- Participant
- Posts: 37
- Joined: Mon Jan 24, 2005 10:12 am
This is my solution in the end, I wrote a simple shell script to remove all quotes in the incoming directory, then use to sequential file stage to read all *.new files, after jobs are completed, remove all *.new files.
Per suggestion from Ray, yes, Server may work, but I want to use the "file pattern" option in PX so I decided to use a parallel job instead.
this is the extra script I wrote.
#!/bin/ksh
input_file_path=$1
cd $input_file_path
for file in * ; do
cat $file | sed s/'"'/''/g >$file.new
done
Per suggestion from Ray, yes, Server may work, but I want to use the "file pattern" option in PX so I decided to use a parallel job instead.
this is the extra script I wrote.
#!/bin/ksh
input_file_path=$1
cd $input_file_path
for file in * ; do
cat $file | sed s/'"'/''/g >$file.new
done
-
- Participant
- Posts: 37
- Joined: Mon Jan 24, 2005 10:12 am
Craig,chulett wrote:... which means there are no 'extra' pipes in your file and thus my suggestion should have worked just fine as well.
Yes, my assumption is, there is no extra pipes, if there are extra pipes, there is still a way to get around, this is what I think,
step 1. remove all #'s
step 2. replace all "|" by #
step 3. remove all | ( the single character |, not the string "|" ), and remove all "
step 4. add " at the begining and end of each line
step 5. repace # by |
done.