Reading in strings with quotes from a csv file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
vinnz
Participant
Posts: 92
Joined: Tue Feb 17, 2004 9:23 pm

Reading in strings with quotes from a csv file

Post by vinnz »

I have a job that reads in a comma separated file with " (double quotes) as the quote character and this was working fine for the last few months until today when we received a file which had a double quote in the data. The particular record looks like the below.

...,"White","LX (AT-16" TIRES)","12" ...

The job aborts with a fatal error while reading in this record. One possible way I could think of preventing this is to suppress the quote character in the sequential stage and have a filter command remove ALL double & single quotes from the file.

Is there another way that is better? The fatal error I get is below

Code: Select all

SEQ_IntfHST.LNK_IntfOUT: read_delimited() - invalid quotes, column T_SRS = "LX (AT-16"
thanks in advance,
paul
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Reading in strings with quotes from a csv file

Post by chulett »

vinnz wrote:Is there another way that is better?
Unless the source can correct this and stop sending you an invalid file, I don't think you have much choice. :?

Set the quote character to '000' to disable it and then remove the quotes from the data, either field-by-field in the job or by utilizing the Filter capability of the Sequential File stage with something like 'sed'.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vinnz
Participant
Posts: 92
Joined: Tue Feb 17, 2004 9:23 pm

Post by vinnz »

Thanks for the reply Craig. I've tried this but now there is another problem, the data also has commas in it, so disabling the quote character or removing quotes altogether causes the field to be read as 2 fields. Is it possible do something like removing quotes inside quoted strings or escape it for DataStage?

...,"White","LX (AT-16 " TIRES)","12","chicago, IL", ...

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It's possible but messy; you need to process the entire string a character at a time keeping track of whether or not you're inside a quoted string, generating "clean" output. You can do this with sed or awk or C or DataStage BASIC or Perl or whatever else takes your fancy. For quotes within quotes you may need context detection (DataStage BASIC pattern matching is good for this, Matches \1N0N'"'\ will detect (match) patterns like 16".

Far better is to demand sensible input; after all, if you can't process it, chances are that they can't either. GIGO
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
leomauer
Premium Member
Premium Member
Posts: 100
Joined: Mon Nov 03, 2003 1:33 pm

Post by leomauer »

Logically it can be done by removing a quote character not followed by comma. I am not too fluent in regular expressions but I believe it can be coded. Of course if you have the ", combination inside the field then there is no way you can recognize it.
Therefore Ray is right - GIGO.
vinnz
Participant
Posts: 92
Joined: Tue Feb 17, 2004 9:23 pm

Post by vinnz »

Thanks for the replies. Yes, I believe I will have to look in the file for quotes that are not preceded or followed by a comma or a line terminator and isnt the first character on the line. I do not know if DataStage BASIC allows for such regular expressions.

I would like to get it fixed at the external source but the chances are that they will just rectify this one record intead of putting in a fix anytime soon.

GIGO??
davidnemirovsky
Participant
Posts: 85
Joined: Fri Jun 04, 2004 2:30 am
Location: Melbourne, Australia
Contact:

Post by davidnemirovsky »

GIGO = "Garbage In, Garbage Out"

The best option is of course to change the source data. But it looks like you won't be getting a fix in the soruce system anytime soon. It really depends on how often you will be getting data with the double quotes character inside the actual field. A nice way to fix it might be to change the operational system to supply a TAB delimited row so that you can accept both quotes and commas in your data if necessary.

If you keep getting data in this way it might be a good idea to write a robust routine in DataStage BASIC. It is more than capable of handling such tasks. Turn off the quotes and the commas and pass the entire row into the routine. You should be able to process it for quotes and commas and return a row delimited with another character.
Cheers,
Dave Nemirovsky
weiven
Participant
Posts: 5
Joined: Wed Oct 13, 2004 1:55 am

Post by weiven »

Most recently i have this problem as well. Im using a csv file with deliminator of comma & double quote (""). With few testing of the deliminator, there is a way to handle this :

--If you are using double quote(" "), then inside the data musn't contain any DOUBLE quote as it will crash the reading of csv file.

OR

--the other way round, if u change your quote to single(' ') , make sure that is isnt any SINGLE quote inside the data.

GOOD LUCK.. :wink:
weiven
vinnz
Participant
Posts: 92
Joined: Tue Feb 17, 2004 9:23 pm

Post by vinnz »

weiven wrote:--If you are using double quote(" "), then inside the data musn't contain any DOUBLE quote as it will crash the reading of csv file.
This is what is happening right now and what I'm trying to get around.
djhigh wrote:If you keep getting data in this way it might be a good idea to write a robust routine in DataStage BASIC. It is more than capable of handling such tasks. Turn off the quotes and the commas and pass the entire row into the routine. You should be able to process it for quotes and commas and return a row delimited with another character.
This is exactly what I am trying to do. Splitting the line into individual fields looks like it is going to take a long time (has a little less than 100 fields)
Or else I would atleast direct to a reject link the row on file containing those offending quotes using a constraint.
Maybe I'm looking in wrong places, the but does 'Matches' allow expressions to recognize a line such as this? - lines with quotes inside quotes or looking at it another way quotes that are not preceded or followed by a comma or a line terminator and isnt the first character on the line?

Thanks for all the help.
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

Another option is to change the Incomplete Column setting from Error to Discard or Discard and Warn to get you past the error. Scroll to the right on the Output / Columns tab of the sequential stage, and you will see this option. You can devise another exception process to deal with the bad rows without impacting the general performance of the daily job.
vinnz
Participant
Posts: 92
Joined: Tue Feb 17, 2004 9:23 pm

Post by vinnz »

chucksmith wrote:Another option is to change the Incomplete Column setting from Error to Discard or Discard and Warn to get you past the error. Scroll to the right on the Output / Columns tab of the sequential stage, and you will see this option. You can devise another exception process to deal with the bad rows without impacting the general performance of the daily job.
chucksmith, thanks for the reply, this setting doesnt seem to help as I get the same invalid quote error. I believe using this option is only valid for missing columns.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Quotes inside quotes can sometimes be detected by counting the quote characters - if the number is odd you've got a problem. Unfortunately if the number is even, you still may have a problem - two problems!
If you know how many character string fields there should be (call it N) then there should be 2N quote characters.

You requirement to detect "quotes that are not preceded or followed by a comma or a line terminator and isn't the first character on the line" will need some AND or OR constructs. For example

Code: Select all

(Index(String, \"\, 1) > 0) And (Not(String Matches \0X'",'0X\) And Not(String Matches \0X',"'0X\))
The Index function simply detects that there is at least one quote character in the string. The two Matches operators detect ", or ," somewhere within the string.

This may not be your exact requirement, but ought to get you started.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ShaneMuir
Premium Member
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

This might not add to much value to the topic but if you can't isolate the column correctly with the comma delimiter, can you get the delimiter changed to somthing that is less likely to appear in the data like the pipe | delimiter. Admittedly this will only help if the delimiter chosen can never appear in the data.

As a last option can you get the file changed to fixed width? You will carry a lot of empty data and your files will be a lot bigger, but you won't have the problem anymore.

Regards
Shane
Post Reply