Page 1 of 1

CSV Data File with embedded quote

Posted: Thu May 15, 2003 4:04 pm
by datastagedummy
I have a comma separated source file which looks like

"10001257", "SUCCINIMIDE (ORDER ON"B")", "KG","S","B","N","A","Y","N","S","","","I","TJB","TJB","N",3,365.230000000,0.000000000,0.000000000,0.000000000,0.000000000,0.000000000,0.000000000,0.000000000,"0101-R032-12501","P052","","",0.000,"0","1",18.669158000,0.000000000,0.000000000,0.000000000,0.000000000,18.669158000,0.000000000,0.000000000,0.000000000,0.000000000,"20021231",18.669158000,"20021231","C","","","Y","C","Y",11415.410000000,0.000000000,"39","217361","O",54.230000000,"","19800101","SPGS237","20011119","T"

If you notice in the second column the text has an embedded quote in it.

In the Sequential file stage that i read this file I have the format set to Delimiter=, (comma) and Quote character = " (quote)

But when I run my job I get the following error

Is there a way to collect this kind of invalid data into a reject file rather than the whole job aborting with a fatal error.

Thanks to all of you

Posted: Thu May 15, 2003 7:17 pm
by ray.wurlod
What you have is an invalidly-formatted CSV file.

Scroll to the right in your Sequential File stage, and you will discover some extra columns that allow you to specify the rules for handling incomplete rows and for rows containing extraneous data.

Rather than have me type it all here, click on Help and read all about it there. [8D]

You can generate information from the Sequential File stage that causes a subsequent Transformer stage to redirect this row wherever you will.


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518

Posted: Thu May 15, 2003 10:26 pm
by datastagedummy
I did try changing the Incomplete Columns to "Discard and Warn" but it keeps telling me that "The Incomplete Column col2 must have values in ascending order"
I cannot set it to anything other than the "Error"

Like you suggested where would I put the rule to trap my type of errors as I know that it's only the 2nd column in the file that could give me the problem.

Thanks

Posted: Thu May 15, 2003 11:43 pm
by ray.wurlod
These rules form a hierarchy; you can't have a higher-level rule (such as Error) below a lower-level rule (such as Discard). This is all explained in the Help for the SequentialFile stage.
Once you have that right, you can then put your rule in a following Transformer stage.

Posted: Mon May 19, 2003 3:24 pm
by datastagedummy
Thanks Manjula but that wouldn't help as a "000" for quote character brings in the " as a text and it crashes if I have , (comma) in the columns.

Anybody else has any ideas ?

Posted: Mon May 19, 2003 4:53 pm
by chulett
Do you have any way to either preprocess the file or have the source correct the improper quoting?

Turning off the quotes should work fine, you would just strip them out in a transform. Unless, as you stated, there was a comma in a quoted string - then you need the quotes so it can tell the difference between field separators and embedded commas. Can't have it both ways. Bottom line seems to be a need to fix the data before it can be properly read.

On your source system, the (ORDER ON"B") should really be (ORDER ON'B') if you are going to dump it to CSV. Either that or quote the embedded quotes - replace each " with "" and I think you'll be ok then. It would take some playing around to get a good solution, what platform are you running on?

Also, you never mentioned what your original error was - probably had what the board thought was tags around it and it ate it. I don't believe your job should have fatally aborted, it should have logged some warnings and not processed the bad rows but only aborted if the warning count exceeded your threshold. What was your fatal error?

-craig

Posted: Mon May 19, 2003 6:15 pm
by ray.wurlod
To clarify:

Switch off the quote character, by setting it to 000.
All your quotes characters will now be read as part of the data. You may need to fiddle with data types and so on, but there's only one data type ("text") in a text file anyway.

To strip leading and trailing quote characters use the extended form of the Trim function in column derivation expressions in a Transformer stage. For example:
Trim(InLink.ColumnName, '"', 'B')
You may like to store this as a Transform (maybe called StripQuotes), since you'll be using it over and over.

Posted: Tue May 20, 2003 1:07 pm
by datastagedummy
Does anybody knows of any document which speaks about the risks of working with different Source System specially CSV.

Would appreciate if you could tell me the Link to that