CSV file with embedded quotes and commas

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
eli.nawas_AUS
Premium Member
Premium Member
Posts: 39
Joined: Tue Apr 15, 2014 9:14 am

CSV file with embedded quotes and commas

Post by eli.nawas_AUS »

I have looked around on google and this forum, as well as the documentation, and experimenting for myself, and not found a solution. If there is no DataStage native solution, I would like confirmation that that is the case. I do not want to see a command-line scripting solution to this problem.

It seems like the Datastage parallel job file input step cannot handle general Excel CSV files. For instance, the below.

30-40 ABC+C,TV,MXN,0.08026,160701,,,SPACE,"20""",,,Cable / Satellite,,"97,253.20","7,805.13",,0,"97,253.20","7,805.13",2,1/6/14,1/12/14,January,"24,313.30","24,313.30","1,951.28","1,951.28",2014
Delimiter for field "Burst_Spend_NNMC_USD" not found; input: {2 0 1 4}, at offset: 368

There are commas embedded in the fields, and also double quotes. Setting the file as double-quoted will fail on the embedded quotes, setting it as not double-quoted will fail on the embedded commas. How can I read a line like this in correctly?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Had you searched more diligently you would have found that the parallel Sequential File stage is really bad at this whereas the server Sequential File stage is brilliant with these and other aberrations (such as missing columns).

Therefore best advice is to use a server Sequential File stage, either in a server job or in a server Shared Container in a parallel job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ssnegi
Participant
Posts: 138
Joined: Thu Nov 15, 2007 4:17 am
Location: Sydney, Australia

Post by ssnegi »

I would suggest that you read the file as varchar for all the fields.
So this will read the " and , also.
Then in transformer stage use loop variable to loop through the line and convert the ", to ''(empty).
This will remove these characters.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Or just do as Ray suggests. Heck, there may not be anything about this Excel reading job that really "needs" to be done in Parallel and a Server job could have it done while PX was still getting its pants on. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Maybe they don't want to remove the characters which are, after all, part of the data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply