Issues while reading a Sequential File

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India
Contact:

Issues while reading a Sequential File

Post by thebird »

Hi,

I have a job that reads data from a sequential file. The input data has around 40-50 fields. The data is comma delimited, with the Varchar and Char fields in double quotes.So, in the sequentail file stage, I have set the Delimiter property to "comma" and Quotes to "Double".

One of the fields (7th field) has data which contains a double quote in between to specify inches. Because of this particular quote, the data is not being read properly.

For Eg.

Input Data -

....col6,col7,col8,col9,col10......
2234,"BLACK 5" - BELT",34.578,"H",67.9

This data is being read something like -

col6|col7|col8|col9|col10|col11|.......

2234|BLACK 5|BELT|3|4.578|"H|....

This is not an exact depiction. What I am trying to say is that because of the double quote (which is part of the data) in the 7th field, the remaining data in that field gets shifetd over to the next field, thus, causing the remaining data in that row also to be shifted.

Is there a way that I can handle this? How can I get the particular quote to be treated as data?

Regards,

The Bird.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

This is not an uncommon problem.

The easiest way to fix this is to make sure that the program creating the sequential file does it correctly - either through use of another quote character or doubling up the quotes as per csv specs.

If you cannot do this then you need to modify the sequential file to match this. The rule you can use is that the first character and the last character of each line are valid quotes; plus any occurrences of '","' within the line. Any other double-quotes in the line need to be replaced with 2-double-quotes.

I recall having seen this exact problem thread recently and think a solution in BASIC code was posted there.
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India
Contact:

Post by thebird »

Hi Arnd,
The easiest way to fix this is to make sure that the program creating the sequential file does it correctly - either through use of another quote character or doubling up the quotes as per csv specs.
Can you please explain what doubling up the quotes as per csv specs mean, I didnt get the point.

Also thanks for the quick response.

Regards,

The Bird.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If you have embedded quotes in a string they are doubled (the following is a correctly formatted CSV string:

Code: Select all

"This line is about 4"" long.",1,2,3,"Hello's to all. ""Hi There"" all!"
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India
Contact:

Post by thebird »

ArndW wrote:If you have embedded quotes in a string they are doubled (the following is a correctly formatted CSV string:

Code: Select all

"This line is about 4"" long.",1,2,3,"Hello's to all. ""Hi There"" all!"
Arnd,

In this case how do I read the data in the right format from the sequential file stage? Which property would I have to set in the stage?

How does DS recognise the doubled double quote as part of data, as we will be setting the Delimiter property to "comma" and Quotes to "Double" in the Field Defaults option in the stage. Or am I wrong in thinking so?



Regards,

The Bird.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

TheBird,

:? <sigh>. Please read my original response. There is no easy "just click somewhere" option to fix this, unless you can make the person giving you this data change their method.

1. You now know what the problem is.
2. You now know what the rules are to fix this problem.
3. You are a DataStage professional and can use DS or any or tool to re-format your source file according to the rules in (2) [see original response to your query]

DataStage parses the input string and if the character after a quote character is another quote character it KNOWS that this is not a delimiter,but an internal quote; it removes the doubled quote and continues parsing the string. This is standard CSV methodology. Open up your EXCEL, put in a string with embedded doubke quotes and write it to a CSV file and you will understand.
Post Reply