
Determining if sequential file has correct number of columns

Posted: Thu Feb 19, 2009 11:13 pm
by sbass1
Hi,

Say I have a CSV file like so:

Foo,Bar,"Unit 1, 234 Main St, Sydney",Blah

With the correct sequential file link format parameters, this will become:

Col1=Foo
Col2=Bar
Col3=Unit 1, 234 Main St, Sydney
Col4=Blah

I have a requirement to ensure the input file has the correct number of columns.

My validation code reads the entire line as one long string and counts commas, like so:

ColumnCount = Dcount(Extract.DataLine, ",")  ;* ColumnCount is a stage variable
BadRow = ColumnCount ne $SourceFileColumnCount  ;* job parameter, == 72

You see my quandary above - I reject rows where commas are embedded within the quote characters.

Now, I may be able to get the source system to create the file with another delimiter, say a tilde, where I'm guaranteed it will only ever be used as a delimiter. If so, my functional problem is solved.

However, I still have these questions:

1) Is the above (counting delimiters in the incoming file) a good approach?

2) I note with great sadness that regular expressions are not supported (at least OOTB) in DS 7.x. I have read for example viewtopic.php?t=107882.

If regex's were supported, there are some complex regex's that will return "all hits for char X except when between quotes". It's been a while since I've coded one of these, but Google is my friend...

So, is there a functional equivalent in DS of "return a count of char X unless it is between quotes" without the use of regular expressions?
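For what it's worth, the quote-aware count needs no regex at all - just a toggle that flips on each quote character. Here is a sketch in Python rather than DataStage BASIC (the loop translates directly to a BASIC routine); it assumes quotes come in balanced pairs and does not handle doubled or escaped embedded quotes:

```python
def count_outside_quotes(line, delim=",", quote='"'):
    """Count occurrences of delim that are not enclosed in quote characters."""
    count = 0
    in_quotes = False
    for c in line:
        if c == quote:
            in_quotes = not in_quotes  # toggle quoted state on each quote
        elif c == delim and not in_quotes:
            count += 1
    return count

line = 'Foo,Bar,"Unit 1, 234 Main St, Sydney",Blah'
print(count_outside_quotes(line))  # 3 delimiters, i.e. 4 columns
```

The embedded commas inside "Unit 1, 234 Main St, Sydney" are skipped, so the row passes a 4-column check that a naive comma count would fail.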

Thanks,
Scott

Posted: Thu Feb 19, 2009 11:33 pm
by ray.wurlod
Resist stupid requirements!

Let the metadata do it for you.
If you must, define a fifth field (VarChar). If it is populated, then the line has too many fields.
Too few is a harder problem if field #4 can legitimately be empty.

Posted: Fri Feb 20, 2009 12:58 am
by ray.wurlod
It is easy enough to write such a function. Probably many have done so.

Posted: Fri Feb 20, 2009 4:35 pm
by sbass1
1. Yes, things are always easy to do when one knows how to do it. Would one of the "many who have done so" care to chime in with the details?

2. In the previous post I supplied the link for, it was for parallel jobs; does that post also apply to 7.x server jobs?

3. If it is so easy to do so, I wonder why the fleet of programmers at IBM haven't implemented it? It's certainly available in other ETL products I've used.

4. So are regex's supported in 8.x?

Posted: Fri Feb 20, 2009 5:08 pm
by ray.wurlod
1. That's the kind of thing I do for a living. See this page for example.

2. It's rarely a safe assumption that any answer relating to parallel jobs applies to server jobs.

3. Where's the business case? IBM don't do anything that won't turn a dollar.

4. No more than in earlier versions. Even the post to which you linked earlier included the statement that the full functionality of regular expressions is not supported out of the box.

Posted: Sat Feb 21, 2009 8:16 am
by Mike
Pre-validation of delimited flat files is not something that I've actually done. You are already well aware of the perils associated with delimited text files.

Here is what I might typically do:
1) Ask for a fixed-width flat file.
2) Ask for a delimited file with a delimiter that is guaranteed by the sender to not be in the data.
3) Define the metadata in the sequential file stage and let the stage complain about too few or too many columns. Allow this to abort the job (I always prefer to set my warning limit at 1).

If none of these are acceptable, then I might create a simple DataStage job to preprocess and validate the file structure (I've never actually gotten as far as this option):
1) Define the sequential file with no field or record delimiters or quote characters.
2) Define a single CHAR(1) column.
3) Read the file one byte at a time counting delimiters. Stop counting when the quote character is read, resume counting when the next quote character is read, start over when a newline character is read.

If everything is parameterized, then this could become a reusable utility job.
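The counting scheme in steps 1-3 above amounts to a small state machine. A rough sketch in Python for illustration (in DataStage this would be a server routine; the parameter names here are invented):

```python
def validate_lines(lines, expected_cols, delim=",", quote='"'):
    """Per line: count delimiters, pausing while inside quotes,
    and report whether the column count matches expectations."""
    for lineno, line in enumerate(lines, start=1):
        counting = True   # stop counting when a quote opens, resume when it closes
        delims = 0
        for c in line.rstrip("\n"):
            if c == quote:
                counting = not counting
            elif c == delim and counting:
                delims += 1
        cols = delims + 1  # the count starts over on each newline
        yield lineno, cols, cols == expected_cols

for result in validate_lines(['a,b,"c,d",e\n', 'a,b\n'], expected_cols=4):
    print(result)
```

With the delimiter, quote character, and expected column count all parameterized, this is the reusable utility job in miniature.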

Mike

Re: Determining if sequential file has correct number of columns

Posted: Tue Feb 24, 2009 8:20 am
by PhilHibbs
sbass1 wrote:Say I have a CSV file like so:

Foo,Bar,"Unit 1, 234 Main St, Sydney",Blah
...
I have a requirement to ensure the input file has the correct number of columns.
...
If regex's were supported, there are some complex regex's that will return "all hits for char X except when between quotes". It's been a while since I've coded one of these, but Google is my friend...
I would be wary of a regex for this anyway. What convention does the file use for embedded quote characters? If it doubles them up like Excel and DataStage Server jobs do, then you're OK, but if it uses \" like DataStage Parallel jobs do and Server jobs can be configured to do, then your regex becomes a lot more complicated.
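The two conventions can be seen side by side with Python's csv module, which supports both. This is only an illustration (the sample strings are invented); the point is that a parser, or regex, tuned for one convention miscounts under the other:

```python
import csv
import io

# Doubled-quote convention (Excel / Server jobs): "" inside a quoted field
doubled = 'Foo,"He said ""hi""",Bar\n'
row1 = next(csv.reader(io.StringIO(doubled)))
print(row1)  # ['Foo', 'He said "hi"', 'Bar']

# Backslash-escape convention (Parallel jobs): \" inside a quoted field
escaped = 'Foo,"He said \\"hi\\"",Bar\n'
row2 = next(csv.reader(io.StringIO(escaped), doublequote=False, escapechar="\\"))
print(row2)  # same three fields, different quoting rules on the wire
```

Feed the backslash-escaped line to the default (doubled-quote) reader and the field boundaries shift, which is exactly why the quote-toggle counting approach above needs to know which convention the sender uses.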