Determining if sequential file has correct number of columns

A forum for discussing DataStage<sup>®</sup> basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

Post Reply
sbass1
Premium Member
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Determining if sequential file has correct number of columns

Post by sbass1 »

Hi,

Say I have a CSV file like so:

Foo,Bar,"Unit 1, 234 Main St, Sydney",Blah

With the correct sequential file link format parameters, this will become:

Col1=Foo
Col2=Bar
Col3=Unit 1, 234 Main St, Sydney
Col4=Blah

I have a requirement to ensure the input file has the correct number of columns.

My validation code reads the entire line as one long string and counts commas, like so:

ColumnCount = Dcount(Extract.DataLine,",") (ColumnCount is a stage var)
BadRow = ColumnCount ne $SoureFileColumnCount (Job parameter == 72)

You see my quandry above - I reject rows where commas are embedded within the quote characters.

Now, I may be able to get the source system to create the file with another delimiter, say tilde, where I'm guaranteed that it will ONLY be EVER used as a delimiter only. If so, my functional problem is solved.

However, I still have these questions:

1) Is the above a good approach (counting delimiters in the incoming file?)

2) I note with great sadness that regular expressions are not supported (at least OOTB) in DS 7.x. I have read for example viewtopic.php?t=107882.

If regex's were supported, there are some complex regex's that will return "all hits for char X except when between quotes". It's been a while since I've coded one of these, but Google is my friend...

So, is there the functional equivalent in DS of "return a count of char X unless it is between quotes" without the use of reqular expressions?

Thanks,
Scott
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Resist stupid requrements!

Let the metadata do it for you.
If you must, define a fifth field (VarChar). If it is populated, then the line has too many fields.
Too few is a harder problem if field #4 can legitimately be empty.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It is easy enough to write such a function. Probably many have done so.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sbass1
Premium Member
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

1. Yes, things are always easy to do when one knows how to do it. Would one of the "many who have do so" care to chime in with the details?

2. In the previous post I supplied the link for, it was for parallel jobs; does that post also apply to 7.x server jobs?

3. If it is so easy to do so, I wonder why the fleet of programmers at IBM haven't implemented it? It's certainly available in other ETL products I've used.

4. So are regex's supported in 8.x?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

1. That's the kind of thing I do for a living. See this page for example.

2. It's rarely a safe assumption that any answer relating to parallel jobs applies to server jobs.

3. Where's the business case? IBM don't do anything that won't turn a dollar.

4. No more than in earlier versions. Even the post to which you linked earlier included the statement that the full functionality of regular expressions is not supported out of the box.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Pre-validation of delimited flat files is not something that I've actually done. You are already well aware of the perils associated with delimited text files.

Here is the what I might typically do:
1) Ask for a fixed-width flat file.
2) Ask for a delimited file with a delimiter that is guaranteed by the sender to not be in the data.
3) Define the metadata in the sequential file stage and let the stage complain about too few or too many columns. Allow this to abort the job (I always prefer to set my warning limit at 1).

If none of these are acceptable, then I might create a simple DataStage job to preprocess and validate the file structure (I've never actually gotten as far as this option):
1) Define the sequential file with no field or record delimiters or quote characters.
2) Define a single CHAR(1) column.
3) Read the file one byte at a time counting delimiters. Stop counting when the quote character is read, resume counting when the next quote character is read, start over when a newline character is read.

If everything is parameterized, then this could become a reusable utility job.

Mike
PhilHibbs
Premium Member
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Re: Determining if sequential file has correct number of col

Post by PhilHibbs »

sbass1 wrote:Say I have a CSV file like so:

Foo,Bar,"Unit 1, 234 Main St, Sydney",Blah
...
I have a requirement to ensure the input file has the correct number of columns.
...
If regex's were supported, there are some complex regex's that will return "all hits for char X except when between quotes". It's been a while since I've coded one of these, but Google is my friend...
I would be wary of a regex for this anyway. What convention does the file use for embedded quote characters? If it doubles them up like Excel and DataStage Server Jobs dos, then you're ok, but if it uses " like DataStage Parallel jobs do and Server jobs can be configured to do, then your regex becomes a lot more complicated.
Phil Hibbs | Capgemini
Technical Consultant
Post Reply