Counting number of columns in a CSV file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

BradC
Participant
Posts: 4
Joined: Thu Jan 08, 2004 9:33 pm
Contact:

Counting number of columns in a CSV file

Post by BradC »

Hi,

I'm currently investigating methods of parsing CSV files before processing with a Sequential File stage in DataStage. The goal is to determine if each row in a CSV file has the correct number of columns before attempting to transform/load.

I was hoping to discuss any methods people have used in the past to achieve a similar outcome using DataStage.

I am reviewing a current implementation that uses a UNIX "awk" script to remove rows in a comma delimited file that have too many columns. Unfortunately, the awk script does not take into account rows that contain commas in the data itself - even if the column data is enclosed in double quotes.

While it may be possible to extend the UNIX script to ignore quoted commas, I'd like to consider what options are available in DataStage as an alternative (perhaps using a routine or sequential file stage options).

I'd be keen to hear any experiences from similar situations.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Because of the rigid metadata paradigm of sequential file stages, we've all been bitten by delimited files with embedded delimiters and poorly quoted data. When either DS Server or PX encounters these files it parses them badly, but how can anyone blame it? You can't always guess where a field was supposed to end and be right all of the time.

I've taken the route of building data quality rules for source data into a library that attempts to set boundaries on column values. When reading these files and parsing them into columns, you then have a chance to detect an abnormality in the parsing, because columns within the row have gone outside their documented boundaries. This rule-based cleansing forms one of the corners of an ETL methodology my associates and I like to talk about.

I've found that it's almost impossible to correctly parse a malformed delimited file. I've taken the route of trying to detect the malformed rows and trap them to exception files.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
BradC
Participant
Posts: 4
Joined: Thu Jan 08, 2004 9:33 pm
Contact:

Post by BradC »

Thanks for the reply.

What I am actually trying to achieve is to detect rows with an incorrect number of columns before processing, which, as you suggested, is the route you've taken.

The data in the columns themselves is well formed, i.e. if a column contains a comma, the entire column will be supplied with surrounding quotes. What I want to do is reject any rows that contain, say, 12 columns instead of the expected 11.

Has anyone had experience pre-processing CSV files in DataStage to count the number of columns in each row?
at095nb
Participant
Posts: 9
Joined: Fri Mar 25, 2005 1:43 pm

Post by at095nb »

I have counted the number of columns before, but in a pipe-delimited file, using the Count() function.
You could count how many commas you have in your entire record.
If it's more than you expected, count how many quotes you have.
Every additional comma needs two quotes around it.
ameyvaidya
Charter Member
Posts: 166
Joined: Wed Mar 16, 2005 6:52 am
Location: Mumbai, India

Post by ameyvaidya »

If the file is well formed, and the only issue is figuring out whether a row has too many or too few columns, there is one trick that can be used (Server jobs only):

In the Columns tab of a Sequential File stage in server jobs there is a column titled "Incomplete columns".

Set this value to "Replace" for all rows.

Add one additional row, e.g. if the file has 9 columns, define a 10th row in your Columns tab.

Now, in the next Transformer, if the 10th column is not null the row has too many columns, and if the 9th column is null the row has too few columns.
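
As a rough sketch (the link and column names here, InLink with Col1 through Col10, are made up for illustration), the reject constraints in that Transformer might look like:

Too many columns (data has spilled into the dummy 10th column):
Not(IsNull(InLink.Col10))

Too few columns (the last real column never arrived):
IsNull(InLink.Col9)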

HTH
Amey Vaidya
I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.
- Douglas Adams
BradC
Participant
Posts: 4
Joined: Thu Jan 08, 2004 9:33 pm
Contact:

Post by BradC »

I did look at the Count() function early on, but the issue is with quoted commas, as you say, at095nb.

Adding a dummy column to the Sequential File stage and checking whether it is null sounds like it might be the way to go. Unfortunately, I won't be able to check if there are too few columns using this method, as it is valid for the last column to come through as null.

I guess the only other option in DataStage would be to write a routine that checks the number of columns programmatically.
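
For example, a simple transform routine along these lines (just a sketch; the routine name CountCSVColumns and the argument name RecordLine are made up) could walk the record character by character and count only the commas that fall outside double quotes:

* RecordLine is the whole row, read as a single long VarChar column.
* Returns the number of comma-delimited columns, treating commas
* inside double quotes as data rather than delimiters.
InQuotes = @FALSE
Cols = 1
For i = 1 To Len(RecordLine)
   Ch = RecordLine[i,1]
   If Ch = '"' Then
      InQuotes = Not(InQuotes)
   End Else
      If Ch = "," And Not(InQuotes) Then
         Cols = Cols + 1
      End
   End
Next i
Ans = Cols

A pre-validation job could then read the file as a single VarChar column, call the routine from a Transformer, and send any row where the result isn't 11 down a reject link before the real load job reads the file.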
ameyvaidya
Charter Member
Posts: 166
Joined: Wed Mar 16, 2005 6:52 am
Location: Mumbai, India

Post by ameyvaidya »

Brad,

Just thinking out loud,
1.
If it is valid for the 9th column to contain null, the best you can do is check the 8th column, which should not contain null (or, failing that, the 7th, and so on).
Then even if the 9th column is missing, it will be replaced with null, and the ETL logic you've designed will accept the row as valid (if that is what is needed).

2.
Another thing is that in sequential files true NULL values cannot exist (I think); a placeholder is used to represent NULL. I'm away from DataStage right now, but I believe there is an option to define which placeholder is used for NULL.

Now, if you write the file with # as the null placeholder and read it back with * as the null placeholder, all the # values are automatically non-null. The only nulls left in your file are where columns are missing.

In a transformer downstream, you can then check: if the value is "#" then @NULL, else <in.value>.
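
For instance (assuming the input link is called InLink and the column in question is Col9), the derivation might be something like:

If InLink.Col9 = "#" Then @NULL Else InLink.Col9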

Hope this helps in your design!
Amey Vaidya
I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.
- Douglas Adams
at095nb
Participant
Posts: 9
Joined: Fri Mar 25, 2005 1:43 pm

Post by at095nb »

[quote="at095nb"]I did count the number of columns but in | delimited file using Count() function.
You may count how many commas you have in your entire record.
If it's more then you expected, count how many quotes you have.
For any additional comma you need 2 quotes.[/quote]

I will try to make myself clearer, and I am sure it works.
You can define 2 stage variables:
vcomma = Count(InputRec, ",")
vquote = Count(InputRec, '"')

Use them to define constraints, where n is the expected number of commas, i.e. the number of columns in the input file minus one:
for "good" records: vcomma = n Or (vcomma > n And vquote/(vcomma - n) = 2.0) - every additional comma has its 2 quotes
for "bad" records: vcomma < n Or (vcomma > n And vquote/(vcomma - n) <> 2.0) - fewer commas means a missing field, or an additional comma doesn't have its pair of quotes.

For example:
Your input file has 5 fields. That means a good record should have 4 commas (vcomma = 4), or 5 commas and 2 quotes (vcomma = 5 and vquote/(vcomma - n) = 2/(5 - 4) = 2.0).
BradC
Participant
Posts: 4
Joined: Thu Jan 08, 2004 9:33 pm
Contact:

Post by BradC »

Thanks for the suggestions, everyone. It looks like I've got some options to try out. Both of those last options sound like they could work well in the right situation.

It's likely the actual file data will end up governing the solution I decide on. I'll need to confirm which columns can have null values, whether there can be more than one column containing additional commas, and whether those columns only contain one additional comma each.