Page 1 of 1

Commas in data value...How to fix the existing file ?

Posted: Wed Nov 21, 2007 12:01 am
by kaps
We have a job which extracts from database and writes to a file. Problem is that one of the column had comma as part of the value for some of the records. When written to the file delimter is ,(comma) and quote charecter is 000. so now there is one extra column for some of the records. How can I make these two columns as one column.

I can't re extract from the table as the table is changed. I got to fix this file now.

I can find out the rows which has the extra columns but don't know how to seggragate them from the file. If I can seperate them then I can use concatanate function to merge the two columns.

Please advise...

Thanks

Re: Commas in data value...How to fix ?

Posted: Wed Nov 21, 2007 1:05 am
by sachin1
check for Incomplete column specification like "Retain","Replace","Error".............,
hope that may help.

Posted: Wed Nov 21, 2007 1:07 am
by rleishman
You can get the rows with too many commas as follows:

Code: Select all

sed -e '/\(.*,\)\{25\}/ p' -e 'd' myfile.csv > toomany.csv
Replace the "25" with a number that is 1 more than the "correct" number of commas.

You can get the rows with the right number of commas as follows:

Code: Select all

sed -e '/\(.*,\)\{25\}/ d' myfile.csv > justright.csv
Cat them together after you fix the bad ones.

Re: Commas in data value...How to fix ?

Posted: Wed Nov 21, 2007 5:00 am
by NickH
You could do the seperation in datastage if you wanted. Read the file in with file delimeter 000 quote delimeter 000. Then use the Count($,',')="Correct Number Of Commas" in a constraint and write the results out, with file delimeter 000 quote delimeter 000. You can put the invalid records in a reject file with a slight mod to the constraint or reject row.

Posted: Wed Nov 21, 2007 11:48 am
by kaps
Nick - I am having a problem when I have constraint like count($,',')=21 then it's giving me error as Variable '$' not defined. How can I say take the whole record and count ?

rleishman - I have executed your command. It's still working. I shall post the results.

Thanks for the help...

Posted: Wed Nov 21, 2007 1:46 pm
by jdmiceli
Is the field that has the problem consistently in the same position, or do multiple fields have the issue? If it is always the 17th field, for example, then you could use a little Perl script to find rows that have too many commas and then remove the comma after the beginning of that problem field.

Something like this should work:

Code: Select all


#!/usr/bin/perl -w

my $count = 0;
my $comma_count = 0; # used for the number of commas found in row
my $good_cnt = 0;  # replace this with the total number of commas for a good row
my $bad_comma = 10;  # replace this with the positional count for which #comma is bad above for for each line involved
my $line = '';
my $element = '';
my @array = '';

my $file = "$ARGV[0]";
my $outfile = "fixed_" . "$file";

open (IN, "< $file") || die "Could not open $file for reading!\n";
open (OUT, "> $outfile") || die "Could not open $outfile for writing!\n";

while ($line = <IN>)
{
    chomp $line;    
    $comma_count = tr/\,//c; # counts the number of commas on a line
    
    $count = -1; #perl counts starting at zero, so need to be one less to match array
    if ($comma_count > $good_cnt)
    {
        @array = split (/\,/,$line);
        $line = ''; # can reuse this now that array has the values
        
        foreach $element(@array)
        {
            $count++;
            if ($count == $bad_comma)
            {
                $line = "$line"; # no comma - effectively joins errored field
            } else
            {
                $line = "$line,";
            }
        }
        chop $line; # remove the final comma
    }
    print OUT "$line\n";
}

close IN;
close OUT;

Now, I just kind of did this off the top of my head and haven't really tested it other than for syntax, so test it out. Just run it like this: perl scriptname.pl filename. You also need to identify the number of commas in a good row and replace the zero associated with $good_count with that value. Do the same thing with $bad_comma to tell the program which is the bad position on rows that have more commas than what is in $good_count.

If you decide to use this and need assistance, just ask away!

Hope that helps!