Dynamically processing csv files based on header row
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 7
- Joined: Wed Sep 19, 2007 1:56 am
- Location: County Durham, UK
Dynamically processing csv files based on header row
Can anyone think of a neat solution to this problem?
I've got a csv file with approximately 200 columns, which could potentially change position. The first row defines the column names and I would like to pivot the data so that I end up with one record per column along these lines:
Input file:
Key Col1 Col2 Col3
1 2 3 4
5 6 7 8
Output file:
Key Field Value
1 Col1 2
1 Col2 3
1 Col3 4
5 Col1 6
5 Col2 7
5 Col3 8
I've come up with a few ideas, but I need something quick enough to deal with a few million rows that can also handle the column names changing.
Thanks,
Dave
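Stepping outside DataStage for a moment, the transformation described above is a classic unpivot (wide-to-long) operation. As a minimal sketch in Python using only the standard csv module (file names and the assumption that the key is the first column are illustrative):

```python
import csv

def unpivot(src_path, dst_path, delimiter=","):
    """Read a wide CSV whose first column is the key and write one
    Key/Field/Value row per data cell, taking field names from the header."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)
        header = next(reader)               # e.g. ["Key", "Col1", "Col2", "Col3"]
        writer.writerow(["Key", "Field", "Value"])
        for row in reader:
            key, values = row[0], row[1:]
            for field, value in zip(header[1:], values):
                writer.writerow([key, field, value])
```

Because it streams one row at a time and takes column names from the header at run time, memory stays flat for millions of rows and reordered or renamed columns need no code change.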
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Re: Dynamically processing csv files based on header row
DatastageSolutions wrote:Can anyone think of a neat solution to this problem?
No, at least not in a server job.
In the parallel environment you might be able to do something with a set of parallel jobs - push the first row out into a schema file then use that in a subsequent job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
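The schema-file suggestion above can be prototyped outside DataStage as well. As a rough illustration only (the exact Orchestrate schema grammar depends on your version and real data types, and the all-strings typing here is an assumption), a small script could derive a schema file from the header row for a subsequent job to use:

```python
def write_schema(csv_path, schema_path, delimiter=","):
    """Derive a simple Orchestrate-style schema file from a CSV header row.
    Every field is declared as a bounded string; adjust types as needed."""
    with open(csv_path) as f:
        header = f.readline().rstrip("\n").split(delimiter)
    with open(schema_path, "w") as out:
        out.write("record\n(\n")
        for name in header:
            # hypothetical one-size-fits-all type; refine per column in practice
            out.write(f"  {name}: string[max=255];\n")
        out.write(")\n")
```

The first job would run something like this against the landed file, and the second job would read the data with the generated schema instead of a hard-coded table definition.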
Hi Dave,
Stepping outside of DataStage, you could use a Perl script to do this for you. If it were programmed correctly, there would be no issue with changing columns AS LONG AS the first field is always the KEY. The Perl script would then just create a new file for you with the pivoted values. It could also be set to back up the original file to a new name and modify the data in place.
Let me know if you would have questions on this.
Bestest!
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
-
- Participant
- Posts: 7
- Joined: Wed Sep 19, 2007 1:56 am
- Location: County Durham, UK
DatastageSolutions wrote:In the end I knocked up a basic routine which does about 100,000 rows a minute. Good enough for me!
I would love to see your solution if you agree to share it.
I am writing some basic code to do some preprocessing of CSV files: validating the number of columns, eliminating the extraneous delimiters you get when a copy-paste is done, checking date formats and transforming them to a standard, and so on.
With your solution, I could get everything combined into one nice routine for csv file preprocessing.
gateleys
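The preprocessing described above can be sketched in a few lines. This Python outline is illustrative only: the expected date format, the date-column indices, and the trailing-delimiter rule are all assumptions to be adapted to the real files.

```python
from datetime import datetime

def preprocess(rows, expected_cols, date_cols=(), in_fmt="%d/%m/%Y", out_fmt="%Y-%m-%d"):
    """Yield cleaned rows: drop trailing empty fields left by stray
    delimiters, reject rows with the wrong column count, and rewrite
    the given date columns into a standard format."""
    for lineno, row in enumerate(rows, start=1):
        row = list(row)
        while row and row[-1] == "":      # extraneous trailing delimiters
            row.pop()
        if len(row) != expected_cols:
            raise ValueError(f"line {lineno}: expected {expected_cols} columns, got {len(row)}")
        for i in date_cols:               # normalise date columns
            row[i] = datetime.strptime(row[i], in_fmt).strftime(out_fmt)
        yield row
```

Written as a generator, it slots in front of any row-by-row consumer (such as an unpivot step) without holding the whole file in memory.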
How many rows are you working with? 100,000 per minute is pretty slow especially if you are working with a significant data set. I'll try to put something together for you in Perl real quick and see if I can't improve on the speed. Any chance you could send me the file winzipped? If so, send to jmiceli@wrberkley.com.
Bestest,
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
Potential solution
Hi Dave (and anyone else that may be interested),
I have a potential solution to this problem for you to try. You are welcome to use it as you see fit and modify it to suit your needs. If you don't have Perl installed, go to ActiveState and get it; it is standard on any Unix/Linux install. I wrote it to be as simple and portable as possible. Other Perl experts may be able to do it in fewer lines, but I rarely have any issues with the code running.
All you have to do is put the program in the directory where your file is (or vice versa) and pass it the name of the source file, the name of the file you want created to hold the pivoted data, and the delimiter used in the file. When specifying the delimiter, put it in double quotes and use standard pattern-matching notation ("\t" = tab, "\n" = newline, etc.).
I created a test file with ten values per key and 15400 rows in the file. This script processed it in about three seconds. Hopefully it will help you!
Bestest,
Code: Select all
#!/usr/local/bin/perl
##########################################################################
# program name: Pivot.pl
# Author: John Miceli
# Date: 10/05/2007
#
# Purpose: To take a flat file with header row and pivot the values
# based on the first value in a line being the key to the row.
#
# Usage: perl Pivot.pl sourcefile targetfile delimiter
# ie: perl Pivot.pl route.txt p_route.txt "\t" (specifies tab)
##########################################################################
use strict;
use warnings;
my ($inputfile, $outputfile, $delimiter, $cnt, $line, $elementcnt);
my ($element, $next, $key, $field, $value);
my (@array, @processing, @headers, @hproc);
my (%positions);
#initialize stuff
$cnt = 0;
$elementcnt = 0;
$next = 0;
$inputfile = "$ARGV[0]";
$outputfile = "$ARGV[1]";
$delimiter = "$ARGV[2]";
open(IN, "< $inputfile") || die "Could not open file $inputfile for reading.\n";
open(OUT, "> $outputfile") || die "Could not open file $outputfile for writing.\n";
# print the headers to the output file using tab delimiter
print OUT "KEY\tFIELD\tVALUE\n";
while ($line = <IN>)
{
chomp $line;
$cnt++;
if ($cnt == 1)
{
@headers = split(/$delimiter/, $line);
$elementcnt = (scalar(@headers)-1); # get count of headers in file
foreach $element(@headers)
{
if ($element =~ /KEY/)   # assumes the key column's header contains "KEY"
{
next;
}
$positions{$next} = $element;
$next++;
}
# (split never returns the delimiter itself, so no extra cleanup pass is needed)
next;
}
@array = split(/$delimiter/, $line);
# split has already removed the delimiters, so the fields can be used directly
@processing = @array;
$key = shift(@processing);
# now start building the pivot
for (my $i=0; $i<= ($elementcnt-1); $i++)
{
$field = $positions{$i};
$value = shift(@processing);
print OUT "$key\t$field\t$value\n";
}
}
close IN;
close OUT;
Bestest!
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
-
- Participant
- Posts: 7
- Joined: Wed Sep 19, 2007 1:56 am
- Location: County Durham, UK
Thanks for that bit of Perl, John. I can't really send you the live data as it's financial information, but I'll definitely give it a go :D
For gateleys and anyone else who's interested, here's a chopped down example of the Basic version...
Enjoy!
Code: Select all
JobName="ConvertCSV"
! Open infile and outfile
OPENSEQ "infile" TO InputFile Then
Call DSExecute ("Unix", "rm outfile", CommandOutput, ReturnCode)
OPENSEQ "outfile" TO OutputFile Else CREATE OutputFile Else Call DSLogFatal("Could not create file: outfile", JobName)
! Read header line and process it
Readseq line from InputFile Else Call DSLogFatal("Could not read header", JobName)
Header = Convert(",", @FM, line)
LineLength = Dcount(Header, @FM)
Done = @FALSE
Counter = 1
Loop
Readseq line from InputFile Else Done = @TRUE
If Done Then Exit ;* without this check the last line would be processed twice
! This converts K1 K2 K3 A B C D
! to
! K1 K2 K3 A
! K1 K2 K3 B
! K1 K2 K3 C
! K1 K2 K3 D
!
Item1 = line[",",1,1]
Item2 = line[",",2,1]
Item3 = line[",",3,1]
For Item = 4 to LineLength
Value=line[",",Item,1]
Writeseq Item1:",":Item2:",":Item3:",":Value to OutputFile Else Call DSLogFatal("Error writing record", JobName)
Next
If Mod (Counter, 100000) = 0
Then
Call DSLogInfo ("Processed " : Counter, JobName)
End
Counter += 1
Until Done Repeat
End Else Call DSLogFatal("Could not open file infile", JobName)
Please let me know if it worked or if there is something I can do to improve the script. I am adding this to my directory of utilities, and I have also added it to the open topic I have for Perl scripts available for anyone to use.
I figure maybe having a bunch in one place will help others as much as they have helped me! :lol:
Bestest!
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"