Dynamically processing CSV files based on header row

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

DatastageSolutions
Participant
Posts: 7
Joined: Wed Sep 19, 2007 1:56 am
Location: County Durham, UK

Dynamically processing CSV files based on header row

Post by DatastageSolutions »

Can anyone think of a neat solution to this problem?

I've got a csv file with approximately 200 columns, which could potentially change position. The first row defines the column names and I would like to pivot the data so that I end up with one record per column along these lines:

Input file:
Key Col1 Col2 Col3
1 2 3 4
5 6 7 8

Output file:
Key Field Value
1 Col1 2
1 Col2 3
1 Col3 4
5 Col1 6
5 Col2 7
5 Col3 8

I've come up with a few ideas, but I need something which is quick enough to deal with a few million rows and can handle the column names changing.

Thanks,
Dave
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Re: Dynamically processing CSV files based on header row

Post by ray.wurlod »

DatastageSolutions wrote: Can anyone think of a neat solution to this problem?
No, at least not in a server job.

In the parallel environment you might be able to do something with a set of parallel jobs - push the first row out into a schema file then use that in a subsequent job.
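
For the sample in the original post, the schema file pushed out by that first job might look something like this (a sketch only - the column names and types here are assumptions, and the real file would be generated from whatever the header row actually contains):

Code:

record {final_delim=end, delim=',', quote=none}
(
    Key: int32;
    Col1: int32;
    Col2: int32;
    Col3: int32;
)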
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Hi Dave,

Stepping outside of DataStage, you could use a Perl script to do this for you. If it is programmed correctly, there will be no issue with the column layout changing AS LONG AS the first field is always the KEY. The Perl script would then just create a new file for you with the pivoted values. It could also be set to back up the original file to a new name and modify the data in place.

Let me know if you have any questions on this.

Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
DatastageSolutions
Participant
Posts: 7
Joined: Wed Sep 19, 2007 1:56 am
Location: County Durham, UK

Post by DatastageSolutions »

Cheers for the ideas. Not being a Perl expert, I tried using Unix scripting to convert the file. It worked fine but just took too long to run. In the end I knocked up a BASIC routine which does about 100,000 rows a minute. Good enough for me!

Thanks again,
Dave
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

DatastageSolutions wrote: In the end I knocked up a BASIC routine which does about 100,000 rows a minute. Good enough for me!
I would love to see your solution if you agree to share it.

I am writing some BASIC code to do some preprocessing of CSV files: things like validating the number of columns, eliminating the extraneous delimiters you get when a copy-paste is done, checking date formats and transforming them to a standard, and so on.

With your solution, I could combine everything into one nice routine for CSV file preprocessing.
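
For the column-count check, something along these lines is what I have in mind - a rough sketch only, with the "infile" name, the "CheckCSV" job name, and the comma delimiter all placeholder assumptions:

Code:

      JobName = "CheckCSV"
! Sketch only: log any row whose field count differs from the header row
      OpenSeq "infile" To InputFile Then
         ReadSeq line From InputFile Else Call DSLogFatal("Empty file", JobName)
         ExpectedCols = Dcount(line, ",")
         RowNum = 1
         Loop
            ReadSeq line From InputFile Else Exit
            RowNum += 1
            If Dcount(line, ",") <> ExpectedCols Then
               Call DSLogWarn("Row " : RowNum : " has " : Dcount(line, ",") : " fields, expected " : ExpectedCols, JobName)
            End
         Repeat
         CloseSeq InputFile
      End Else Call DSLogFatal("Could not open file infile", JobName)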
gateleys
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

How many rows are you working with? 100,000 per minute is pretty slow, especially if you are working with a significant data set. I'll try to put something together for you in Perl real quick and see if I can't improve on the speed. Any chance you could send me the file winzipped? If so, send to jmiceli@wrberkley.com.

Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Potential solution

Post by jdmiceli »

Hi Dave (and anyone else that may be interested),

I have a potential solution to this problem for you to try. You are welcome to use it as you see fit and modify it to suit your needs. If you don't have Perl installed, go to ActiveState and get it; Perl comes as standard on any Unix/Linux install. I kept the script very simple so it should work anywhere. Other Perl experts may be able to do it in fewer lines, but I rarely have any issues with the code running.

All you have to do is put the program in the directory where your file is (or vice versa) and pass it three arguments: the name of the source file, the name of the file to be created to hold the pivoted data, and the delimiter used in the file. When specifying the delimiter, put it in double quotes and use standard pattern-matching notation ("\t" = tab, "\n" = newline, etc.).

I created a test file with ten values per key and 15,400 rows. This script processed it in about three seconds. Hopefully it will help you!

Code:


#!/usr/local/bin/perl

##########################################################################
#  program name: Pivot.pl
#  Author: John Miceli
#  Date: 10/05/2007
#
#  Purpose: To take a flat file with header row and pivot the values
#           based on the first value in a line being the key to the row.
#
#  Usage: perl Pivot.pl sourcefile targetfile delimiter
#       ie: perl Pivot.pl route.txt p_route.txt "\t" (specifies tab)
##########################################################################

use strict;
use warnings;

my ($inputfile, $outputfile, $delimiter, $cnt, $line, $elementcnt);
my ($element, $next, $key, $field, $value);
my (@processing, @headers);
my (%positions);

#initialize stuff
$cnt = 0;
$elementcnt = 0;
$next = 0;

$inputfile = "$ARGV[0]";
$outputfile = "$ARGV[1]";
$delimiter = "$ARGV[2]";

open(IN, "< $inputfile") || die "Could not open file $inputfile for reading.\n";
open(OUT, "> $outputfile") || die "Could not open file $outputfile for writing.\n";

# print the headers to the output file using tab delimiter
print OUT "KEY\tFIELD\tVALUE\n";

while ($line = <IN>)
{
    chomp $line;
    $cnt++;

    if ($cnt == 1)
    {
        @headers = split(/$delimiter/, $line);
        $elementcnt = (scalar(@headers) - 1);  # number of non-key columns

        # map column position => column name, skipping the key column
        foreach $element (@headers)
        {
            if ($element =~ /KEY/i)  # case-insensitive, so a header named "Key" is skipped too
            {
                next;
            }
            $positions{$next} = $element;
            $next++;
        }
        next;
    }

    # split already discards the delimiter, so the fields can be used as-is
    @processing = split(/$delimiter/, $line);
    $key = shift(@processing);

    # now start building the pivot
    for (my $i=0; $i<= ($elementcnt-1); $i++)
    {
        $field = $positions{$i};
        $value = shift(@processing);
        print OUT "$key\t$field\t$value\n";
    }

}

close IN;
close OUT;
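
For the sample file from the first post, a run and its tab-delimited output would look like this (the file names here are just for illustration):

Code:

perl Pivot.pl sample.csv pivoted.txt ","

Contents of pivoted.txt:

KEY     FIELD   VALUE
1       Col1    2
1       Col2    3
1       Col3    4
5       Col1    6
5       Col2    7
5       Col3    8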

Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
DatastageSolutions
Participant
Posts: 7
Joined: Wed Sep 19, 2007 1:56 am
Location: County Durham, UK

Post by DatastageSolutions »

Thanks for that bit of Perl, John. I can't really send you the live data as it's financial information, but I'll definitely give it a go :D

For gateleys and anyone else who's interested, here's a chopped-down example of the BASIC version...

Enjoy!

Code:

      JobName="ConvertCSV"

! Open infile and outfile
      OPENSEQ "infile" TO InputFile Then
! remove any previous output file; OpenSeq alone would not truncate it
         Call DSExecute ("Unix", "rm outfile", CommandOutput, ReturnCode)
         OPENSEQ "outfile" TO OutputFile Else CREATE OutputFile Else Call DSLogFatal("Could not create file: outfile", JobName)

! Read header line and process it
         Readseq line from InputFile Else Call DSLogFatal("Could not read header", JobName)
         Header = Convert(",", @FM, line)
         LineLength = Dcount(Header, @FM)

         Done = @FALSE
         Counter = 1
         Loop
            Readseq line from InputFile Else Done = @TRUE
         Until Done Do

! This converts K1 K2 K3 A B C D
! to
! K1 K2 K3 A
! K1 K2 K3 B
! K1 K2 K3 C
! K1 K2 K3 D
!
            Item1 = line[",",1,1]
            Item2 = line[",",2,1]
            Item3 = line[",",3,1]

            For Item = 4 to LineLength
               Value=line[",",Item,1]
               Writeseq Item1:",":Item2:",":Item3:",":Value to OutputFile Else Call DSLogFatal("Error writing record", JobName)
            Next
            If Mod (Counter, 100000) = 0 Then
               Call DSLogInfo ("Processed " : Counter, JobName)
            End
            Counter += 1
         Repeat
         CloseSeq InputFile
         CloseSeq OutputFile
      End Else Call DSLogFatal("Could not open file infile", JobName)
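
One more note on the chopped-down version: it writes the key columns and the value, but not the field name. If you want the Key/Field/Value layout from my first post, the header is already held as a dynamic array, so the write just needs one more piece, something like:

Code:

               Writeseq Item1:",":Header<Item>:",":Value to OutputFile Else Call DSLogFatal("Error writing record", JobName)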
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Please let me know if it worked or if there is something I can do to improve the script. I am adding this to my directory of utilities, and to the open topic where I make my Perl scripts available for anyone to use.

I figure maybe having a bunch in one place will help others as much as they have helped me! :lol:

Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"