Dynamically processing csv files based on header row
Moderators: chulett, rschirm, roy
-
- Participant
- Posts: 7
- Joined: Wed Sep 19, 2007 1:56 am
- Location: County Durham, UK
Dynamically processing csv files based on header row
Can anyone think of a neat solution to this problem?
I've got a csv file with approximately 200 columns, which could potentially change position. The first row defines the column names and I would like to pivot the data so that I end up with one record per column along these lines:
Input file:
Key Col1 Col2 Col3
1 2 3 4
5 6 7 8
Output file:
Key Field Value
1 Col1 2
1 Col2 3
1 Col3 4
5 Col1 6
5 Col2 7
5 Col3 8
I've come up with a few ideas, but I need something quick enough to deal with a few million rows that can also handle the column names changing.
Thanks,
Dave
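Stepping outside DataStage for a moment, the transformation described above is a classic unpivot (wide-to-long) operation. As a minimal sketch in Python using only the standard csv module (file names and the assumption that the key is the first column are illustrative):

```python
import csv

def unpivot(src_path, dst_path, delimiter=","):
    """Read a wide CSV whose first column is the key and write one
    Key/Field/Value row per data cell, taking field names from the header."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)
        header = next(reader)               # e.g. ["Key", "Col1", "Col2", "Col3"]
        writer.writerow(["Key", "Field", "Value"])
        for row in reader:
            key, values = row[0], row[1:]
            for field, value in zip(header[1:], values):
                writer.writerow([key, field, value])
```

Because it streams one row at a time and takes column names from the header at run time, memory stays flat for millions of rows and reordered or renamed columns need no code change.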
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Re: Dynamically processing csv files based on header row
DatastageSolutions wrote:Can anyone think of a neat solution to this problem?
No, at least not in a server job.
In the parallel environment you might be able to do something with a set of parallel jobs - push the first row out into a schema file then use that in a subsequent job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
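The schema-file suggestion above can be prototyped outside DataStage as well. As a rough illustration only (the exact Orchestrate schema grammar depends on your version and real data types, and the all-strings typing here is an assumption), a small script could derive a schema file from the header row for a subsequent job to use:

```python
def write_schema(csv_path, schema_path, delimiter=","):
    """Derive a simple Orchestrate-style schema file from a CSV header row.
    Every field is declared as a bounded string; adjust types as needed."""
    with open(csv_path) as f:
        header = f.readline().rstrip("\n").split(delimiter)
    with open(schema_path, "w") as out:
        out.write("record\n(\n")
        for name in header:
            # hypothetical one-size-fits-all type; refine per column in practice
            out.write(f"  {name}: string[max=255];\n")
        out.write(")\n")
```

The first job would run something like this against the landed file, and the second job would read the data with the generated schema instead of a hard-coded table definition.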
Hi Dave,
Stepping outside of DataStage, you could use a Perl script to do this for you. If it were programmed correctly, there would be no issue with changing columns AS LONG AS the first field is always the KEY. The Perl script would then just create a new file for you with the pivoted values. It could also be set to back up the original file to a new name and modify the data in place.
Let me know if you would have questions on this.
Bestest!
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
-
- Participant
- Posts: 7
- Joined: Wed Sep 19, 2007 1:56 am
- Location: County Durham, UK
DatastageSolutions wrote:In the end I knocked up a basic routine which does about 100,000 rows a minute. Good enough for me!
I would love to see your solution if you agree to share it.
I am writing some basic code to do some preprocessing of CSV files: validating the number of columns, eliminating the extraneous delimiters you get when a copy-paste is done, checking date formats and transforming them to a standard, and so on.
With your solution, I could get everything combined into one nice routine for csv file preprocessing.
gateleys
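The preprocessing described above can be sketched in a few lines. This Python outline is illustrative only: the expected date format, the date-column indices, and the trailing-delimiter rule are all assumptions to be adapted to the real files.

```python
from datetime import datetime

def preprocess(rows, expected_cols, date_cols=(), in_fmt="%d/%m/%Y", out_fmt="%Y-%m-%d"):
    """Yield cleaned rows: drop trailing empty fields left by stray
    delimiters, reject rows with the wrong column count, and rewrite
    the given date columns into a standard format."""
    for lineno, row in enumerate(rows, start=1):
        row = list(row)
        while row and row[-1] == "":      # extraneous trailing delimiters
            row.pop()
        if len(row) != expected_cols:
            raise ValueError(f"line {lineno}: expected {expected_cols} columns, got {len(row)}")
        for i in date_cols:               # normalise date columns
            row[i] = datetime.strptime(row[i], in_fmt).strftime(out_fmt)
        yield row
```

Written as a generator, it slots in front of any row-by-row consumer (such as an unpivot step) without holding the whole file in memory.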
How many rows are you working with? 100,000 per minute is pretty slow especially if you are working with a significant data set. I'll try to put something together for you in Perl real quick and see if I can't improve on the speed. Any chance you could send me the file winzipped? If so, send to jmiceli@wrberkley.com.
Bestest,
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
Potential solution
Hi Dave (and anyone else that may be interested),
I have a potential solution to this problem for you to try. You are welcome to use it as you see fit and modify it to suit your needs. If you don't have Perl installed, go to ActiveState and get it; it is standard on any Unix/Linux install. I wrote it to be as simple and portable as possible. Other Perl experts may be able to do it in fewer lines, but I rarely have any issues with the code running.
All you have to do is put the program in the directory where your file is (or vice versa) and pass it the name of the source file, the name of the file you want created to hold the pivoted data, and the delimiter used in the file. When specifying the delimiter, put it in double quotes and use standard pattern-matching notation ("\t" = tab, "\n" = newline, etc.).
I created a test file with ten values per key and 15400 rows in the file. This script processed it in about three seconds. Hopefully it will help you!
Bestest,
Code: Select all
#!/usr/local/bin/perl
##########################################################################
# program name: Pivot.pl
# Author: John Miceli
# Date: 10/05/2007
#
# Purpose: To take a flat file with header row and pivot the values
# based on the first value in a line being the key to the row.
#
# Usage: perl Pivot.pl sourcefile targetfile delimiter
# ie: perl Pivot.pl route.txt p_route.txt "\t" (specifies tab)
##########################################################################
use strict;
use warnings;
my ($inputfile, $outputfile, $delimiter, $cnt, $line, $elementcnt);
my ($element, $next, $key, $field, $value);
my (@array, @processing, @headers, @hproc);
my (%positions);
#initialize stuff
$cnt = 0;
$elementcnt = 0;
$next = 0;
$inputfile = "$ARGV[0]";
$outputfile = "$ARGV[1]";
$delimiter = "$ARGV[2]";
open(IN, "< $inputfile") || die "Could not open file $inputfile for reading.\n";
open(OUT, "> $outputfile") || die "Could not open file $outputfile for writing.\n";
# print the headers to the output file using tab delimiter
print OUT "KEY\tFIELD\tVALUE\n";
while ($line = <IN>)
{
chomp $line;
$cnt++;
if ($cnt == 1)
{
@headers = split(/$delimiter/, $line);
$elementcnt = (scalar(@headers)-1); # get count of headers in file
foreach $element(@headers)
{
if ($element =~ /KEY/)   # assumes the key column's header contains "KEY"
{
next;
}
$positions{$next} = $element;
$next++;
}
# (split never returns the delimiter itself, so no extra cleanup pass is needed)
next;
}
@array = split(/$delimiter/, $line);
# split has already removed the delimiters, so the fields can be used directly
@processing = @array;
$key = shift(@processing);
# now start building the pivot
for (my $i=0; $i<= ($elementcnt-1); $i++)
{
$field = $positions{$i};
$value = shift(@processing);
print OUT "$key\t$field\t$value\n";
}
}
close IN;
close OUT;
Bestest!
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
-
- Participant
- Posts: 7
- Joined: Wed Sep 19, 2007 1:56 am
- Location: County Durham, UK
Thanks for that bit of Perl, John. I can't really send you the live data as it's financial information, but I'll definitely give it a go :D
For gateleys and anyone else who's interested, here's a chopped down example of the Basic version...
Enjoy!
Code: Select all
JobName="ConvertCSV"
! Open infile and outfile
OPENSEQ "infile" TO InputFile Then
Call DSExecute ("Unix", "rm outfile", CommandOutput, ReturnCode)
OPENSEQ "outfile" TO OutputFile Else CREATE OutputFile Else Call DSLogFatal("Could not create file: outfile", JobName)
! Read header line and process it
Readseq line from InputFile Else Call DSLogFatal("Could not read header", JobName)
Header = Convert(",", @FM, line)
LineLength = Dcount(Header, @FM)
Done = @FALSE
Counter = 1
Loop
Readseq line from InputFile Else Done = @TRUE
If Done Then Exit ;* without this check the last line would be processed twice
! This converts K1 K2 K3 A B C D
! to
! K1 K2 K3 A
! K1 K2 K3 B
! K1 K2 K3 C
! K1 K2 K3 D
!
Item1 = line[",",1,1]
Item2 = line[",",2,1]
Item3 = line[",",3,1]
For Item = 4 to LineLength
Value=line[",",Item,1]
Writeseq Item1:",":Item2:",":Item3:",":Value to OutputFile Else Call DSLogFatal("Error writing record", JobName)
Next
If Mod (Counter, 100000) = 0
Then
Call DSLogInfo ("Processed " : Counter, JobName)
End
Counter += 1
Until Done Repeat
End Else Call DSLogFatal("Could not open file infile", JobName)
Please let me know if it worked or if there is something I can do to improve the script. I am adding this to my directory of utilities, and I have also added it to the open topic I have for Perl scripts available for anyone to use.
I figure maybe having a bunch in one place will help others as much as they have helped me! :lol:
Bestest!
John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services
"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"